
Message boards : Graphics cards (GPUs) : Work units failing in 64-bit Linux

Author Message
Greg
Joined: 20 Nov 08
Posts: 3
Credit: 2,670,125
RAC: 0
Message 12638 - Posted: 22 Sep 2009 | 23:38:44 UTC
Last modified: 22 Sep 2009 | 23:41:30 UTC

Work units are failing on 64-bit Linux. The machine is a research box that has proven very stable, but it is idle at the moment, so I thought I'd take advantage of the idle time.

See, for example,
http://www.gpugrid.net/result.php?resultid=1291252

The driver version is cudadriver_2.3_linux_64_190.16

Info about the setup:

CUDA Device Query (Runtime API) version (CUDART static linking)
There are 4 devices supporting CUDA

Device 0: "Tesla C1060"
CUDA Driver Version: 2.30
CUDA Runtime Version: 2.30
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 3
Total amount of global memory: 4294705152 bytes
Number of multiprocessors: 30
Number of cores: 240
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.30 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)

Devices 1, 2, and 3 are also Tesla C1060 cards and report properties identical to Device 0.

Test PASSED

GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 14 Mar 07
Posts: 1947
Credit: 629,356
RAC: 0
Message 12668 - Posted: 23 Sep 2009 | 7:33:57 UTC - in response to Message 12638.

Does it fail every time, or just sometimes?

The type of error you are getting is usually caused by GPUs running too hot. Since your GPUs are Teslas, I am quite surprised.

gdf

Greg
Joined: 20 Nov 08
Posts: 3
Credit: 2,670,125
RAC: 0
Message 12696 - Posted: 24 Sep 2009 | 0:55:04 UTC - in response to Message 12668.
Last modified: 24 Sep 2009 | 1:01:40 UTC

The first eight work units failed, so I suspended the project at that point. It's definitely not overheating: the machine is housed in a 1U rack in a frigid, air-conditioned room. I wonder whether it has something to do with

# Total amount of global memory: -262144 bytes

Greg
Joined: 20 Nov 08
Posts: 3
Credit: 2,670,125
RAC: 0
Message 12697 - Posted: 24 Sep 2009 | 1:24:13 UTC - in response to Message 12696.

Found the problem: the Einstein@Home beta app was the culprit. It had left something behind that GPUGRID didn't like, and unloading and reloading the kernel module fixed it.
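For anyone hitting the same thing, the reload was along these lines (a sketch, not the exact commands: it assumes the standard `nvidia` kernel module, requires root, and everything holding the GPU open has to be stopped first; the BOINC init script path will vary by distribution):

```shell
# Stop anything using the GPU (BOINC, X, etc.) so the module can unload.
sudo /etc/init.d/boinc-client stop

# Unload and reload the NVIDIA kernel module to clear the stale state.
sudo rmmod nvidia
sudo modprobe nvidia

sudo /etc/init.d/boinc-client start
```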


