Message boards : Number crunching : Failed again

[B^S] HenryHunter
Joined: 27 Dec 08
Posts: 4
Credit: 6,055,773
RAC: 0
Message 13555 - Posted: 15 Nov 2009 | 0:56:14 UTC

<core_client_version>6.6.38</core_client_version>
<![CDATA[
<message>
Unzulässige Funktion. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# Using CUDA device 0
# There is 1 device supporting CUDA
# Device 0: "GeForce 9800 GT"
# Clock rate: 1.65 GHz
# Total amount of global memory: 1073741824 bytes
# Number of multiprocessors: 14
# Number of cores: 112
MDIO ERROR: cannot open file "restart.coor"
# Using CUDA device 0
# There is 1 device supporting CUDA
# Device 0: "GeForce 9800 GT"
# Clock rate: 1.65 GHz
# Total amount of global memory: 1073741824 bytes
# Number of multiprocessors: 14
# Number of cores: 112
Cuda error: Kernel [pme_fill_charges_accumulate] failed in file 'fillcharges.cu' in line 73 : unknown error.

</stderr_txt>
]]>

[B^S] HenryHunter
Joined: 27 Dec 08
Posts: 4
Credit: 6,055,773
RAC: 0
Message 13558 - Posted: 15 Nov 2009 | 9:09:01 UTC - in response to Message 13555.

Another failure:
<core_client_version>6.6.38</core_client_version>
<![CDATA[
<message>
Unzulässige Funktion. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# Using CUDA device 0
# There is 1 device supporting CUDA
# Device 0: "GeForce 9800 GT"
# Clock rate: 1.65 GHz
# Total amount of global memory: 1073741824 bytes
# Number of multiprocessors: 14
# Number of cores: 112
MDIO ERROR: cannot open file "restart.coor"
# Using CUDA device 0
# There is 1 device supporting CUDA
# Device 0: "GeForce 9800 GT"
# Clock rate: 1.65 GHz
# Total amount of global memory: 1073741824 bytes
# Number of multiprocessors: 14
# Number of cores: 112
Cuda error: Kernel [pme_fill_charges_accumulate] failed in file 'fillcharges.cu' in line 73 : unknown error.

</stderr_txt>
]]>
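
For anyone comparing several of these reports, here is a minimal Python sketch that pulls the device names, the MDIO error and the failing CUDA kernel out of a stderr block. The regular expressions are written against the exact log lines quoted in this thread and are an assumption about the format, not a documented one; `summarize_stderr` is a hypothetical helper name.

```python
import re

# Matches the "Cuda error" lines as they appear in the logs quoted in this thread.
# The format is inferred from those lines only (assumption).
CUDA_ERROR_RE = re.compile(
    r"Cuda error: Kernel \[(?P<kernel>[^\]]+)\] failed in file "
    r"'(?P<file>[^']+)' in line (?P<line>\d+) : (?P<reason>.+?)\.?\s*$"
)
DEVICE_RE = re.compile(r'# Device (\d+): "(.+)"')


def summarize_stderr(stderr_text):
    """Collect devices, CUDA kernel failures and MDIO errors from a stderr block."""
    summary = {"devices": [], "cuda_errors": [], "mdio_errors": []}
    for raw in stderr_text.splitlines():
        line = raw.strip()
        m = DEVICE_RE.match(line)
        if m:
            device = (int(m.group(1)), m.group(2))
            # The device banner is printed again after a restart; keep one copy.
            if device not in summary["devices"]:
                summary["devices"].append(device)
            continue
        m = CUDA_ERROR_RE.match(line)
        if m:
            summary["cuda_errors"].append({
                "kernel": m.group("kernel"),
                "file": m.group("file"),
                "line": int(m.group("line")),
                "reason": m.group("reason"),
            })
            continue
        if line.startswith("MDIO ERROR:"):
            summary["mdio_errors"].append(line[len("MDIO ERROR:"):].strip())
    return summary


if __name__ == "__main__":
    sample = """# Using CUDA device 0
# Device 0: "GeForce 9800 GT"
MDIO ERROR: cannot open file "restart.coor"
# Device 0: "GeForce 9800 GT"
Cuda error: Kernel [pme_fill_charges_accumulate] failed in file 'fillcharges.cu' in line 73 : unknown error.
"""
    print(summarize_stderr(sample))
```

Grouping the results by kernel name and by device would show at a glance whether the failures follow one card or one kind of task.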

Damaraland
Joined: 7 Nov 09
Posts: 152
Credit: 16,181,924
RAC: 0
Message 13920 - Posted: 14 Dec 2009 | 9:46:18 UTC - in response to Message 13558.

I keep having this error on every unit.
What should I do?
<core_client_version>6.10.18</core_client_version>
<![CDATA[
<message>
Función incorrecta. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# Using CUDA device 0
# There are 2 devices supporting CUDA
# Device 0: "GeForce GTX 260"
# Clock rate: 1.41 GHz
# Total amount of global memory: 939196416 bytes
# Number of multiprocessors: 27
# Number of cores: 216
# Device 1: "GeForce 9800 GT"
# Clock rate: 1.37 GHz
# Total amount of global memory: 1073545216 bytes
# Number of multiprocessors: 14
# Number of cores: 112
MDIO ERROR: cannot open file "restart.coor"

</stderr_txt>
]]>
<core_client_version>6.10.18</core_client_version>
<![CDATA[
<message>
Función incorrecta. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# Using CUDA device 0
# There are 2 devices supporting CUDA
# Device 0: "GeForce GTX 260"
# Clock rate: 1.41 GHz
# Total amount of global memory: 939196416 bytes
# Number of multiprocessors: 27
# Number of cores: 216
# Device 1: "GeForce 9800 GT"
# Clock rate: 1.37 GHz
# Total amount of global memory: 1073545216 bytes
# Number of multiprocessors: 14
# Number of cores: 112
MDIO ERROR: cannot open file "restart.coor"
Cuda error: Kernel [shake_step_1] failed in file 'shake.cu' in line 79 : unspecified launch failure.

</stderr_txt>
]]>

Damaraland
Joined: 7 Nov 09
Posts: 152
Credit: 16,181,924
RAC: 0
Message 13923 - Posted: 14 Dec 2009 | 14:48:17 UTC - in response to Message 13920.

More and more...
task 1629422
task 1629408
task 1629383
task 1629337
task 1629322

Tonight I'll suspend the project until further news. If the admins want me to report any further information, I'll gladly help.

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 13929 - Posted: 14 Dec 2009 | 20:41:59 UTC - in response to Message 13923.

You should do the following:

Install and run GPU-Z. It is freeware and will let you see the temperatures of the GPUs. If they are over 70 degrees C when running GPUGrid tasks, you may have a heating / ventilation problem.
If you do, you could test this by leaving the case door off for a while, running more tasks, and checking the temperatures. If this is definitely a problem, either get a couple of extra system fans or manually turn up the fan speed on the card(s).
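
On hosts where the NVIDIA driver includes the `nvidia-smi` command-line tool, the same temperature check can also be scripted. The sketch below is only an illustration under that assumption: the function names are hypothetical, the 70-degree limit is simply the figure mentioned above passed in as a parameter, and the readings in the usage note are made-up sample values.

```python
import csv
import io
import subprocess

# Query index, name and core temperature for every GPU as header-less CSV.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_temperatures(csv_text):
    """Turn lines like '0, GeForce GTX 260, 66' into (index, name, temp_c) tuples."""
    readings = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != 3:
            continue  # skip blank or unexpected lines
        index, name, temp = (field.strip() for field in row)
        try:
            readings.append((int(index), name, int(temp)))
        except ValueError:
            continue  # e.g. a card that does not report a temperature
    return readings


def hot_gpus(readings, limit_c=70):
    """Return the readings whose temperature is above limit_c."""
    return [r for r in readings if r[2] > limit_c]


def read_temperatures():
    """Run nvidia-smi and parse its output (requires the tool to be installed)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_temperatures(result.stdout)
```

For example, `hot_gpus(parse_temperatures("0, GeForce GTX 260, 66\n1, GeForce 9800 GT, 82\n"))` returns only the second card. Running `read_temperatures()` once with tasks active and once with the project suspended gives the two numbers worth comparing.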

I would highly recommend that you uninstall the 9800GT. These cards use a G92 core and do not handle most of today’s tasks too well, especially hERG tasks!

This card is most likely causing ALL your failures.

If you do not, the card will eventually produce so many failures that you will get no new work. Your GTX260 will be doing at least 3 times the work of the 9800GT.

What are the GPU temperatures like with and without the 9800GT installed, both when running tasks and when not running tasks?

What is your PSU?

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1576
Credit: 5,794,611,851
RAC: 9,297,069
Message 13934 - Posted: 14 Dec 2009 | 22:08:55 UTC - in response to Message 13923.

More and more...
task 1629422
task 1629408
task 1629383
task 1629337
task 1629322

Tonight I'll suspend the project until further news. If the admins want me to report any further information, I'll gladly help.

When your card fails task after task in 10 seconds or less, it's likely that its internal state has become corrupted. Do a complete power cycle - power down the host and restart: that should allow most of those tasks (the GIANNI-BIND and the KASHIF-HIVPR, at least) to run properly - even on the 9800GT that SKGiven despises so much :-)
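
The "task after task in 10 seconds or less" pattern is easy to check for mechanically. Below is a rough sketch, assuming you have each task's run time and outcome in chronological order; the function name and the streak length of three are arbitrary choices for illustration, not anything BOINC or GPUGrid defines.

```python
def needs_power_cycle(tasks, max_runtime_s=10.0, min_streak=3):
    """Return True if the most recent tasks all failed almost immediately.

    tasks: list of (runtime_seconds, succeeded) tuples, oldest first.
    max_runtime_s: a failure at or below this run time counts as "instant".
    min_streak: how many instant failures in a row trigger the warning
                (hypothetical default, tune to taste).
    """
    streak = 0
    # Walk backwards from the newest task and count instant failures.
    for runtime_s, succeeded in reversed(tasks):
        if succeeded or runtime_s > max_runtime_s:
            break
        streak += 1
    return streak >= min_streak


if __name__ == "__main__":
    history = [(3600.0, True), (4.0, False), (3.0, False), (6.0, False)]
    if needs_power_cycle(history):
        print("Recent tasks failed within seconds: power the host down and restart.")
```

A task that fails after hours of run time breaks the streak on purpose, since that points to a different kind of problem than a card stuck in a bad state.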

Michael Goetz
Joined: 2 Mar 09
Posts: 124
Credit: 46,573,744
RAC: 115,639
Message 13937 - Posted: 14 Dec 2009 | 23:56:03 UTC - in response to Message 13929.

Install and run GPU-Z. It is freeware and will let you see the temperatures of the GPUs. If they are over 70 degrees C when running GPUGrid tasks, you may have a heating / ventilation problem.


That's good advice for the CPU.

The GPU is a different beast altogether.

Newer GPUs appear to be designed to run hotter. A *LOT* hotter. The latest generation of Nvidia GPUs (260, 280, 285, 295) are actually designed to run safely up to -- get this -- just over 100 degrees Celsius. In fact, when the fan is left on automatic, it won't even start to ramp up the fan speed until the temperature exceeds 70. Even then, it only increases the fan speed slightly and is clearly not trying to keep temps in the 70s.

Normal operating temperature under a heavy load seems to be around 80 to 85 degrees -- and that's with the fan still below 60% on automatic control.

So don't panic if you see GPU temps in the 70s or 80s. That's normal.

On the other hand, if your *CPU* is running that hot, you're possibly just on the brink of having it fail, depending on the model.



skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 13956 - Posted: 15 Dec 2009 | 20:26:43 UTC - in response to Message 13937.
Last modified: 15 Dec 2009 | 21:02:12 UTC

Install and run GPU-Z. It is freeware and will let you see the temperatures of the GPUs. If they are over 70 degrees C when running GPUGrid tasks, you may have a heating / ventilation problem.


That's good advice for the CPU.

The GPU is a different beast altogether.

So don't panic if you see GPU temps in the 70s or 80s. That's normal.


OK, GPU-Z measures GPU temperatures. It is not CPU-Z!

Although the cards can run hot, it is NOT Good for them to stay at that temperature for extended periods of time and certainly Not Normal!

My GTX260-216sp sits at 66 Degrees C when crunching GPUGrid and the fan is at 40%. That is my Normal and the card works 100% for ALL tasks. I would not like to hear the fan at 60% or run the GPU at 80 Degrees C.

Pull the 9800GT.
You are testing to find the problem, so trying things is important!

Damaraland
Joined: 7 Nov 09
Posts: 152
Credit: 16,181,924
RAC: 0
Message 14057 - Posted: 25 Dec 2009 | 15:39:20 UTC - in response to Message 13956.
Last modified: 25 Dec 2009 | 15:40:21 UTC

I'm sure it wasn't the heat: because I suspected that, I had been running it in a very cold room with the mainboard out of its case.
I had the GT on the same board as the GTX card. I moved the GTX to a brand-new board and everything has been fine since then... the wonderful world of computers...

Now I'm having problems with the Linux drivers for my GTX, but that's another story I'll handle with time and patience...

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 14062 - Posted: 26 Dec 2009 | 14:38:05 UTC - in response to Message 14057.

A pain I’m sure.
It ran for some time before it failed!
Not sure a 9500 GT is up to the task any more. Well, certainly not that task.
Was there a system restart, an update, or did you clear the cache, wipe free space...? Just on the off chance it is something other than BOINC/GPUGrid!

I tried my 8800GTS again, but all tasks failed. 4 failed within 3 seconds (no real problem), but the fifth failed after 13 hours or so. It is running another task. If it fails I will take it back off GPUGrid, but I might get another GT240 instead. The one I have uses less electricity and has successfully finished all 11 tasks so far.
