Message boards : Graphics cards (GPUs) : Lot of computation errors on all CUDA apps

The Brain QC
Joined: 27 Oct 08
Posts: 27
Credit: 3,211,916
RAC: 0
Message 15062 - Posted: 7 Feb 2010 | 20:32:09 UTC
Last modified: 7 Feb 2010 | 20:35:59 UTC

Hi guys, the title says it all: all my CUDA apps are crashing while computing.

I checked all my temperatures and benchmarked my graphics card to see if I had a temperature problem. No problem with my 9800 GX2 (running at 80°C at 100% usage); everything is OK except BOINC CUDA computation.

I'm running Windows 7 64-bit and the 196.21 GPU drivers with BOINC Manager 6.10.18.

If someone could give me some help, thank you in advance (sorry for my poor English).

Apps which crash:
- SETI CUDA: 100% crashing (no WU finished)
- Einstein CUDA: 100% crashing (crashes the moment a second GPU becomes active)
- GPUGrid: 85% crashing. Some WUs complete 100%, but I'm not able to compute on both GPUs; if I do, I get compute errors on all WUs.

liveonc
Joined: 1 Jan 10
Posts: 292
Credit: 41,567,650
RAC: 0
Message 15063 - Posted: 7 Feb 2010 | 20:47:00 UTC - in response to Message 15062.

Did you enable SLI? I don't own either a 9800 GX2 or a GTX 295, but I tried running my two GTX 260s in SLI once, before I read that it isn't supported. At that time I got 3 errors out of every 4 WUs.

The Brain QC
Message 15065 - Posted: 8 Feb 2010 | 4:01:51 UTC - in response to Message 15063.

I disabled the SLI switch and I'll give it a try for the next 48 hours. Thanks for the reply and the help.

Jeremy
Joined: 15 Feb 09
Posts: 55
Credit: 3,542,733
RAC: 0
Message 15079 - Posted: 8 Feb 2010 | 23:31:07 UTC - in response to Message 15065.
Last modified: 8 Feb 2010 | 23:31:48 UTC

Your CPU overclock is a bit aggressive. Does it pass Linpack/Intel Burn Test stress testing? I had issues with my overclock that only showed in GPU apps for some reason. Double-check your system stability if the SLI switch doesn't do anything for you. You need to be able to run FurMark and Intel Burn Test at the same time for 10-15 minutes without errors in either. Errors may not show in games or other BOINC apps, but they'll show in the CUDA apps.

I had to throw a little more voltage at my CPU to get the whole thing 100% stable.
____________
C2Q, GTX 660ti

The Brain QC
Message 15108 - Posted: 9 Feb 2010 | 21:09:59 UTC - in response to Message 15079.

Thank you Jeremy. My system is not overclocked as much as displayed (3.6 GHz), and it has tested "rock stable" for a long time. I ran CPU burn tests, GPU burn tests and memory tests. Everything seems OK, overclocked or not; I get the same computation errors with the system not overclocked.

I just don't understand what is happening; I never had this problem before switching from Vista 64 to Windows 7 64. Maybe it's GPU driver related, I just don't know.

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 15113 - Posted: 10 Feb 2010 | 0:50:18 UTC - in response to Message 15108.
Last modified: 10 Feb 2010 | 0:52:42 UTC

Perhaps one of your settings has changed, or needs to change?

Check that you have disabled "Use GPU When Computer is in Use", and make sure you use "Leave Applications in Memory While Suspended".
- BOINC Manager: Advanced View > Advanced Preferences > Processor Usage, and then Disk and Memory Usage.

Most of your failures are with Beta tasks, and many are after a short time, so it looks worse than it is.

Are you running any other GPU tasks? If so, stick to one at a time (especially for Beta tests)! You might even want to disable Betas in your online preferences.

One last thing to try: set BOINC to use 75% of the CPUs rather than 100% (only applicable if you are running CPU tasks as well as GPUGrid tasks).
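The same three settings can also be kept in a local override file instead of being clicked through the Manager. A minimal sketch in Python that builds such a file, assuming the element names run_gpu_if_user_active, leave_apps_in_memory and max_ncpus_pct as used in the BOINC client's global_prefs_override.xml (check them against the documentation for your client version; the function name is made up for illustration):

```python
import xml.etree.ElementTree as ET


def build_prefs_override(cpu_percent=75.0):
    """Build a preferences override tree with the settings suggested above.

    The element names are assumptions taken from BOINC's
    global_prefs_override.xml format; verify them for your client version.
    """
    root = ET.Element("global_preferences")
    settings = {
        # "Use GPU when computer is in use" switched off
        "run_gpu_if_user_active": "0",
        # "Leave applications in memory while suspended" switched on
        "leave_apps_in_memory": "1",
        # Use at most this percentage of the CPUs
        "max_ncpus_pct": f"{cpu_percent:.1f}",
    }
    for name, value in settings.items():
        ET.SubElement(root, name).text = value
    return root


xml_text = ET.tostring(build_prefs_override(), encoding="unicode")
print(xml_text)
```

The printed XML would go into global_prefs_override.xml in the BOINC data directory, after which the client has to re-read its local preferences.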

Good Luck,

Zydor
Joined: 8 Feb 09
Posts: 252
Credit: 1,309,451
RAC: 0
Message 15118 - Posted: 10 Feb 2010 | 13:21:27 UTC - in response to Message 15113.

I noticed you have 196.21 loaded. NVIDIA acknowledged a bug in 196.21 within days of release that prevented overclocking. It certainly stopped my 9800GTX in its tracks; although it did not seem to affect everyone, it was (is) widespread. It may be that some side effect of that is causing your issues. Pure guess of course, but given the nature of the bug affecting o/c, it's not impossible.

They released 196.34 days afterwards, a beta release whose only change is a fix for the o/c bug in the 196.21 WHQL driver. Maybe worth a shot.

Regards
Zy

Jeremy
Message 15119 - Posted: 10 Feb 2010 | 15:19:55 UTC - in response to Message 15118.

Just to clarify, that bug only affects software overclocks. If the overclock is burned into the BIOS the card won't be affected by that bug.
____________
C2Q, GTX 660ti

liveonc
Message 15123 - Posted: 10 Feb 2010 | 17:14:25 UTC - in response to Message 15119.
Last modified: 10 Feb 2010 | 17:15:11 UTC

I must agree that 3.6 GHz is a bit aggressive on a Q6600, especially for 24/7 use. I only have mine up to 3.0 GHz. I've OC'd a Q6600 for 24/7 use up to 3.2 GHz, but unless you're on liquid cooling, I don't see how it's going to be a good idea to OC so much, and the 9800 GX2 as well. I got errors, and lots of them, when I OC'd an 8800GT to more than Core Clock 700MHz (vs. 600MHz standard), Shader Clock 1728MHz (vs. 1500MHz standard) & 2000MHz (vs. 1800MHz standard).
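To put those clock figures on one scale, here is a small Python sketch that turns each pair into a percentage over standard. The 8800GT numbers come from the post (the third pair is unlabeled there, so it is left unlabeled here); the 2400 MHz stock clock for the Q6600 is not stated in this thread and is an assumption based on Intel's published specification.

```python
def overclock_percent(actual_mhz, stock_mhz):
    """Return how far a clock sits above its stock value, in percent."""
    return (actual_mhz - stock_mhz) / stock_mhz * 100.0


# (overclocked MHz, standard MHz); 8800GT values as quoted in the post
clocks = {
    "8800GT core": (700, 600),
    "8800GT shader": (1728, 1500),
    "8800GT third clock (unlabeled in the post)": (2000, 1800),
    # Assumption: Q6600 stock clock of 2400 MHz, not stated in this thread
    "Q6600 CPU": (3600, 2400),
}

for name, (actual, stock) in clocks.items():
    print(f"{name}: +{overclock_percent(actual, stock):.1f}% over standard")
```

Seen this way, the GPU overclocks that produced errors were in the 11-17% range, while 3.6 GHz on the CPU would be a much larger step over stock.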

The Brain QC
Message 15126 - Posted: 10 Feb 2010 | 19:31:07 UTC - in response to Message 15123.

Thank you all for the answers.

Liveonc, my GPU isn't OCed, and my CPU is very well watercooled with first-choice watercooling parts. With the OCCT test that really makes your CPU burn (the Linpack test, I think), my CPU temperature stays at 57°C for the hottest core, which is very good under such a burn test.

I really don't think it's CPU related, but I built a new machine for testing purposes and I'm investigating where the problem is.

At this moment, I can say that my suspicion of a graphics driver issue under Windows 7 64 seems to be right. I reinstalled a Ghost image of Vista and there are no more errors with that OS.

The Brain QC
Message 15681 - Posted: 10 Mar 2010 | 23:22:53 UTC - in response to Message 15126.

I found the answer to my question about these compute errors.


Listen carefully, owners of dual-GPU video cards.

The problem seems to be driver related. GPU compute apps and BOINC detect two GPUs but seem to give the same GPU two jobs, which makes one of the two jobs crash.

To prevent such crashes, go to the NVIDIA control panel and deactivate the SLI switch. BOINC then still recognizes two GPUs, and the driver gives one job to each GPU chip instead of two jobs to the same GPU.
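The failure pattern described here (two jobs landing on one GPU chip while the other sits idle) can be pictured with a toy Python sketch. It does not query the driver or BOINC; the task names and device indices are hypothetical and only illustrate the difference between the two cases.

```python
from collections import defaultdict


def find_oversubscribed(assignments):
    """Return the devices that were handed more than one job.

    `assignments` is a list of (task_name, device_index) pairs.
    """
    jobs_per_device = defaultdict(list)
    for task, device in assignments:
        jobs_per_device[device].append(task)
    return {dev: tasks for dev, tasks in jobs_per_device.items() if len(tasks) > 1}


# Hypothetical assignments: with SLI on, both jobs end up on chip 0
with_sli = [("wu_a", 0), ("wu_b", 0)]
# With SLI off, each chip of the dual-GPU card gets one job
without_sli = [("wu_a", 0), ("wu_b", 1)]

print(find_oversubscribed(with_sli))      # chip 0 holds both jobs
print(find_oversubscribed(without_sli))   # nothing is doubled up
```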

[AF] Profanateur
Joined: 25 Oct 08
Posts: 42
Credit: 42,812,268
RAC: 0
Message 15689 - Posted: 11 Mar 2010 | 15:35:00 UTC

I always get:

11/03/2010 16:00:16 GPUGRID Output file g106-TONI_CAPBIND99SB-16-100-RND9786_0_1 for task g106-TONI_CAPBIND99SB-16-100-RND9786_0 absent
11/03/2010 16:00:16 GPUGRID Output file g106-TONI_CAPBIND99SB-16-100-RND9786_0_2 for task g106-TONI_CAPBIND99SB-16-100-RND9786_0 absent
11/03/2010 16:00:16 GPUGRID Output file g106-TONI_CAPBIND99SB-16-100-RND9786_0_3 for task g106-TONI_CAPBIND99SB-16-100-RND9786_0 absent
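When the message log is long, lines like these are easier to review grouped by task. A minimal Python sketch, assuming every such line follows exactly the layout shown above (date, time, project, "Output file ... for task ... absent"); the function name is made up for illustration:

```python
import re
from collections import defaultdict

# Matches lines of the form shown in the post above
ABSENT_RE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<project>\S+) Output file (?P<file>\S+) for task (?P<task>\S+) absent$"
)


def absent_outputs(lines):
    """Group missing output files by the task they belong to."""
    missing = defaultdict(list)
    for line in lines:
        match = ABSENT_RE.match(line.strip())
        if match:
            missing[match.group("task")].append(match.group("file"))
    return dict(missing)


log = [
    "11/03/2010 16:00:16 GPUGRID Output file g106-TONI_CAPBIND99SB-16-100-RND9786_0_1 for task g106-TONI_CAPBIND99SB-16-100-RND9786_0 absent",
    "11/03/2010 16:00:16 GPUGRID Output file g106-TONI_CAPBIND99SB-16-100-RND9786_0_2 for task g106-TONI_CAPBIND99SB-16-100-RND9786_0 absent",
    "11/03/2010 16:00:16 GPUGRID Output file g106-TONI_CAPBIND99SB-16-100-RND9786_0_3 for task g106-TONI_CAPBIND99SB-16-100-RND9786_0 absent",
]

for task, files in absent_outputs(log).items():
    print(f"{task}: {len(files)} output file(s) absent")
```

All three lines here belong to a single task, which fits a work unit that died before writing any output rather than three separate failures.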

[boinc.at] Nowi
Joined: 4 Sep 08
Posts: 44
Credit: 3,685,033
RAC: 0
Message 15691 - Posted: 11 Mar 2010 | 16:45:23 UTC

I have the same problem here, but it started with Collatz WUs. I have two GPUs (GTX260 and 8800GT, so no SLI) and I get this error for all GPU tasks from every project. It seems that BOINC couldn't find a CUDA device. This happened without any interaction with the computer (the user was miles away :-)). Is it BOINC or driver related?

A reboot solves the problem!

BOINC 6.10.36 NVIDIA 196.21 WIN7 64 Bit

DataC
Joined: 15 Feb 10
Posts: 9
Credit: 16,891,220
RAC: 0
Message 15978 - Posted: 25 Mar 2010 | 19:15:45 UTC

Hi. I was told to post here, but I've seen many other people in this forum who have had similar problems and apparently never got them resolved...

Every work unit that downloads fails in 10 seconds or less. None of my other CUDA apps crash or fail, and I have tried just about every version of my video card driver that I can find: nada.

If anyone has any ideas, PLEASE let me know. Here are my system specs:


OS : Windows 7/Windows Server 2008 R2 Version 6.1 Build 7600

Number of processors : 2
Processor type : x86 Intel Pentium, Level 6, Revision 5898
CPU speed: 2898 MHz
Total physical memory: 4124988 KB
Available physical memory: 3037680 KB
Total virtual memory: 2097024 KB
Available virtual memory: 1940944 KB

Number of CUDA devices : 1
Device 0 : GeForce 9800 GT

GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 14 Mar 07
Posts: 1947
Credit: 629,356
RAC: 0
Message 15983 - Posted: 26 Mar 2010 | 8:47:15 UTC - in response to Message 15978.

Try to install the 196.75 driver.


gdf

MarkJ
Volunteer moderator
Volunteer tester
Joined: 24 Dec 08
Posts: 738
Credit: 200,909,904
RAC: 0
Message 16017 - Posted: 27 Mar 2010 | 22:08:31 UTC
Last modified: 27 Mar 2010 | 23:51:15 UTC

I had a couple die on me this morning.

# There are 2 devices supporting CUDA
# Device 0: "GeForce GTX 295"
# Clock rate: 1.24 GHz
# Total amount of global memory: 939524096 bytes
# Number of multiprocessors: 30
# Number of cores: 240
# Device 1: "GeForce GTX 295"
# Clock rate: 1.24 GHz
# Total amount of global memory: 939524096 bytes
# Number of multiprocessors: 30
# Number of cores: 240
MDIO ERROR: cannot open file "restart.coor"
SWAN : FATAL : Cuda driver error 3e7 in file '../swan/swanlib_nv.cpp' in line 186.
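The error code in the last line looks like hexadecimal. A short Python sketch that pulls the pieces out of such stderr output, assuming the code is hex and the line layout is always as shown above (the function name is made up for illustration):

```python
import re


def parse_swan_error(text):
    """Extract the error code, source file and line from a SWAN fatal error."""
    match = re.search(
        r"Cuda driver error ([0-9a-fA-F]+) in file '([^']+)' in line (\d+)", text
    )
    if match is None:
        return None
    return {
        "code": int(match.group(1), 16),  # assumption: the code is printed in hex
        "file": match.group(2),
        "line": int(match.group(3)),
    }


stderr_text = """# There are 2 devices supporting CUDA
# Device 0: "GeForce GTX 295"
# Total amount of global memory: 939524096 bytes
# Device 1: "GeForce GTX 295"
# Total amount of global memory: 939524096 bytes
MDIO ERROR: cannot open file "restart.coor"
SWAN : FATAL : Cuda driver error 3e7 in file '../swan/swanlib_nv.cpp' in line 186.
"""

error = parse_swan_error(stderr_text)
devices = re.findall(r'^# Device (\d+): "([^"]+)"', stderr_text, re.MULTILINE)
print(error["code"], error["file"], error["line"])
print(devices)
print(939524096 // (1024 * 1024), "MiB per device")
```

Read as hex, 3e7 is 999 decimal. If SWAN prints the raw CUDA driver API result code, that value is CUDA_ERROR_UNKNOWN, which unfortunately says little about the cause. The reported memory works out to 896 MiB per GPU.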

GDF, have you guys tried the 197.13 drivers? Is there any point in updating to them? Currently I am running 196.21.

Now that the NDA with NVIDIA has expired, is it possible to use the CUDA 3.0 DLLs in the faint hope they will fix something?
____________
BOINC blog
