Message boards : Number crunching : SANTI WU Killed My GPU

tomba
Joined: 21 Feb 09
Posts: 497
Credit: 700,690,702
RAC: 0
Message 33832 - Posted: 11 Nov 2013 | 8:48:21 UTC

Win7 Home, ASUS GTX 660, GPUGrid 24/7:

Overnight my PC died. Restarted. When BOINC came up I got three black screens, a beep and a STOP 116. Restarted in safe mode w/ networking. Found the STOP was about the GPU. Unchecked the two BOINC items in msconfig and did a normal boot. All was well.

Started boincmgr manually. It died in the same way. Rebooted. Found about 15 files relating to the current WU and deleted them. Started boincmgr. The current WU was still there but I just had time to do a suspend. Immediately the WU errored and I got a new WU. All is now working properly.

Here is the offending WU. What the heck happened??

TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Message 33843 - Posted: 12 Nov 2013 | 15:23:11 UTC - in response to Message 33832.
Last modified: 12 Nov 2013 | 15:23:34 UTC

I don't know what happened, but I have seen that many, many times on my 660 with Santi SRs, so I now run only LRs on the 660. However, today I also had a fatal CUDA driver error. In my case it only caused the GPU clock to drop. I rebooted and it is okay again.
It seems that in your case the error was not handled properly and BOINC got stuck in a sort of loop. Luckily you were able to find the offending files and delete them.
____________
Greetings from TJ

Jim1348
Joined: 28 Jul 12
Posts: 806
Credit: 1,561,695,471
RAC: 0
Message 33844 - Posted: 12 Nov 2013 | 15:36:22 UTC - in response to Message 33832.

tomba wrote:
Here is the offending WU. What the heck happened??

I see you are still on the 327.23 drivers. Why not try 331.65? I think they implement a later version of CUDA, and work fine on my GTX 660s on the Longs, including many Santis.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 33855 - Posted: 12 Nov 2013 | 22:06:05 UTC

Sounds similar to this problem, doesn't it?

MrS
____________
Scanning for our furry friends since Jan 2002

TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Message 33864 - Posted: 13 Nov 2013 | 16:16:31 UTC - in response to Message 33855.
Last modified: 13 Nov 2013 | 16:16:43 UTC

ExtraTerrestrial Apes wrote:
Sounds similar to this problem, doesn't it?

MrS

I don't think so, as there was no power outage in my case; I cannot speak for Tomba.
And even when one of my PSUs burnt out and the main fuse blew, and all PCs were abruptly shut down, the GPUGRID WUs started nicely without problems after I rebooted the systems.
So I think this is caused by something in the Santi WUs, as I have seen it only with these. And I watch my systems closely.
____________
Greetings from TJ

Storm
Joined: 9 Jul 10
Posts: 1
Credit: 296,167,092
RAC: 0
Message 33869 - Posted: 14 Nov 2013 | 6:30:25 UTC

In the last 24 hours I have had a very similar issue where the nVidia kernel keeps crashing when doing GPU units.
Today I removed the nVidia drivers and cleaned my system of any profiles or configs left over from them. I then installed the latest 331.65 drivers for my dual GTX 670s in SLI mode and suspended all GPU jobs.
Then I re-enabled them one at a time (I only run GPU work for S@H and GPUGRID). The S@H ones have now been running for over an hour. I then disabled S@H and tried GPUGRID, and within seconds the screen locked and then reported the kernel crash from nVidia.
Thankfully I could re-suspend the GPU and gain control again. I suspended the active GPU unit and allowed another one to start working, and it has not locked up since. I've now aborted the offending unit.
As in the title of this thread, it was a SANTI that was causing the issue, more precisely this one: I223-SANTI_baxbim2-21-32-RND7738_0
http://www.gpugrid.net/workunit.php?wuid=4919966

The interesting thing now is that I have another SANTI working just fine: I104-SANTI_baxbim2-23-32-RND3377_0, but we'll wait and see over the next 24 hours.

Storm

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2334
Credit: 16,178,080,749
RAC: 70,291
Message 33870 - Posted: 14 Nov 2013 | 9:33:02 UTC - in response to Message 33869.

Storm wrote:
Today I removed the nVidia drivers and cleaned my system of any profiles or configs left over from them. I then installed the latest 331.65 drivers for my dual GTX 670s in SLI mode and suspended all GPU jobs.

FYI: SLI is not recommended while crunching on the GPUs.

MJH
Project administrator
Project developer
Project scientist
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 33873 - Posted: 14 Nov 2013 | 9:58:22 UTC

New Beta app coming later today that should help with this.

MJH

TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Message 33979 - Posted: 22 Nov 2013 | 14:40:04 UTC

Yesterday another Santi LR resulted in this error: SWAN : FATAL : Cuda driver error 715 in file 'swanlibnv2.cpp' in line 1969.
As a result the GPU was downclocked, so I needed to reboot again.
There was no power failure or power cut. It is something with Santi's WUs and the GTX 660, as I have not yet seen this error on my GTX 770, which has been running since August.
____________
Greetings from TJ

Jim1348
Joined: 28 Jul 12
Posts: 806
Credit: 1,561,695,471
RAC: 0
Message 33980 - Posted: 22 Nov 2013 | 15:36:36 UTC - in response to Message 33979.
Last modified: 22 Nov 2013 | 15:58:10 UTC

TJ wrote:
It is something with Santi's WUs and the GTX 660, as I have not yet seen this error on my GTX 770, which has been running since August.

Maybe I have just been lucky, but ever since implementing my under-clocking and over-volting fix (http://www.gpugrid.net/forum_thread.php?id=3466&nowrap=true#33677), I have not had an error, or even an instance of slow-running.

The GTX 660s may be more susceptible to problems if, for example, they are factory overclocked further, or don't have heatsinks as large as the others relative to their heat output. But it appears that they can be fixed, and there is nothing inherently wrong with the chip itself for running the current work units.

By the way, I now suspect that the slow-running is triggered by hitting the GPU power limit, which probably causes the self-protective circuitry in the chip to reduce its clock rate. It then never returns to the higher rate until you reboot. Increasing the power limit helps avoid this problem, as long as you monitor the resulting temperature (the limit is there for a reason).
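
For anyone who wants to test this suspicion on their own card, here is a minimal sketch. It assumes the readings come from `nvidia-smi --query-gpu=clocks.sm,power.draw,power.limit,temperature.gpu --format=csv,noheader,nounits`, and that the card actually reports all four fields (some GeForce cards print "[Not Supported]" instead, which this sketch does not handle). The names `GpuSample`, `parse_sample`, `looks_throttled` and `near_power_limit`, and the 10% and 5% thresholds, are made up for illustration and are not part of any GPUGRID or BOINC tool.

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    """One reading of clock, power and temperature (hypothetical helper type)."""
    sm_clock_mhz: float
    power_draw_w: float
    power_limit_w: float
    temp_c: float


def parse_sample(csv_line: str) -> GpuSample:
    """Parse one CSV line in the order: SM clock, power draw, power limit, temperature."""
    clock, draw, limit, temp = (float(field.strip()) for field in csv_line.split(","))
    return GpuSample(clock, draw, limit, temp)


def looks_throttled(sample: GpuSample, expected_clock_mhz: float, tolerance: float = 0.10) -> bool:
    """True if the clock is more than `tolerance` below the clock you normally see."""
    return sample.sm_clock_mhz < expected_clock_mhz * (1.0 - tolerance)


def near_power_limit(sample: GpuSample, margin: float = 0.05) -> bool:
    """True if power draw is within `margin` of the reported power limit."""
    return sample.power_draw_w >= sample.power_limit_w * (1.0 - margin)


if __name__ == "__main__":
    # Example values typed in by hand, not real measurements.
    sample = parse_sample("1032, 118.5, 120.0, 63")
    print("throttled:", looks_throttled(sample, expected_clock_mhz=1032.0))
    print("near power limit:", near_power_limit(sample))
```

Logging a line like this every minute or so would show whether the clock drop follows a period of running close to the power limit, which is what the guess above would predict.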

TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Message 33981 - Posted: 22 Nov 2013 | 18:18:08 UTC - in response to Message 33980.

My 660 is not factory overclocked, and I have set the clock speed a little lower than it would be without intervention.
The temperature is okay, I think, as it runs steadily at 63°C.
My 660 has also run for 21 days continuously without any error, so indeed, as you say, Jim, the 660 is okay for this project.
____________
Greetings from TJ

Jim1348
Joined: 28 Jul 12
Posts: 806
Credit: 1,561,695,471
RAC: 0
Message 33982 - Posted: 22 Nov 2013 | 19:54:18 UTC - in response to Message 33981.

Yes, you look good.

lukeu
Joined: 14 Oct 11
Posts: 29
Credit: 72,875,504
RAC: 180
Message 33984 - Posted: 23 Nov 2013 | 0:07:56 UTC

Just chipping in with a "me too" post:

GTX 660, 311.06 drivers, 112x-SANTI_MAR419cap-0-8-RND0309

The driver crashed about 4 times, followed by a spontaneous reboot. It did that twice before I was able to locate and terminate this WU through the BOINC GUI. (The GPU was underclocked during the 2nd attempt.)

Coleslaw
Joined: 24 Jul 08
Posts: 36
Credit: 232,947,380
RAC: 1,650
Message 33985 - Posted: 23 Nov 2013 | 0:58:45 UTC

I have a box with two GT 430s that are having this problem.

Q8200 CPU
4GB RAM
Win 7 Pro 64-bit

Other GPU projects run fine. I also have a BETA from GPUGrid running on it OK. It is when I resume the SANTI work units that I get a BSOD after the drivers crash a few times. This happened with driver version 331.82.

And of course I kill the BETA by forgetting to exit BOINC when I'm running some updates.