Message boards : Graphics cards (GPUs) : Eight compute errors in a row

Author Message
anthonmg
Joined: 11 Apr 09
Posts: 17
Credit: 11,086,149
RAC: 0
Level
Pro
Message 10177 - Posted: 26 May 2009 | 6:14:43 UTC

So, after finally getting the server to send me work, the next units went down with compute errors. They all failed within 5-10 seconds of starting and gave me an on-screen error message. I didn't get a chance to copy it down, but an exe was failing to start. The units came from a few different subprojects (KASHIF, IBUCH, etc.). After running so smoothly for over a month, it's weird to suddenly have so many problems. I'll try to capture the error when the server sends more work units. It won't right now, saying I exceeded my quota and that the server has no work.

Does anyone know if BOINC has another GPU-intensive application other than SETI? Milkyway never seems to have any work. Any other projects?

Profile Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Level
Val
Message 10186 - Posted: 26 May 2009 | 12:39:47 UTC - in response to Message 10177.

At the moment only three projects have a "steady" supply of work: GPU Grid, SaH, and SaH Beta.

The Lattice Project, Ramsey, Aqua, and Milky Way are all in the start-up phase with unknown work availability. Aqua just came online in the last 24 hours. I have not tried them yet, but they have been having problems, so I don't know their status.

Milky Way is just getting going so they do not yet have work.

The Lattice Project always has been intermittent with work and they had a limited CUDA test but I don't know if they are issuing work at this time.

Einstein said they were on the verge of a CUDA release, but have not said anything positive about it in a month or so ...

uBronan
Joined: 1 Feb 09
Posts: 139
Credit: 575,023
RAC: 0
Level
Gly
Message 10189 - Posted: 26 May 2009 | 13:36:00 UTC

Well, there is another project, "folding@home". They make their own application, which runs on many video cards and also supports SMP, meaning 4 CPUs can work on 1 unit. I am not sure if the GPU client also uses the CPUs, but it probably will.
The client runs on ATI and Nvidia GPUs, SMP, and normal CPUs, so it is an impressive build in any case. Sadly they turned their back on BOINC, so it's not a BOINC project anymore.
They reported BOINC to be much too unstable a platform :D

anthonmg
Joined: 11 Apr 09
Posts: 17
Credit: 11,086,149
RAC: 0
Level
Pro
Message 10195 - Posted: 26 May 2009 | 18:30:29 UTC - in response to Message 10189.

Yeah, and the last time I tried to install the GPU version of the Folding At Home client (a few weeks ago) it was an impressively complicated install. I've been using computers since I was 4, so I know I can do it, but the effort involved and the amount of changes made to my computer (and this rig is actually important for some other things) mean I'm not gonna do it.

Hmm, I'm still getting the No-work available error from GPUgrid so can't tell if the situation is fixed or not.

Profile Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Level
Val
Message 10199 - Posted: 26 May 2009 | 18:46:32 UTC - in response to Message 10195.

You will likely get that message for 24 hours ... try Aqua, they may have their issues worked out. I have not tried it myself, but you may have nothing to lose but some time ...

There are a couple threads on the boards about the CUDA experience there though they don't have much in them yet ... be the first ... start a fashion ... :)

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Message 10207 - Posted: 26 May 2009 | 21:06:01 UTC
Last modified: 26 May 2009 | 21:08:58 UTC

anthonmg,

you had one task which failed after quite some time and subsequently all other WUs failed immediately with BOINC reporting "device emulation" again. Could be your computer needed a restart after the crash. In such a state I suppose all / most other 3D software will fail as well.
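
As a rough illustration of that pattern (one failure, then every following WU failing within seconds until a restart), here is a minimal Python sketch. The `TaskResult` record, the 10-second cutoff and the function name are all made up for the example; they are not anything BOINC or GPUGRID provides.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TaskResult:
    """Hypothetical record of one work unit (not a BOINC data structure)."""
    name: str
    runtime_s: float  # how long the task ran before finishing or failing
    ok: bool          # True if it completed without a compute error


def find_cascade(results: List[TaskResult],
                 quick_fail_s: float = 10.0,
                 min_follow_ups: int = 2) -> Optional[int]:
    """Return the index of the first failed task that is directly followed
    by at least `min_follow_ups` consecutive quick failures, else None."""
    for i, result in enumerate(results):
        if result.ok:
            continue
        follow_ups = 0
        for later in results[i + 1:]:
            # Count only failures that died almost immediately.
            if not later.ok and later.runtime_s <= quick_fail_s:
                follow_ups += 1
            else:
                break
        if follow_ups >= min_follow_ups:
            return i
    return None
```

If this returns an index, the task at that position is the likely trigger and the quick failures after it are probably just fallout, which is why a restart is worth trying before anything else.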

182.65 is an unusual driver. If you experience more problems try to remove it and try 182.50 (WHQL).

uBronan,

I wouldn't say "is not a BOINC project any more". Actually they never were. They are among the DC pioneers and were running great long before BOINC. Apparently they evaluated it at some point and were not pleased.

Actually I've been running folding@home GPU1 on my old ATI 1950Pro. That was cool, the first GPU crunching, long before anyone said "CUDA" officially :D
Oh, why I mention it: I thought it was pretty easy and straightforward to set up. It only got messy with new and / or multiple cards.

MrS
____________
Scanning for our furry friends since Jan 2002

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Message 10210 - Posted: 26 May 2009 | 21:23:57 UTC

anthonmg,

I've just seen you already asked the same question (well, asked for help by stating the problem) in the other thread and got plenty of help over there. Please try to keep the discussion focused, so people's time is not wasted.

MrS
____________
Scanning for our furry friends since Jan 2002

anthonmg
Joined: 11 Apr 09
Posts: 17
Credit: 11,086,149
RAC: 0
Level
Pro
Message 10217 - Posted: 26 May 2009 | 23:50:35 UTC - in response to Message 10210.

Sorry about that. This was originally a thread about a problem related to the main one, but when the second problem cropped up, seemingly unrelated to the first, I started a new thread on it and got different help in each one. It's finally working.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Message 10233 - Posted: 27 May 2009 | 21:41:47 UTC - in response to Message 10217.

Glad to hear the problem is solved!

MrS
____________
Scanning for our furry friends since Jan 2002

outlnder
Joined: 9 Apr 09
Posts: 7
Credit: 25,115,977
RAC: 0
Level
Val
Message 10237 - Posted: 28 May 2009 | 0:57:22 UTC

I have one GPU that has a runtime error every week or so. It kills one running WU then kills the next 5 or 6. After a reboot, it runs fine for the next week.

Any ideas why this one machine has this problem and how I can fix it??

Profile Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Level
Val
Message 10239 - Posted: 28 May 2009 | 2:52:57 UTC - in response to Message 10237.

If you raised the OC, lower it ... check the fans for dirt, check running temps, make sure you are running one of the "approved" versions of the drivers for your OS ... check for viruses and malware ...

and not to be too flip, reboot the machine every three days ... :)
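
For the "check running temps" part, here is a minimal sketch in Python. It assumes a driver recent enough to ship `nvidia-smi` with `--query-gpu` support, which may not be true for the drivers discussed in this thread. The function names and the 80 C limit are arbitrary choices for illustration, not recommended values.

```python
import subprocess
from typing import Dict, List


def parse_temps(csv_text: str) -> Dict[int, int]:
    """Parse lines like '0, 65' (GPU index, temperature in C) into a dict."""
    temps = {}
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, temp = (field.strip() for field in line.split(","))
        temps[int(index)] = int(temp)
    return temps


def hot_gpus(temps: Dict[int, int], limit_c: int = 80) -> List[int]:
    """Return the indices of GPUs at or above the (illustrative) limit."""
    return sorted(i for i, t in temps.items() if t >= limit_c)


def read_temps() -> Dict[int, int]:
    """Ask nvidia-smi for the current temperature of every GPU.
    Assumes nvidia-smi is installed and on the PATH."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_temps(out)
```

Calling `hot_gpus(read_temps())` every few minutes and logging the result would show whether one card runs consistently hotter than the others before the errors start.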

outlnder
Joined: 9 Apr 09
Posts: 7
Credit: 25,115,977
RAC: 0
Level
Val
Message 10242 - Posted: 28 May 2009 | 4:28:48 UTC

This one GPU does run hotter than my other 3, but I have increased the fan to 90%.

Is runtime an OS program or Nvidia or CUDA??

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Message 10272 - Posted: 28 May 2009 | 21:23:00 UTC

I checked one of your hosts and it gets quite some errors. It's running at 1.40 GHz, whereas standard is 1.25 GHz. Could be clock speed.. keep in mind that individual chips and their frequency / temperature headroom are different and they degrade over time.

Runtime just means the error happened while you actually run the app, not during compilation or whatever.

MrS
____________
Scanning for our furry friends since Jan 2002

outlnder
Joined: 9 Apr 09
Posts: 7
Credit: 25,115,977
RAC: 0
Level
Val
Message 10283 - Posted: 29 May 2009 | 6:19:05 UTC

This MSI card is factory OC'ed. I have not done any overclocking myself. In fact, EVGA Precision shows it at less than MSI claims, 648 to 655 respectively.

I did OC the entire system and have cut that back a bit. Maybe that might help.

Profile Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 10304 - Posted: 29 May 2009 | 15:43:34 UTC - in response to Message 10242.
Last modified: 29 May 2009 | 15:51:26 UTC

This one GPU does run hotter than my other 3, but I have increased the fan to 90%.

outlnder, since that card is running hotter than your other 3 MSI factory OCed GTX 260s, it most likely has a problem, maybe a heatsink that isn't getting good contact. Since it's a very new card you may want to consider an RMA. Have you tried it on a different machine (not that you have many to spare :-) )?

Edit: It looks like you might have already swapped it with a different card since the failing one is listed as your oldest client. Personally I'd RMA it.

outlnder
Joined: 9 Apr 09
Posts: 7
Credit: 25,115,977
RAC: 0
Level
Val
Message 10322 - Posted: 30 May 2009 | 9:39:59 UTC

The last errored-out WU also errored the Docking WUs being done by the CPU. This tends to tell me that it isn't the GPU that is causing this problem. I will continue to watch it, but I think it may be the OS causing the errors.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Message 10502 - Posted: 12 Jun 2009 | 20:24:48 UTC - in response to Message 10322.

That may be a very useful observation. Swap GPUs with a system that works. If the errors travel with the GPU it looks like hardware (clock speed / temperature), but if the same machine errors out then it's not the GPU and an RMA won't help.
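
The logic of that swap test can be written down as a small Python sketch. The host and GPU labels and the function name are hypothetical; the input is simply whichever (host, GPU) combinations produced errors after the swap.

```python
from typing import Set, Tuple


def interpret_swap_test(bad_host: str,
                        bad_gpu: str,
                        failures_after_swap: Set[Tuple[str, str]]) -> str:
    """Decide where the fault travels after moving `bad_gpu` out of
    `bad_host`. `failures_after_swap` holds (host, gpu) pairs that errored."""
    # Did the suspect GPU keep failing in a different machine?
    gpu_still_fails = any(
        gpu == bad_gpu and host != bad_host
        for host, gpu in failures_after_swap
    )
    # Did the suspect machine keep failing with a different GPU?
    host_still_fails = any(
        host == bad_host and gpu != bad_gpu
        for host, gpu in failures_after_swap
    )
    if gpu_still_fails and host_still_fails:
        return "both"
    if gpu_still_fails:
        return "gpu"    # hardware suspect: clocks, temperature, RMA
    if host_still_fails:
        return "host"   # not the GPU: look at drivers, RAM, CPU overclock
    return "inconclusive"
```

An "inconclusive" result only means no errors were seen yet. With a fault that shows up once a week, the test has to run at least that long.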

People frequently blame the OS if anything goes wrong.. but mostly that's not the reason. There could be file corruption or some dodgy driver installation, but there are also CPUs overclocked too much and defective memory or, much more common, RAM set to wrong timings, either by the user during OC or by the BIOS in automatic mode.

MrS
____________
Scanning for our furry friends since Jan 2002
