Message boards : Graphics cards (GPUs) : Advice for eliminating errors

Jim1348
Joined: 28 Jul 12
Posts: 446
Credit: 1,102,915,752
RAC: 2,217,428
Message 28672 - Posted: 22 Feb 2013 | 15:07:45 UTC

I have been on GPUGRID for only a week now, running only long work units on a GTX 560 and a GTX 650 Ti, but a pattern has become clear. All my work units (16 Noelia and 1 Toni) have completed successfully. But I see that many of them (5 Noelia and 1 Toni) have errored-out on other PCs, a couple of them four times. Clearly, there is a problem.

To cut to the chase, my many years on Folding@home suggest the following:
Don't overclock.
Keep your cards cool.
Have an adequate power supply (good quality is better than big, by the way).
Do a clean uninstall of old drivers before installing new ones.

For the last point, I describe what I have learned the hard way (most people don't know it) here:
http://www.gpugrid.net/forum_thread.php?id=3296&nowrap=true#28665

Lots of luck, but I will miss the work you are sending me by your errors.

Toni
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 9 Dec 08
Posts: 554
Credit: 4,273,184
RAC: 0
Message 28859 - Posted: 28 Feb 2013 | 9:33:19 UTC - in response to Message 28672.

Good summary. For more details, see the detailed FAQ in another thread.

John C MacAlister
Joined: 17 Feb 13
Posts: 177
Credit: 131,725,186
RAC: 0
Message 28908 - Posted: 2 Mar 2013 | 5:48:18 UTC - in response to Message 28859.

Hi, Jim1348:

All good advice. I also started on GPUGrid a week or so ago, running short and long tasks. I was so pleased with the results produced by my new GTX 650 Ti on one of my AMD A10 5800K machines that I went and bought another GTX 650 Ti, which is being installed on my AMD Phenom 1090T. It should be in production by Saturday night at the latest. The decision was made easier by the $80 total price reduction on the two GPUs.

I'm committed....

John

Jim1348
Message 29237 - Posted: 24 Mar 2013 | 21:19:34 UTC
Last modified: 24 Mar 2013 | 21:42:59 UTC

FWIW, I have seen far fewer errors recently. (These cards continue to chug along, with their only errors being the ACEMD beta 6.49 and a Nathan long.)
http://www.gpugrid.net/results.php?hostid=146004

I hope this has helped a little.

Jim1348
Message 29758 - Posted: 8 May 2013 | 19:02:03 UTC
Last modified: 8 May 2013 | 19:32:39 UTC

Here is an updated link. The results now include a few Noelias in addition to the Nathans.
http://www.gpugrid.net/results.php?hostid=150900&offset=0&show_names=1&state=0&appid=

My two GTX 660s are each running on a virtual core of an i7-3770 (nothing overclocked), with WCG/CEP2 running on the other six cores. The cards are at 66 C at the moment. As the room heats up I will move them to a cooler part of the basement before they get to 70 C or so.

Jim1348
Message 29821 - Posted: 11 May 2013 | 13:25:25 UTC
Last modified: 11 May 2013 | 13:33:49 UTC

Maybe I should mention a few of my BOINC settings, though I don't know whether they make a difference to reliability. It is just the way I normally set it up, since it is a dedicated PC and I don't try to do other work on it (Win7 64-bit and BOINC 7.0.64 x64):

Computing allowed:
While computer is in use: enabled
Use GPU while computer is in use: enabled

On multiprocessor systems, use at most 90.00% of the processors
(this is necessary to reserve a virtual core for each of my two GTX 660s, but if you have only one card, it may not be necessary depending on the card you have and what other projects you run; just try to keep one core free for each GPU).

Leave application in memory while suspended: enabled (very important)

The other settings are pretty much default, or depend on your system (disk drive size, etc.). But I always run with lots of DRAM (8 GB or 16 GB), though that is largely because I use a ramdisk for the BOINC folder in order to run several CEP2 jobs at once. If you don't need that, then 4 GB should do, but if you don't have enough memory you will have to use virtual memory (swapping to disk), which might also cause problems.
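
For anyone who prefers a file to the preferences dialog, the settings listed above can be sketched as a local override file. This is only a sketch: it assumes the BOINC client reads a `global_prefs_override.xml` from its data directory and that the element names used below match your client version, so check the BOINC documentation before relying on it. The function name `build_prefs_override` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_prefs_override(max_ncpus_pct=90.0,
                         run_if_user_active=True,
                         run_gpu_if_user_active=True,
                         leave_apps_in_memory=True):
    """Return the text of a BOINC preferences override file.

    Element names are assumed to follow BOINC's global_prefs_override.xml
    format; verify them against your client's documentation.
    """
    root = ET.Element("global_preferences")

    def add(tag, value):
        # Each preference is a simple child element holding a text value.
        ET.SubElement(root, tag).text = value

    add("run_if_user_active", "1" if run_if_user_active else "0")
    add("run_gpu_if_user_active", "1" if run_gpu_if_user_active else "0")
    add("leave_apps_in_memory", "1" if leave_apps_in_memory else "0")
    add("max_ncpus_pct", f"{max_ncpus_pct:.6f}")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Print the XML so it can be reviewed before being saved by hand.
    print(build_prefs_override())
```

The defaults mirror the values quoted in this post (90% of processors, computing and GPU use allowed while the computer is in use, applications left in memory). The right percentage depends on your core count and number of GPUs.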

Jim1348
Message 31826 - Posted: 5 Aug 2013 | 8:24:43 UTC - in response to Message 29758.
Last modified: 5 Aug 2013 | 8:26:31 UTC

I changed BOINC around, trying the WCG 7.2.7 version (32 bit) and then going to 7.2.5 64-bit. That changed the Computer ID.
http://www.gpugrid.net/results.php?hostid=156445

It seems that some work got lost in the process.

skgiven
Volunteer moderator
Project tester
Volunteer tester
Joined: 23 Apr 09
Posts: 3967
Credit: 1,804,193,139
RAC: 501,342
Message 31827 - Posted: 5 Aug 2013 | 10:27:58 UTC - in response to Message 31826.
Last modified: 5 Aug 2013 | 10:31:11 UTC

Thanks for pointing that out - hopefully others will avoid doing the same.

Those are WCG-specific beta versions, for BOINC/WCG devs and testers.
GPUGrid uses x64 apps for Linux and x86 apps for Windows, but Boinc x86 and x64 use different folders.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Jim1348
Message 31829 - Posted: 5 Aug 2013 | 18:33:47 UTC - in response to Message 31827.

Quite right, but the scenario was a bit more complicated than I explained, mainly because I have the BOINC data folder on a ramdisk, which loses its contents if the PC is not shut down in a controlled manner. At some point after installing WCG/BOINC 7.2.7 (32-bit), the PC froze up, or at least the GPUs did (the CEP2 work on the CPU continued normally). How that happened I don't know, but I had to shut it down and lose the work.

Now that I have reinstalled a 64-bit version of BOINC (7.2.5 x64), everything is humming along normally and I hope it stays that way.

skgiven
Message 31839 - Posted: 6 Aug 2013 | 9:41:45 UTC - in response to Message 31829.

If you are still using the RAM disk, you might want to consider a backup routine for the next freeze or crash.

Jim1348
Message 31845 - Posted: 6 Aug 2013 | 13:49:37 UTC - in response to Message 31839.

Yes, it can be backed up periodically, and I might experiment with that. Saving it every couple of hours would help avoid losing an entire work unit.
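
A periodic backup of this kind could be sketched as below. This is a minimal illustration, not a tested routine for any particular ramdisk product: the paths and the function names `backup_boinc_data` and `run_periodic_backup` are hypothetical, and copying files while BOINC is writing to them may capture an inconsistent state, so a restored copy is not guaranteed to resume cleanly.

```python
import shutil
import time
from pathlib import Path


def backup_boinc_data(ramdisk_dir, backup_dir):
    """Copy the BOINC data folder from the ramdisk to persistent storage.

    Returns the number of files present in the backup afterwards.
    """
    src = Path(ramdisk_dir)
    dst = Path(backup_dir)
    # dirs_exist_ok=True (Python 3.8+) lets later runs overwrite an earlier backup.
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return sum(1 for p in dst.rglob("*") if p.is_file())


def run_periodic_backup(ramdisk_dir, backup_dir, interval_hours=2.0, max_runs=None):
    """Back up every `interval_hours`; loop forever unless max_runs is given."""
    runs = 0
    while max_runs is None or runs < max_runs:
        backup_boinc_data(ramdisk_dir, backup_dir)
        runs += 1
        if max_runs is None or runs < max_runs:
            time.sleep(interval_hours * 3600)


if __name__ == "__main__":
    # Hypothetical paths: adjust to your ramdisk and a real disk location.
    run_periodic_backup(r"R:\BOINC", r"D:\BOINC_backup", interval_hours=2.0)
```

The two-hour default follows the interval suggested in this post; a shorter interval loses less work at the cost of more disk writes.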

Jim1348
Message 37089 - Posted: 20 Jun 2014 | 13:23:32 UTC

FWIW, I have replaced my GTX 660s with two GTX 750 Ti (Maxwell) cards, at least for the summer. They have great efficiency, and run around 64 to 67 C in a basement that is about 70 to 72 F (22 C) in the summer. No problems thus far, using WinXP for maximum output (337.88 drivers).
http://www.gpugrid.net/results.php?hostid=173205

daveandton
Joined: 30 Mar 09
Posts: 9
Credit: 28,209,509
RAC: 0
Message 37344 - Posted: 21 Jul 2014 | 22:40:27 UTC - in response to Message 37089.

Hi guys. My problem is that whenever I have to shut down Ubuntu 14.04 for any reason, the WU errors out when I restart. I have a GTX 750 Ti. Has anybody heard of this before? It is very annoying. Thanks in advance. Dave

Wrend
Joined: 9 Nov 12
Posts: 50
Credit: 506,337,784
RAC: 0
Message 37754 - Posted: 29 Aug 2014 | 21:56:38 UTC - in response to Message 37344.
Last modified: 29 Aug 2014 | 22:05:01 UTC

I am not familiar with the BOINC/Ubuntu interface, if it is any different than Windows. Are you suspending the projects in BOINC first before shutting down and then resuming them when you are back up and running again?

I rarely actually shut down my computer and only restart it on occasion though, usually only for certain software updates or hardware maintenance. Therefore, I haven't actually had the need to stop crunching for GPUGrid yet beyond a project reset when I was still starting crunching and configuring settings. Will likely let you know how it goes for me in about three weeks (next planned restart at this time) if it's still an issue for you at that time.
____________
My BOINC Cruncher, Minecraft Multiserver, Mobile Device Mainframe, and Home Entertainment System/Workstation: http://www.overclock.net/lists/display/view/id/4678036#

Acey Pilot
Joined: 4 Jan 14
Posts: 4
Credit: 1,901,240,470
RAC: 0
Message 40575 - Posted: 22 Mar 2015 | 5:24:45 UTC - in response to Message 37344.

It happens to me in Windows if I lose power suddenly. If BOINC is not shut down gracefully, this happens almost every time. Indeed annoying, as work and energy are lost.
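
For planned shutdowns (it cannot help with a sudden power loss), one way to stop BOINC gracefully first is to ask the client to quit through the `boinccmd` command-line tool before powering off. The sketch below assumes `boinccmd` is installed and on the PATH, and that it can authenticate to the local client; depending on your setup it may need to be run from the BOINC data directory or given a password. The function names are made up for illustration.

```python
import subprocess


def build_quit_command(boinccmd_path="boinccmd"):
    """Return the argument list that asks the local BOINC client to quit."""
    return [boinccmd_path, "--quit"]


def quit_boinc_gracefully(boinccmd_path="boinccmd", timeout_s=60):
    """Ask the BOINC client to exit; return True if boinccmd reported success."""
    result = subprocess.run(
        build_quit_command(boinccmd_path),
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return result.returncode == 0


if __name__ == "__main__":
    # Run this before shutting the machine down, then wait for tasks to stop.
    if quit_boinc_gracefully():
        print("BOINC client was asked to quit.")
    else:
        print("boinccmd reported a problem; check that the client is running.")
```

Whether this avoids the restart errors described in this thread is untested here; it only ensures the client gets a chance to stop its tasks before the operating system goes down.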

Bjarke
Joined: 1 Mar 09
Posts: 8
Credit: 74,871,366
RAC: 0
Message 40646 - Posted: 27 Mar 2015 | 12:01:31 UTC

Memory clock speed / GDDR frequency is known to cause stability errors for some cards even with default factory settings. See message 37875.

My Quadro K4000 (GK106GL core) at factory settings was producing instability messages ("The simulation has become unstable. Terminating to avoid lock-up") in the Stderr output (see below). Too many of these errors and the WU will fail completely.

The solution was to downclock the memory using MSI Afterburner to a point where I no longer get any error messages in the Stderr output at all. I believe these errors should be eliminated completely, not merely reduced enough for the WU to finish. Since GPUGrid does not require much memory throughput, lowering the memory clock had no measurable effect on my runtime.

Stderr output
<core_client_version>7.4.36</core_client_version>
<![CDATA[
<stderr_txt>
# GPU [Quadro K4000] Platform [Windows] Rev [3301M] VERSION [60]
# SWAN Device 0 :
# Name : Quadro K4000
# ECC : Disabled
# Global mem : 3072MB
# Capability : 3.0
# PCI ID : 0000:03:00.0
# Device clock : 810MHz
# Memory clock : 2708MHz
# Memory width : 192bit
# Driver version : r340_00 : 34105
# GPU 0 : 60C
# GPU 0 : 61C
# GPU 0 : 62C
# GPU 0 : 63C
# GPU 0 : 64C
# GPU 0 : 65C
# Simulation unstable. Flag 10 value 2285
# The simulation has become unstable. Terminating to avoid lock-up
# The simulation has become unstable. Terminating to avoid lock-up (2)
# Attempting restart (step 9739000)
# GPU [Quadro K4000] Platform [Windows] Rev [3301M] VERSION [60]
# SWAN Device 0 :
# Name : Quadro K4000
# ECC : Disabled
# Global mem : 3072MB
# Capability : 3.0
# PCI ID : 0000:03:00.0
# Device clock : 810MHz
# Memory clock : 2708MHz
# Memory width : 192bit
# Driver version : r340_00 : 34105
# GPU 0 : 64C
# Time per step (avg over 2765000 steps): 6.525 ms
# Approximate elapsed time for entire WU: 81561.572 s
05:56:11 (5560): called boinc_finish

</stderr_txt>
]]>
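
To check whether a task's Stderr output contains any of these messages, the text can be scanned with a short script. The sketch below is based only on the line formats visible in the output above; other application versions may word their messages differently, and the function name `summarize_stderr` is made up for illustration.

```python
import re

# Patterns taken from the Stderr output quoted in this post.
UNSTABLE_RE = re.compile(r"simulation has become unstable", re.IGNORECASE)
RESTART_RE = re.compile(r"Attempting restart \(step (\d+)\)")
TEMP_RE = re.compile(r"#\s*GPU\s+\d+\s*:\s*(\d+)C")


def summarize_stderr(text):
    """Count instability messages, restart steps and the peak GPU temperature."""
    unstable = 0
    restart_steps = []
    temps = []
    for line in text.splitlines():
        if UNSTABLE_RE.search(line):
            unstable += 1
        restart = RESTART_RE.search(line)
        if restart:
            restart_steps.append(int(restart.group(1)))
        temp = TEMP_RE.search(line)
        if temp:
            temps.append(int(temp.group(1)))
    return {
        "unstable_messages": unstable,
        "restart_steps": restart_steps,
        "max_temp_c": max(temps) if temps else None,
    }


if __name__ == "__main__":
    import sys
    # Usage: python check_stderr.py saved_stderr.txt
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        print(summarize_stderr(handle.read()))
```

A result with zero instability messages and no restart steps corresponds to the clean output that the downclocking described in this post is aiming for.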

Walblue
Joined: 25 Feb 15
Posts: 6
Credit: 67,500
RAC: 0
Message 40783 - Posted: 8 Apr 2015 | 1:21:09 UTC

Or use an ATI Radeon card. :P
:(
But it seems they are not used here... where are the tests for ATI?
Why is this project not using all of the board's power, like other projects?
:(
It is very sad... after so much time, people still use gaming boards and then have problems, and only realise it after some checking turns up a lot of error results, guessing to see which one is correct... :/
Nvidia graphics boards don't check their computations... they made CUDA scientific boards for that.
Radeons work faster and better for this :/
Why did WCG and others stop using GPUs...
And then here, after some time, GPUs are used... and only Nvidia?
If only one were to be chosen, it should be the ATI ones. :/
Those boards are for game graphics, which don't need exact calculations,
and are made to be fast even with errors, errors...
:/

Jim1348
Message 42652 - Posted: 17 Jan 2016 | 17:47:15 UTC
Last modified: 17 Jan 2016 | 18:04:12 UTC

I am a bit perplexed by all the reported errors on the GERARD_CXCL12 work units. I haven't seen a single error, though I have not been running GPUGrid much recently due to the present shortage, not because of errors. These machines all run Win7 64-bit:

PC1 with two GTX 960s: http://www.gpugrid.net/results.php?hostid=194224&offset=0&show_names=1&state=0&appid=
PC2 with two GTX 750 Tis: http://www.gpugrid.net/results.php?hostid=187798&offset=0&show_names=1&state=0&appid=
PC3 with two GTX 750 Tis: http://www.gpugrid.net/results.php?hostid=187323&offset=0&show_names=1&state=0&appid=

Maybe the Maxwell cards just do better, but more likely the problem is due to overclocking in some form. I have found that the GTX 960s will produce "unstable machine" notices (but no errors yet) when the P2 memory clock is boosted, for example, so I just leave it at the default setting. The GTX 750 Tis will take a moderate boost on the GPU clock; I sometimes use 1348 MHz to help ensure completion of the work units within 24 hours, but get into trouble somewhere above 1400 MHz, for example.

I think people have just gotten used to the "easy" work units, and get tripped up by these harder ones. And remember, a factory overclock is still an overclock insofar as the Nvidia chip is concerned. The above cards are very minimally factory overclocked, which is how I usually buy them. You may need to reduce your clock below the factory set value. Good luck.
