Message boards : Number crunching : Crashes running gerard wus

wiyosaya
Joined: 22 Nov 09
Posts: 114
Credit: 589,114,683
RAC: 0
Level
Lys
Message 40183 - Posted: 16 Feb 2015 | 19:07:48 UTC
Last modified: 16 Feb 2015 | 19:08:13 UTC

Over the weekend I tried to run several WUs that were not the "very long" type. Each of them was a Gerard WU. I do not know whether this is a bug, but each of the three I ran errored out on my GTX460 and crashed my PC. The last one somehow deleted the GPUGRID project from my list of active projects, so when I rejoined with this machine it reset the project and "abandoned" another Gerard WU.

Here are the links for the WUs that crashed:
http://www.gpugrid.net/result.php?resultid=13828124
http://www.gpugrid.net/result.php?resultid=13826449
http://www.gpugrid.net/result.php?resultid=13826422

I note that 13828124 also erred on a Titan Black.

Thoughts / comments?

poppageek
Joined: 4 Jul 09
Posts: 76
Credit: 114,610,402
RAC: 0
Level
Cys
Message 40188 - Posted: 16 Feb 2015 | 20:33:09 UTC

I had one that was marked Invalid for me and Error while Computing for another user. It is now active on a third machine.

http://www.gpugrid.net/workunit.php?wuid=10656205

pvh
Joined: 17 Mar 10
Posts: 23
Credit: 1,173,824,416
RAC: 0
Level
Met
Message 40196 - Posted: 17 Feb 2015 | 16:06:41 UTC
Last modified: 17 Feb 2015 | 16:10:02 UTC

I too see a very high failure rate with GERARD_CXCL12 tasks (something like 70% fail). Other tasks run fine (well, there is the occasional failure with NOELIA tasks...). I also had one GERARD_CXCL12 task stuck for some 40 hours or so. This is clearly a very unstable batch.

Edit: to be precise, the GERARD_BENTRYP tasks appear to be OK; this only concerns GERARD_CXCL12 tasks.

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Message 40198 - Posted: 17 Feb 2015 | 18:24:49 UTC
Last modified: 17 Feb 2015 | 18:26:40 UTC

I don't think there is anything wrong with the GERARD_CXCL12 work units, they just push the card hard.

On four GTX 750 Ti's (on two machines), I have had 18 successes and 0 failures.

On one GTX 660 Ti, I have had 8 successes and 0 failures.

On one GTX 660, I have had 12 successes and 1 failure. And the one that failed completed successfully on another machine.

All of these cards run very cool, but on the GTX 660 I have not bothered to increase the power limit above 100% using Nvidia Inspector as I usually do. I see that the power routinely bumps up against this limit, which is a sure sign that the GERARD_CXCL12 work units are pushing it hard (all results under Win7 64-bit).

So my usual advice applies: don't overclock your cards, etc., etc.

pvh
Joined: 17 Mar 10
Posts: 23
Credit: 1,173,824,416
RAC: 0
Level
Met
Message 40199 - Posted: 17 Feb 2015 | 20:02:24 UTC

My card runs at 79C, well within the tolerance limits.

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Message 40200 - Posted: 17 Feb 2015 | 20:11:56 UTC - in response to Message 40199.
Last modified: 17 Feb 2015 | 20:18:42 UTC

"My card runs at 79C, well within the tolerance limits."

It is not just temperature. Overclocking (even factory overclocking) can, and frequently does, cause errors.

Also, if you bump up against the power limit, the card will automatically limit the current to the GPU, in order to protect it from excessive temperature. That limit can also cause errors, since the GPU is not getting the current it needs to run at full speed. Your choices then are to increase the power limit (if the card is not running too hot), or reduce the clock. Usually reducing the GPU clock will eliminate the errors, but sometimes the memory clock needs to be reduced as well.

By the way, sometimes it is all of the above. Each card is different.
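
One way to see whether a card is bumping against its power limit is to read the figures from nvidia-smi. The following is only a minimal sketch, assuming nvidia-smi is on the PATH and that the driver reports these fields for your card (older cards may return "[Not Supported]"). The function names and the 95% threshold are assumptions made for illustration, not anything GPUGRID or Nvidia prescribes.

```python
import csv
import io
import subprocess

# Field names as listed by `nvidia-smi --help-query-gpu`.
QUERY_FIELDS = "name,power.draw,power.limit,clocks.sm,temperature.gpu"


def _to_float(value):
    """Return a float, or None for placeholders such as "[Not Supported]"."""
    try:
        return float(value)
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(record) != 5:
            continue  # skip blank or malformed lines
        name, draw, limit, sm_clock, temp = (field.strip() for field in record)
        rows.append({
            "name": name,
            "power_draw_w": _to_float(draw),
            "power_limit_w": _to_float(limit),
            "sm_clock_mhz": _to_float(sm_clock),
            "temp_c": _to_float(temp),
        })
    return rows


def near_power_limit(row, threshold=0.95):
    """True if draw is within `threshold` of the limit; None if unreported.

    The 0.95 default is an arbitrary illustrative value.
    """
    draw, limit = row["power_draw_w"], row["power_limit_w"]
    if draw is None or limit is None:
        return None
    return draw >= threshold * limit


def query_gpus():
    """Run nvidia-smi once and return the parsed rows (needs an Nvidia driver)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)
```

Calling `query_gpus()` while a task is running and passing each row to `near_power_limit` gives a rough indication. A single reading proves little, so it is worth sampling several times over the course of a work unit.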

biodoc
Joined: 26 Aug 08
Posts: 183
Credit: 6,772,414,375
RAC: 326,336
Level
Tyr
Message 40202 - Posted: 17 Feb 2015 | 21:51:05 UTC

For the GERARD_CXCL12 tasks on my machines:

32 successes on 2 GTX 980s (no failures)
18 successes on a GTX 780 Ti (no failures)

[CSF] Thomas H.V. DUPONT
Joined: 20 Jul 14
Posts: 732
Credit: 115,270,366
RAC: 92,842
Level
Cys
Message 40203 - Posted: 18 Feb 2015 | 7:47:41 UTC

GERARD_CXCL12_3GG_CGENFF2 4 OK (no failure)
GERARD_CXCL12_3GG_CGENFF3 4 OK (no failure)
GERARD_CXCL12_3GG_FX_LigAssay21 4 OK (no failure)
GERARD_CXCL12_LIG4_CGENFF2 4 OK (no failure)

"I don't think there is anything wrong with the GERARD_CXCL12 work units, they just push the card hard."

+100

"It is not just temperature. Overclocking (even factory overclocking) can, and frequently does, cause errors."

+100
____________
[CSF] Thomas H.V. Dupont
Founder of the team CRUNCHERS SANS FRONTIERES 2.0
www.crunchersansfrontieres

RaymondFO*
Joined: 22 Nov 12
Posts: 72
Credit: 14,040,706,346
RAC: 0
Level
Trp
Message 40214 - Posted: 19 Feb 2015 | 18:20:10 UTC - in response to Message 40203.
Last modified: 19 Feb 2015 | 18:37:27 UTC

These all failed immediately:

http://www.gpugrid.net/result.php?resultid=13853350
http://www.gpugrid.net/result.php?resultid=13853243
http://www.gpugrid.net/result.php?resultid=13853233
http://www.gpugrid.net/result.php?resultid=13849021
http://www.gpugrid.net/result.php?resultid=13848998
http://www.gpugrid.net/result.php?resultid=13848977
http://www.gpugrid.net/result.php?resultid=13848871

Any ideas?

GoodFodder
Joined: 4 Oct 12
Posts: 53
Credit: 333,467,496
RAC: 0
Level
Asp
Message 40278 - Posted: 27 Feb 2015 | 16:35:36 UTC

Likewise, I have been having problems with GERARD_CXCL12 (XP x86, driver: 344.65).

Error SWAN : FATAL : Cuda driver error 702 in file 'swanlibnv2.cpp' in line 1965.

Going by this thread, it looks like Gerard is stressing the cards hard, so I am going to try downclocking.
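
For what it is worth, code 702 in the CUDA driver API is CUDA_ERROR_LAUNCH_TIMEOUT, which means a kernel ran too long and was stopped. The sketch below pulls the code, file, and line out of a message like the one quoted above and looks the code up in a small table. It is only an illustration: `parse_swan_error` and `KNOWN_CODES` are hypothetical names, the table is deliberately incomplete, and the pattern is written against this one message, so other SWAN messages may be formatted differently.

```python
import re

# A few CUDA driver API result codes (CUresult); intentionally not exhaustive.
KNOWN_CODES = {
    2: "CUDA_ERROR_OUT_OF_MEMORY",
    701: "CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES",
    702: "CUDA_ERROR_LAUNCH_TIMEOUT",
    999: "CUDA_ERROR_UNKNOWN",
}

# Pattern based solely on the single error line quoted in this thread.
SWAN_ERROR = re.compile(
    r"Cuda driver error (\d+) in file '([^']+)' in line (\d+)"
)


def parse_swan_error(line):
    """Return the code, its name, source file, and line number, or None."""
    match = SWAN_ERROR.search(line)
    if match is None:
        return None
    code = int(match.group(1))
    return {
        "code": code,
        "name": KNOWN_CODES.get(code, "unrecognized"),
        "file": match.group(2),
        "line": int(match.group(3)),
    }
```

Running this over the stderr text of several failed results would show whether they all fail with the same code, which helps separate a consistent timeout from scattered, unrelated errors.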
