
Message boards : Wish list : More resilient WU processing

Joined: 7 Apr 15
Posts: 33
Credit: 1,201,157,375
RAC: 0
Message 49221 - Posted: 29 Mar 2018 | 10:56:09 UTC

Hi Everyone,

Recently, a faulty Titan GPU made my system unstable: it started crashing regularly, which resulted in calculation errors on the GPUGrid WUs.
I have multiple GPUs (GTX 1070 & 1070 Ti) in my system.

If my system crashes because of this faulty card, the WU is automatically and totally lost due to these calculation errors.

This is in stark contrast with CPU WUs (Rosetta, LHC, World Community Grid, ClimatePrediction), which seem to recover fully: they restart their calculations from a previously saved intermediate point and calculate their way to successful completion. ClimatePrediction, for example, has WUs running for 36 hours on end.
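
To illustrate the restart-from-a-saved-point behaviour described above, here is a minimal Python sketch of periodic checkpointing. It is not GPUGrid's or BOINC's actual code: the function names, the JSON checkpoint format, the `crash_at` parameter, and the toy workload (summing step numbers) are all assumptions made for the example.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state atomically so a crash mid-write cannot
    corrupt the last good checkpoint (hypothetical helper)."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace swaps the file in one step on the same filesystem.
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh one if none is usable."""
    try:
        with open(path) as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return {"step": 0, "total": 0}


def run_work_unit(path, n_steps, every=100, crash_at=None):
    """Toy work unit: sums the step numbers, checkpointing every
    `every` steps. `crash_at` simulates a system crash for testing."""
    state = load_checkpoint(path)
    for step in range(state["step"], n_steps):
        if crash_at is not None and step == crash_at:
            raise RuntimeError("simulated crash")
        state["total"] += step
        state["step"] = step + 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

With this scheme, a crash costs at most the work done since the last checkpoint: calling `run_work_unit` again on the same checkpoint path resumes from the saved step instead of from zero.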

Long runs take between 6 and 12 hours. If you're 5h30 or further into a WU, that's a massive loss of time and power.

Could this be developed for GPUGrid too, please?

I, and I guess many other crunchers, would be very grateful if you could introduce this functionality soon.

Many thanks in advance!


Joined: 30 Apr 13
Posts: 81
Credit: 1,064,621,611
RAC: 0
Message 49381 - Posted: 2 May 2018 | 12:14:17 UTC - in response to Message 49221.

I second BelgianEnthousiast's suggestion. There is a periodic checkpoint of some sort, but it doesn't seem to do much good for post-crash recovery of completed work.


Betting Slip
Joined: 5 Jan 09
Posts: 670
Credit: 2,498,095,550
RAC: 0
Message 49385 - Posted: 2 May 2018 | 18:11:02 UTC - in response to Message 49221.

If your card was faulty, it wouldn't matter how many times the WU restarted from the last good point; it would still end in failure.

