Message boards : Number crunching : Errors resuming after power outage

Jacob Klein
Joined: 11 Oct 08
Posts: 1060
Credit: 1,125,751,189
RAC: 1,298,853
Message 40492 - Posted: 17 Mar 2015 | 16:54:40 UTC
Last modified: 17 Mar 2015 | 17:01:00 UTC

My computer recently restarted unexpectedly. It may have been a brief power outage, though I am not 100% sure.

When it restarted and BOINC tried to load its tasks, the GPUGrid tasks ran into problems. Loading each task triggered a TDR (a Windows Timeout Detection and Recovery driver reset) and then a task failure, for all 6 of my in-progress tasks.

They all resulted in:

Server state: Over
Outcome: Computation error
Client state: Compute error
Exit status: -52 (0xffffffffffffffcc) Unknown error number
Validate state: Invalid
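For readers puzzled by the hex value next to the exit status: it is the two's-complement bit pattern of -52, printed at 64-bit width here and at 32-bit width (0xffffffcc) in the stderr report later in the thread. A minimal Python sketch of that arithmetic follows; the helper names `to_unsigned_hex` and `to_signed` are made up for illustration and are not part of BOINC or the GPUGrid application.

```python
def to_unsigned_hex(value: int, bits: int) -> str:
    """Format a signed integer as the hex of its two's-complement bit pattern."""
    mask = (1 << bits) - 1
    return hex(value & mask)


def to_signed(value: int, bits: int) -> int:
    """Interpret an unsigned bit pattern as a two's-complement signed integer."""
    mask = (1 << bits) - 1
    value &= mask
    sign_bit = 1 << (bits - 1)
    return value - (1 << bits) if value & sign_bit else value


# The same exit status, shown at the two widths that appear in this thread.
print(to_unsigned_hex(-52, 64))   # 0xffffffffffffffcc
print(to_unsigned_hex(-52, 32))   # 0xffffffcc
print(to_signed(0xFFFFFFCC, 32))  # -52
```

So the two hex strings in the reports describe the same exit code; only the integer width used for printing differs.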


And they all had the following at the bottom of their stderr.txt:
SWAN : FATAL Unable to load module .mshake_kernel.cu. (702)


Can anything be done so that GPUGrid GPU tasks can be restarted and resumed in this scenario?

e13s16_e1s33f90-NOELIA_1mgx1-2-4-RND0021_0
http://www.gpugrid.net/result.php?resultid=13982550
e26s10_e20s4f232-SDOERR_villinpub2-0-1-RND0381_3
http://www.gpugrid.net/result.php?resultid=13983070
e15s46_e1s400f24-NOELIA_1mgx2-1-4-RND5323_0
http://www.gpugrid.net/result.php?resultid=13983199
2Mgx471-NOELIA_INSP-11-12-RND1315_0
http://www.gpugrid.net/result.php?resultid=13983283
e12s13_e4s36f65-NOELIA_1mgx1-3-4-RND7924_0
http://www.gpugrid.net/result.php?resultid=13983393
e15s46_e1s386f84-NOELIA_1mgx1-1-4-RND3709_0
http://www.gpugrid.net/result.php?resultid=13983801

Note: On this computer, I run 2 tasks per GPU on each of my 3 GPUs.

Dayle Diamond
Joined: 5 Dec 12
Posts: 68
Credit: 1,046,760,203
RAC: 770,289
Message 40507 - Posted: 18 Mar 2015 | 15:22:50 UTC - in response to Message 40492.

Seconded.

My neighborhood is a little, uh, neglected. During warm weather, the AC load becomes too much for the system and the whole block shuts off. I lost about six hours of crunching yesterday due to periodic power outages.

Would hate for this to be a regular issue all summer.

Jacob Klein
Message 40536 - Posted: 20 Mar 2015 | 2:05:49 UTC

MJH: Any chance you might look at this problem?

Trotador
Joined: 25 Mar 12
Posts: 83
Credit: 1,067,720,092
RAC: 124,859
Message 40729 - Posted: 31 Mar 2015 | 18:15:04 UTC

I have two validation errors after an abrupt power-off of a host with two GPUs. The WUs resumed from their checkpoints and completed, but ended in validation errors.

It was my fault: I unplugged the host unintentionally while tinkering around. How dumb! Two of those 255 Kpoints ones, too. It hurts!

Just reporting.

Bedrich Hajek
Joined: 28 Mar 09
Posts: 335
Credit: 3,802,267,109
RAC: 868,855
Message 40730 - Posted: 31 Mar 2015 | 22:17:37 UTC - in response to Message 40729.

I had the same problem 2 days ago. The WUs either crash immediately, or they continue normally and then you get the validation error when they upload. The immediate crash is not a new problem, but the validation error is.


Duane Bong
Joined: 21 Feb 10
Posts: 11
Credit: 191,437,409
RAC: 455,074
Message 47003 - Posted: 18 Apr 2017 | 7:39:20 UTC
Last modified: 18 Apr 2017 | 7:40:48 UTC

I just had a WU end in a Computation error after 23 hours of crunching because of a power failure. It happens to me every now and then, especially during rainy seasons when thunderstorms trip the power in my house. Over the years, I've probably lost 30-40 half-completed WUs this way.

It is unfortunate that GPUGrid doesn't resume from the last checkpoint and instead errors out, so everything is lost. All the other projects I run, like P95 or WCG, simply resume from the last saved checkpoint after a power failure.

Is this something that the developers can improve on?

Erich56
Joined: 1 Jan 15
Posts: 369
Credit: 1,614,528,652
RAC: 2,747,341
Message 47005 - Posted: 18 Apr 2017 | 9:01:03 UTC

Duane,

somewhere in the PC settings you see "disc caching" - this should be unchecked.

Duane Bong
Message 47006 - Posted: 18 Apr 2017 | 12:17:35 UTC - in response to Message 47005.
Last modified: 18 Apr 2017 | 12:19:38 UTC

Duane,
somewhere in the PC settings you see "disc caching" - this should be unchecked.


Thanks for the suggestion. I checked in Device Manager under the drive > Policies, and the Write Caching box is already unchecked, yet I still lost the WU after the power outage. In any case, I don't think corruption of the checkpoint data is the issue.

This is the report I see for the WU:
SWAN : FATAL Unable to load module .mshake_kernel.cu. (719)

It seems that after the power failure and reboot, the task hits some kernel error?
It is exactly the same problem that the thread starter reported.
Yet the next WU in the queue starts crunching fine after that.

<core_client_version>7.6.33</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -52 (0xffffffcc)
</message>
<stderr_txt>
# GPU [GeForce GTX 960] Platform [Windows] Rev [3212] VERSION [80]
# SWAN Device 0 :
# Name : GeForce GTX 960
# ECC : Disabled
# Global mem : 2048MB
# Capability : 5.2
# PCI ID : 0000:28:00.0
# Device clock : 1291MHz
# Memory clock : 3600MHz
# Memory width : 128bit
# Driver version : r381_64 : 38165
# GPU 0 : 59C
# GPU 0 : 60C
# GPU [GeForce GTX 960] Platform [Windows] Rev [3212] VERSION [80]
# SWAN Device 0 :
# Name : GeForce GTX 960
# ECC : Disabled
# Global mem : 2048MB
# Capability : 5.2
# PCI ID : 0000:28:00.0
# Device clock : 1291MHz
# Memory clock : 3600MHz
# Memory width : 128bit
# Driver version : r381_64 : 38165
# GPU 0 : 57C
Can't acquire lockfile - exiting
No heartbeat from core client for 30 sec - exiting
# GPU [GeForce GTX 960] Platform [Windows] Rev [3212] VERSION [80]
# SWAN Device 0 :
# Name : GeForce GTX 960
# ECC : Disabled
# Global mem : 2048MB
# Capability : 5.2
# PCI ID : 0000:28:00.0
# Device clock : 1291MHz
# Memory clock : 3600MHz
# Memory width : 128bit
# Driver version : r381_64 : 38165
# GPU 0 : 58C
# GPU [GeForce GTX 960] Platform [Windows] Rev [3212] VERSION [80]
# SWAN Device 0 :
# Name : GeForce GTX 960
# ECC : Disabled
# Global mem : 2048MB
# Capability : 5.2
# PCI ID : 0000:28:00.0
# Device clock : 1291MHz
# Memory clock : 3600MHz
# Memory width : 128bit
# Driver version : r381_64 : 38165
# GPU 0 : 57C
# GPU 0 : 58C
# GPU 0 : 59C
# GPU 0 : 60C
# GPU 0 : 61C
# GPU [GeForce GTX 960] Platform [Windows] Rev [3212] VERSION [80]
# SWAN Device 0 :
# Name : GeForce GTX 960
# ECC : Disabled
# Global mem : 2048MB
# Capability : 5.2
# PCI ID : 0000:28:00.0
# Device clock : 1291MHz
# Memory clock : 3600MHz
# Memory width : 128bit
# Driver version : r381_64 : 38165
SWAN : FATAL Unable to load module .mshake_kernel.cu. (719)

</stderr_txt>
]]>
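
When comparing reports like the one above, it can help to pull the fatal line and its numeric code out of the stderr text automatically. The sketch below is only an illustration: the function name `find_swan_fatals` is hypothetical, and the lookup table rests on the assumption (not confirmed anywhere in this thread) that the number in parentheses is a CUDA driver API result code. Under that assumption, 702 is `CUDA_ERROR_LAUNCH_TIMEOUT`, which would be consistent with the TDR reported in the first post, and 719 is `CUDA_ERROR_LAUNCH_FAILED`.

```python
import re

# ASSUMPTION: the trailing number is a CUDA driver API result code (CUresult).
# Only the two codes seen in this thread are listed.
CUDA_DRIVER_ERRORS = {
    702: "CUDA_ERROR_LAUNCH_TIMEOUT",
    719: "CUDA_ERROR_LAUNCH_FAILED",
}

# Matches lines such as:
#   SWAN : FATAL Unable to load module .mshake_kernel.cu. (719)
FATAL_RE = re.compile(
    r"SWAN\s*:\s*FATAL\s+(?P<msg>.*?)\s*\((?P<code>\d+)\)\s*$"
)


def find_swan_fatals(stderr_text):
    """Return (message, code, assumed_name) for every SWAN FATAL line."""
    results = []
    for line in stderr_text.splitlines():
        match = FATAL_RE.search(line)
        if match:
            code = int(match.group("code"))
            name = CUDA_DRIVER_ERRORS.get(code, "unknown")
            results.append((match.group("msg"), code, name))
    return results


sample = """# GPU 0 : 57C
Can't acquire lockfile - exiting
SWAN : FATAL Unable to load module .mshake_kernel.cu. (719)
"""

for msg, code, name in find_swan_fatals(sample):
    print(code, name, "-", msg)
```

If the assumption about the code's meaning is wrong, the parsing still works; only the name lookup should be ignored.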
