Message boards : Wish list : User reset for host error count

Profile Bikermatt
Joined: 8 Apr 10
Posts: 37
Credit: 2,561,202,223
RAC: 4,498,791
Level
Phe
Message 20206 - Posted: 21 Jan 2011 | 0:47:48 UTC

Right now I have a host that cannot get tasks because it was having errors on a lot of tasks. I think one of the video cards may be bad, or I may be having a driver issue.
Either way, I will not get tasks from GPUGRID on this host for a while, which makes tracking down the problem very hard. I switched out a video card, but by the time the host starts getting tasks again I may not be around.

If I haven’t fixed the issue and the errors continue, it could be a really long time before I get tasks again, which makes fixing the problem even harder. I’m glad GPUGRID shuts down my hosts, because it saves bandwidth and alerts me that there is a problem.

What would be nice, though, is somewhere I could go to manually reset a host’s error count. It would allow troubleshooting and get hosts back online sooner once the user sees that there is a problem.

Is there any way this is possible?

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,925,696,560
RAC: 184,541
Level
His
Message 20210 - Posted: 21 Jan 2011 | 8:47:12 UTC - in response to Message 20206.

For some reason (probably system related) you started to get runaway failures.

Task ID | Work unit ID | Sent | Time reported | Status | Run time (sec) | CPU time (sec) | Claimed credit | Granted credit | Application
3590506 | 2256661 | 20 Jan 2011 21:39:05 UTC | 20 Jan 2011 21:41:40 UTC | Error while computing | 3.10 | 1.00 | 7,491.18 | --- | ACEMD2: GPU molecular dynamics v6.12 (cuda)
3590481 | 2256651 | 20 Jan 2011 21:30:06 UTC | 20 Jan 2011 21:32:57 UTC | Error while computing | 3.09 | 1.09 | 7,491.18 | --- | ACEMD2: GPU molecular dynamics v6.12 (cuda)
3590465 | 2256641 | 20 Jan 2011 21:32:57 UTC | 20 Jan 2011 21:35:42 UTC | Error while computing | 2.09 | 0.66 | 0.00 | --- | ACEMD2: GPU molecular dynamics v6.12 (cuda)
3589838 | 2256236 | 20 Jan 2011 21:58:20 UTC | 20 Jan 2011 22:02:25 UTC | Error while computing | 3.49 | 1.84 | 7,645.29 | --- | ACEMD2: GPU molecular dynamics v6.12 (cuda)
3589656 | 2256194 | 20 Jan 2011 21:44:37 UTC | 20 Jan 2011 21:47:23 UTC | Error while computing | 4.12 | 1.86 | 7,645.29 | --- | ACEMD2: GPU molecular dynamics v6.12 (cuda)
3589635 | 2256184 | 20 Jan 2011 21:41:41 UTC | 20 Jan 2011 21:44:36 UTC | Error while computing | 2.09 | 1.84 | 7,645.29 | --- | ACEMD2: GPU molecular dynamics v6.12 (cuda)
3589571 | 2256141 | 20 Jan 2011 21:35:46 UTC | 20 Jan 2011 21:39:04 UTC | Error while computing | 2.09 | 1.78 | 7,645.29 | --- | ACEMD2: GPU molecular dynamics v6.12 (cuda)
3588292 | 2255374 | 20 Jan 2011 10:16:36 UTC | 20 Jan 2011 22:01:07 UTC | Error while computing | 33,279.59 | 3,327.99 | 7,903.02 | --- | ACEMD2: GPU molecular dynamics v6.12 (cuda)
3587776 | 2254389 | 20 Jan 2011 6:57:29 UTC | 20 Jan 2011 21:30:06 UTC | Error while computing | 43,945.54 | 4,487.77 | 7,903.02 | --- | ACEMD2: GPU molecular dynamics v6.12 (cuda)

If the system could determine that your computer was restarted, then it could start resending tasks to your system. One for the BOINC developers perhaps.

For now, the alternative suggestion of letting the user reset the error count would be useful, though I would suggest it be capped (you can only pick up 1 task per GPU per hour) until a task is returned successfully.
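
To make the capped reset concrete, here is a rough Python sketch of how such a rule might behave. It is not BOINC or GPUGRID server code; `HostState`, `user_reset`, `may_send_task`, `record_sent` and `record_result` are hypothetical names invented for illustration, and the only values taken from the suggestion are "1 task per GPU per hour" and "until a successful return".

```python
from dataclasses import dataclass, field

# One task per GPU per hour, as suggested in the post.
PROBATION_INTERVAL_S = 3600


@dataclass
class HostState:
    """Hypothetical per-host record; not an actual BOINC structure."""
    gpu_count: int
    on_probation: bool = False
    # Times (in seconds) at which tasks were sent while on probation.
    sent_times: list = field(default_factory=list)


def user_reset(host: HostState) -> None:
    """User-triggered error reset: the host re-enters service, but capped."""
    host.on_probation = True
    host.sent_times.clear()


def may_send_task(host: HostState, now: float) -> bool:
    """Return True if the capped-reset rule allows sending a task now.

    The project's normal quota logic is out of scope here; a host that is
    not on probation is simply treated as unrestricted.
    """
    if not host.on_probation:
        return True
    recent = [t for t in host.sent_times if now - t < PROBATION_INTERVAL_S]
    return len(recent) < host.gpu_count


def record_sent(host: HostState, now: float) -> None:
    """Remember when a task was sent so the hourly cap can be enforced."""
    if host.on_probation:
        host.sent_times.append(now)


def record_result(host: HostState, success: bool) -> None:
    """A successful return lifts the cap; a failure leaves it in place."""
    if success:
        host.on_probation = False
        host.sent_times.clear()
```

With this rule, a two-GPU host that has just been reset could pick up two tasks straight away, would then wait until the first of those is more than an hour old, and would return to normal scheduling as soon as one task completes successfully.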

Alternatively, test work units (say around 10 min long) could be created. If the cruncher can run these, that would in itself allow new tasks to be sent under the existing system.

Profile Stoneageman
Joined: 25 May 09
Posts: 211
Credit: 14,056,438,788
RAC: 2,038,348
Level
Trp
Message 20792 - Posted: 27 Mar 2011 | 16:30:56 UTC - in response to Message 20210.

As there are quite a few rogue work units about now, the server is penalizing healthy GPUs. This can leave them idle for possibly many hours if one is unlucky enough to get two of these duff tasks in a row! The reset option discussed here needs to be looked at sooner rather than later. Less patient crunchers may quit and forget to come back, lol.

Profile skgiven
Volunteer moderator
Project tester
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,925,696,560
RAC: 184,541
Level
His
Message 20994 - Posted: 18 Apr 2011 | 8:47:03 UTC - in response to Message 20792.

Any chance of introducing a system whereby crunchers are sent an auto-generated email if their tasks continuously fail?

It could contain a suggestion to restart, plus links to the recommended driver and the FAQ.
If the system is reset or the driver updated, they could be allowed to do an error reset, permitting up to, say, 5 failures.
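
As a rough illustration of the idea (a sketch only, not existing BOINC or GPUGRID behaviour), the bookkeeping could look something like the Python below. All names are hypothetical. `NOTIFY_AFTER = 3` is an assumed threshold, since "continuously fail" is not given a number here; `RESET_ALLOWANCE = 5` comes from the "up to say 5 failures" figure. Actually sending the email is left out.

```python
from dataclasses import dataclass

NOTIFY_AFTER = 3      # assumed threshold for "continuously fail"
RESET_ALLOWANCE = 5   # "up to say 5 failures" after a user reset


@dataclass
class HostErrors:
    """Hypothetical per-host error bookkeeping."""
    consecutive_failures: int = 0
    notified: bool = False
    allowance: int = 0  # failures still tolerated after a user reset


def record_task(host: HostErrors, success: bool) -> bool:
    """Update the counters; return True if an email should be sent now."""
    if success:
        host.consecutive_failures = 0
        host.notified = False
        host.allowance = 0
        return False
    host.consecutive_failures += 1
    if host.allowance > 0:
        host.allowance -= 1
    if host.consecutive_failures >= NOTIFY_AFTER and not host.notified:
        host.notified = True  # send at most one email per failure streak
        return True
    return False


def user_error_reset(host: HostErrors) -> None:
    """User says the machine was restarted or the driver was updated."""
    host.consecutive_failures = 0
    host.notified = False
    host.allowance = RESET_ALLOWANCE


def may_send_task(host: HostErrors) -> bool:
    """Send work unless the host is in a failure streak with no allowance."""
    return host.consecutive_failures < NOTIFY_AFTER or host.allowance > 0
```

Under these assumptions a host is emailed once per failure streak, and after a reset it can fail five more tasks before work stops again; any successful task returns it to normal.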

An email could also be sent to a CA, for guidance, if needed.

Thanks,
