Message boards : Number crunching : Work unit stuck?

[BAT] tutta55
Joined: 5 Apr 07
Posts: 11
Credit: 11,175,619
RAC: 0
Message 5490 - Posted: 11 Jan 2009 | 10:10:00 UTC
Last modified: 11 Jan 2009 | 10:10:32 UTC

I have a work unit that is stuck at 48.352%. CPU time is already at 9h26 and counting, which is at least twice as long as usual. Time to completion is counting UP as well. This is the WU: http://www.gpugrid.net/workunit.php?wuid=156801

What to do? Wait? Abort?
____________

BOINC.BE: For Belgians who love the smell of glowing red cpu's in the morning
Tutta55's Lair

Stefan Ledwina
Joined: 16 Jul 07
Posts: 464
Credit: 135,911,881
RAC: 123
Message 5491 - Posted: 11 Jan 2009 | 10:41:52 UTC - in response to Message 5490.

I've never had a stuck WU, but have you already tried stopping and restarting BOINC? Or rebooting the computer to see if it continues to crunch?
____________

pixelicious.at - my little photoblog

[BAT] tutta55
Message 5493 - Posted: 11 Jan 2009 | 11:22:01 UTC - in response to Message 5491.

I already tried stop/restart. Now I also did a reboot. The situation has remained the same. CPU time counting up, but % not moving.
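
The symptom described here (CPU time still rising while % done stays fixed) can be checked mechanically. Below is a minimal Python sketch, assuming you note down (CPU seconds, percent done) pairs yourself at intervals; the function name `is_stalled` and both thresholds are hypothetical and are not part of BOINC or GPUGRID.

```python
from typing import List, Tuple


def is_stalled(samples: List[Tuple[float, float]],
               min_cpu_seconds: float = 1800.0,
               epsilon: float = 1e-3) -> bool:
    """Return True if percent done has not moved while CPU time advanced.

    samples: (cpu_seconds, percent_done) pairs in chronological order.
    min_cpu_seconds: hypothetical threshold, the amount of CPU time that
        must pass without progress before the task is called stalled.
    epsilon: tolerance used when comparing percent values.
    """
    if len(samples) < 2:
        return False
    last_cpu, last_pct = samples[-1]
    # Walk backwards to find the earliest sample with the same percent done.
    stall_start_cpu = last_cpu
    for cpu, pct in reversed(samples):
        if abs(pct - last_pct) > epsilon:
            break
        stall_start_cpu = cpu
    return (last_cpu - stall_start_cpu) >= min_cpu_seconds
```

For example, samples that sit at 48.352% while CPU time climbs from roughly 5 h to 9h26 would be flagged, whereas a task whose percentage changes between every pair of samples would not.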

[BAT] tutta55
Message 5494 - Posted: 11 Jan 2009 | 11:54:31 UTC

I finally aborted the WU. There are some strange messages in the error log. A whole series of "MDIO ERROR: illegal value: incorrect value for stepnum". Maybe that can tell the devs something?

ignasi
Joined: 10 Apr 08
Posts: 254
Credit: 16,836,000
RAC: 0
Message 5517 - Posted: 12 Jan 2009 | 9:18:22 UTC - in response to Message 5494.

This seems to be a fairly random error, as the work unit has already been finished successfully.

Thanks though,
ignasi

[BAT] tutta55
Message 5560 - Posted: 13 Jan 2009 | 0:07:46 UTC - in response to Message 5517.

I can't say for sure, since I only noticed late that % done had stopped moving, but I think it happened after a reboot of the system. I suspect that resuming the work unit made it go wrong.

[BAT] tutta55
Message 5665 - Posted: 16 Jan 2009 | 9:10:24 UTC

It happened again. A work unit with % done stuck: http://www.gpugrid.net/result.php?resultid=218594

This time I am sure it happened after a stop/restart: I had to reboot after a Windows update, and after that the WU's % done stopped moving. The error log again contains "MDIO ERROR: illegal value: incorrect value for stepnum". I have the impression something can go wrong when resuming a WU. Not in all cases though; most of the time WUs resume normally.
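
To confirm that a stuck task shows the same message, the error log text can be searched for it. The following is a rough Python sketch, assuming the log has been copied into a string (for example from the task's result page); the helper name `find_mdio_errors` is hypothetical, and only the message text itself comes from this thread.

```python
from typing import List

# Error text exactly as quoted in the post above.
MDIO_MESSAGE = "MDIO ERROR: illegal value: incorrect value for stepnum"


def find_mdio_errors(log_text: str) -> List[int]:
    """Return the 1-based line numbers in log_text that contain the message."""
    return [
        lineno
        for lineno, line in enumerate(log_text.splitlines(), start=1)
        if MDIO_MESSAGE in line
    ]
```

An empty list means the message does not appear; a long list matches the "whole series" of errors reported earlier in the thread.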

GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 14 Mar 07
Posts: 1957
Credit: 629,356
RAC: 0
Message 5789 - Posted: 19 Jan 2009 | 21:06:28 UTC - in response to Message 5665.

Do you suspend the WU first, or do you just reboot the machine while the app is running?
The state might get corrupted in the latter case.

gdf
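
If rebooting while the app is running is indeed the trigger, one precaution is to pause computation and shut the client down cleanly before restarting the machine. The sketch below is a hedged Python illustration that assumes the `boinccmd` command-line tool is installed and on the PATH; the function names are hypothetical, and nothing in this thread confirms that doing this avoids the stepnum error.

```python
import subprocess
from typing import List


def pre_reboot_commands(boinccmd: str = "boinccmd") -> List[List[str]]:
    """Build the commands that pause all work and then stop the client."""
    return [
        [boinccmd, "--set_run_mode", "never"],  # suspend all computation
        [boinccmd, "--quit"],                   # ask the client to exit
    ]


def run_pre_reboot(dry_run: bool = True) -> List[str]:
    """Print (dry run) or execute the commands; return them as strings."""
    issued = []
    for cmd in pre_reboot_commands():
        issued.append(" ".join(cmd))
        if dry_run:
            print("would run:", issued[-1])
        else:
            subprocess.run(cmd, check=True)
    return issued
```

Calling `run_pre_reboot()` with the default `dry_run=True` only prints the commands. Since the run mode may persist across restarts, computation would need to be re-enabled afterwards, for instance with `boinccmd --set_run_mode auto`.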
