Message boards : Multicore CPUs : "Hanging" WU?

Stefan Ledwina
Volunteer moderator
Project tester
Volunteer tester
Joined: 16 Jul 07
Posts: 464
Credit: 27,175,436
RAC: 106,795
Message 8708 - Posted: 22 Apr 2009 | 3:41:05 UTC

Well, I have one of the new CELLGA_SHORT WUs running for over one day and 5 hours now, still showing 0% done. Is it safe to assume the CPU time is wasted and the WU will never complete? ;-) Should I abort it? It's this one - http://www.gpugrid.net/result.php?resultid=555615
____________

pixelicious.at - my little photoblog

Toni
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 9 Dec 08
Posts: 584
Credit: 4,273,184
RAC: 0
Message 8717 - Posted: 22 Apr 2009 | 12:58:58 UTC - in response to Message 8708.
Last modified: 22 Apr 2009 | 13:46:28 UTC

Yes, please abort it. It was expected to run for roughly 2 hours. Thanks for reporting the problem!

Stefan Ledwina
Volunteer moderator
Project tester
Volunteer tester
Joined: 16 Jul 07
Posts: 464
Credit: 27,175,436
RAC: 106,795
Message 8725 - Posted: 22 Apr 2009 | 14:58:35 UTC - in response to Message 8717.

Ok, thanks! Aborted it now after 1 day and 16 hours... My nice credits... ;-)

Stefan Ledwina
Volunteer moderator
Project tester
Volunteer tester
Joined: 16 Jul 07
Posts: 464
Credit: 27,175,436
RAC: 106,795
Message 8727 - Posted: 22 Apr 2009 | 15:49:26 UTC

Maybe it wasn't a problem with the task but with my PS3.
Now I got a TONI_CELLGA task that also seemed to hang because it showed no progress after 20 minutes, but after a reboot it is running fine now...

Stefan Ledwina
Volunteer moderator
Project tester
Volunteer tester
Joined: 16 Jul 07
Posts: 464
Credit: 27,175,436
RAC: 106,795
Message 8732 - Posted: 22 Apr 2009 | 17:25:07 UTC

Hmmm, after 20 minutes the next task was hanging, this time a TONI_CELLGA_MED - http://www.gpugrid.net/result.php?resultid=565176. I aborted it, and will try another task...

GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 14 Mar 07
Posts: 1894
Credit: 629,356
RAC: 0
Message 8817 - Posted: 24 Apr 2009 | 7:14:49 UTC - in response to Message 8732.

If a workunit hangs, you can just try restarting the machine.
The problem is usually not with the WU itself, but that your processor is still reserved by another application.

gdf

Stefan Ledwina
Volunteer moderator
Project tester
Volunteer tester
Joined: 16 Jul 07
Posts: 464
Credit: 27,175,436
RAC: 106,795
Message 8819 - Posted: 24 Apr 2009 | 8:29:29 UTC

Thanks Gianni, but I think I just found the problem...

I just started my PS3 again, started BOINC and got a hanging task again. It ran for 2:37 minutes, and then the CPU time stopped advancing, with 0% progress.

When I started BOINC it ran the CPU benchmarks, and a few minutes later it ran them again. The problem is that boincmgr showed "suspending computation - running CPU benchmarks", but "top" still showed 154% CPU for cellmd2_5.03_po. Another thing I noticed was the time shown in boincmgr: at first it was wrong, and when the second benchmark run started it showed the right time (see screenshot). So I don't know whether the task hung because it wasn't suspended during the benchmarks or because the system time changed. The weird thing is that NTP is turned on, so the system time should always be right...

I think I'll try to delete the whole boinc dir and redownload BOINC and if that doesn't help I'll try a new YDL 6.1 install...
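
Editor's sketch: the mismatch described above (the manager reports the task as suspended while "top" still shows it using CPU) could be checked roughly as follows. This is only an illustration under stated assumptions: the `TaskSample` record, the `find_runaways` helper and the 5% idle threshold are hypothetical and not part of BOINC, and the sample values are simply the ones quoted in the post.

```python
from dataclasses import dataclass


@dataclass
class TaskSample:
    """One observation of a task (hypothetical record, not a BOINC structure)."""
    name: str           # process name as shown by top
    client_state: str   # what the BOINC manager reports, e.g. "suspended"
    cpu_percent: float  # CPU usage observed for the process


def find_runaways(samples, idle_threshold=5.0):
    """Return names of tasks reported as suspended that still use CPU.

    idle_threshold is an assumed cut-off in percent; a truly suspended
    process should sit well below it.
    """
    return [
        s.name
        for s in samples
        if s.client_state == "suspended" and s.cpu_percent > idle_threshold
    ]


samples = [
    TaskSample("cellmd2_5.03_po", "suspended", 154.0),  # values from the post
    TaskSample("some_other_task", "suspended", 0.0),    # made-up comparison row
]
print(find_runaways(samples))
```

A check like this only flags the symptom; it does not say whether the benchmarks or the system time change caused it.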



JaRski-S60R
Joined: 31 Dec 07
Posts: 27
Credit: 178,113,194
RAC: 104,261
Message 8822 - Posted: 24 Apr 2009 | 10:04:27 UTC

I have the same problem:

http://www.gpugrid.net/workunit.php?wuid=399536
http://www.gpugrid.net/workunit.php?wuid=405292

Lots of hours and energy wasted again.
I won't request new WUs for a while and will only crunch for yoyo.

Toni
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 9 Dec 08
Posts: 584
Credit: 4,273,184
RAC: 0
Message 8823 - Posted: 24 Apr 2009 | 10:46:39 UTC - in response to Message 8822.
Last modified: 24 Apr 2009 | 11:06:00 UTC

I'm trying to figure out the pattern behind those failures.

These new "short" WUs, called "SHORT" and "MED", are expected to run for <3 hours and grant approximately 300 credits. Another difference with respect to other WUs is that they generate a largish result file (*_4, approx 16M).

The WUs crunch correctly on most machines - hopefully we'll be able to reproduce the problem.

Stefan Ledwina
Volunteer moderator
Project tester
Volunteer tester
Joined: 16 Jul 07
Posts: 464
Credit: 27,175,436
RAC: 106,795
Message 8824 - Posted: 24 Apr 2009 | 11:10:24 UTC - in response to Message 8823.

Thanks for looking into the problem Toni!

I previously deleted the BOINC directory, redownloaded BOINC and got a GRA_US task. This one seems to be running OK now. It's almost at 10% after 4 hours of crunching...


Toni
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 9 Dec 08
Posts: 584
Credit: 4,273,184
RAC: 0
Message 8827 - Posted: 24 Apr 2009 | 11:29:59 UTC - in response to Message 8824.
Last modified: 24 Apr 2009 | 12:02:17 UTC

Thanks to you for reporting the symptoms: "runaway" processes may in fact be related to anomalous task suspend/resume, triggered by the CPU benchmarks.

To relieve the problem, I'm postponing some WUs and lowering the FP bound, which will hopefully halt "runaway" jobs before they hog the CPU for days (the changes will take time to propagate, though).
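
Editor's sketch: the "FP bound" mentioned above presumably refers to BOINC's per-workunit bound on floating-point operations (`rsc_fpops_bound`), which the client uses to abort tasks that run far too long. A simplified model of that rule, assuming the time limit is just the bound divided by the host's benchmarked speed, looks like this. All numbers are hypothetical apart from the 29-hour runtime taken from the first post of the thread.

```python
def max_runtime_seconds(fpops_bound, host_flops):
    """Longest allowed runtime under the simplified rule: bound / host speed."""
    return fpops_bound / host_flops


def should_abort(elapsed_seconds, fpops_bound, host_flops):
    """True once a task has run longer than the bound allows."""
    return elapsed_seconds > max_runtime_seconds(fpops_bound, host_flops)


HOST_FLOPS = 1e9       # hypothetical benchmark result: 1 GFLOPS
FPOPS_BOUND = 4.32e13  # hypothetical bound, equal to 12 hours at 1 GFLOPS

limit_hours = max_runtime_seconds(FPOPS_BOUND, HOST_FLOPS) / 3600
print(f"limit: {limit_hours:.1f} h")

# A task stuck for 1 day and 5 hours would be stopped under this bound,
# while a task at the expected ~2 hours would not.
print(should_abort(29 * 3600, FPOPS_BOUND, HOST_FLOPS))
print(should_abort(2 * 3600, FPOPS_BOUND, HOST_FLOPS))
```

Lowering the bound shortens the limit proportionally, which is why it can stop "runaway" jobs earlier, at the cost of also aborting legitimately slow hosts if it is set too tight.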

JaRski-S60R
Joined: 31 Dec 07
Posts: 27
Credit: 178,113,194
RAC: 104,261
Message 8908 - Posted: 25 Apr 2009 | 13:26:40 UTC - in response to Message 8827.
Last modified: 25 Apr 2009 | 13:28:45 UTC

Hi TG,
No benchmark was running between the start and the time I cancelled the WU. Below is the log from the moment I started it until I cancelled it.
I hope you can fix it, or at least build in a maximum running time for certain WUs.
I will follow the forums to see if things get better, and I'll be back then.


do 23 apr 2009 18:27:40 MDT|GPUGRID|Starting A8-TONI_CELLGA_MED_4-6-40-RND6539_1
do 23 apr 2009 18:27:40 MDT|GPUGRID|Starting task A8-TONI_CELLGA_MED_4-6-40-RND6539_1 using cellmd2 version 503
do 23 apr 2009 23:27:22 MDT|GPUGRID|Sending scheduler request: Requested by project
do 23 apr 2009 23:27:22 MDT|GPUGRID|(not requesting new work or reporting completed tasks)
do 23 apr 2009 23:27:27 MDT|GPUGRID|Scheduler RPC succeeded [server version 607]
do 23 apr 2009 23:27:27 MDT|GPUGRID|Deferring communication for 31 sec
do 23 apr 2009 23:27:27 MDT|GPUGRID|Reason: requested by project
vr 24 apr 2009 04:27:28 MDT|GPUGRID|Sending scheduler request: Requested by project
vr 24 apr 2009 04:27:28 MDT|GPUGRID|(not requesting new work or reporting completed tasks)
vr 24 apr 2009 04:27:34 MDT|GPUGRID|Scheduler RPC succeeded [server version 607]
vr 24 apr 2009 04:27:34 MDT|GPUGRID|Deferring communication for 31 sec
vr 24 apr 2009 04:27:34 MDT|GPUGRID|Reason: requested by project
vr 24 apr 2009 09:27:34 MDT|GPUGRID|Sending scheduler request: Requested by project
vr 24 apr 2009 09:27:34 MDT|GPUGRID|(not requesting new work or reporting completed tasks)
vr 24 apr 2009 09:27:44 MDT|GPUGRID|Scheduler RPC succeeded [server version 607]
vr 24 apr 2009 09:27:44 MDT|GPUGRID|Deferring communication for 31 sec
vr 24 apr 2009 09:27:44 MDT|GPUGRID|Reason: requested by project
vr 24 apr 2009 11:05:05 MDT|GPUGRID|Deferring communication for 1 min 0 sec
vr 24 apr 2009 11:05:05 MDT|GPUGRID|Reason: Unrecoverable error for result A8-TONI_CELLGA_MED_4-6-40-RND6539_1 (aborted by user)
vr 24 apr 2009 11:05:05 MDT|GPUGRID|Computation for task A8-TONI_CELLGA_MED_4-6-40-RND6539_1 finished
vr 24 apr 2009 11:05:05 MDT|yoyo@home|Resuming task ogr_090407202003_15_0 using crunch version 211
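
Editor's sketch: the log above uses Dutch day abbreviations ("do" is Thursday, "vr" is Friday) and "|" as the field separator. Assuming every line has the shape `day dd mon yyyy hh:mm:ss TZ|project|message`, the wasted runtime can be computed as below. The `parse_line` helper and the month table are illustrative, not part of BOINC, and the 12-hour "PM" format that appears elsewhere in this thread is not handled.

```python
from datetime import datetime

# English month abbreviations plus the Dutch ones that differ (mrt, mei, okt).
MONTHS = {
    "jan": 1, "feb": 2, "mar": 3, "mrt": 3, "apr": 4, "may": 5, "mei": 5,
    "jun": 6, "jul": 7, "aug": 8, "sep": 9, "oct": 10, "okt": 10,
    "nov": 11, "dec": 12,
}


def parse_line(line):
    """Split one log line into (timestamp, project, message)."""
    stamp, project, message = line.split("|", 2)
    # The day name and time zone are ignored; both lines share one zone.
    _day, dom, mon, year, clock, _tz = stamp.split()
    hour, minute, second = (int(part) for part in clock.split(":"))
    when = datetime(int(year), MONTHS[mon.lower()], int(dom),
                    hour, minute, second)
    return when, project, message


log = [
    "do 23 apr 2009 18:27:40 MDT|GPUGRID|Starting task "
    "A8-TONI_CELLGA_MED_4-6-40-RND6539_1 using cellmd2 version 503",
    "vr 24 apr 2009 11:05:05 MDT|GPUGRID|Computation for task "
    "A8-TONI_CELLGA_MED_4-6-40-RND6539_1 finished",
]

started = parse_line(log[0])[0]
ended = parse_line(log[1])[0]
elapsed = ended - started
print(elapsed)
```

For this log the task ran for a little over 16 and a half hours before it was aborted.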

Quasar
Joined: 17 Dec 08
Posts: 6
Credit: 1,943,937
RAC: 0
Message 8938 - Posted: 25 Apr 2009 | 20:50:04 UTC

Could it be that this problem occurs when more than one project is using the PS3 (maybe it's related to task switching)? I've been crunching on my PS3 for a couple of months without problems, but I ran into this problem yesterday as soon as I attached BOINC to yoyo@home. I aborted the WU and got another one, but that one seems to be hanging too. Here are the links to the WUs:

http://www.gpugrid.net/workunit.php?wuid=407696
http://www.gpugrid.net/workunit.php?wuid=407774

Toni
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 9 Dec 08
Posts: 584
Credit: 4,273,184
RAC: 0
Message 8987 - Posted: 27 Apr 2009 | 12:15:53 UTC - in response to Message 8938.
Last modified: 27 Apr 2009 | 12:17:46 UTC

Yes, we believe that task switching is an issue: the accelerated processors are somehow not properly "freed" upon process termination, probably a shortcoming of the platform. :-( Do only PS3GRID WUs hang, or also those from other projects?

Quasar
Joined: 17 Dec 08
Posts: 6
Credit: 1,943,937
RAC: 0
Message 9037 - Posted: 27 Apr 2009 | 22:17:27 UTC

I think all the WUs hang, from PS3Grid and others; at least that's what happened to me. Looking around, I found a newer BOINC client optimized for the PS3 (http://www.dotsch.de/boinc/boinc6219_10.linux-ps3.tar.gz). Unfortunately it's a command-line version; I tried to get it to work with BOINC Manager, but it didn't work. Maybe an update to the current PS3Grid BOINC client (whether using the one mentioned above or otherwise) would fix the problem?

jboese
Joined: 30 Jul 08
Posts: 21
Credit: 31,229
RAC: 0
Message 9098 - Posted: 29 Apr 2009 | 5:32:47 UTC - in response to Message 8987.
Last modified: 29 Apr 2009 | 5:33:50 UTC

Just to be helpful to another BOINCer: at least on my machine, the hanging WUs seem to be specific to gpugrid. If you set the resource share for yoyo to, say, 3E+38 (a huge number) and gpugrid to 1, then your PS3 will happily crunch only yoyo WUs and will not hang. The problem only seems to occur when the PS3 starts working on a gpugrid WU. I am not sure, but the problem also seems more pronounced with the memory stick version (which I run) than with YDL, though I think it occurs with both at times.
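
Editor's sketch: the resource-share trick works because shares are relative weights. Assuming a simplified model in which each project gets processing time in proportion to its share (the real BOINC scheduler also takes deadlines and other factors into account), the split for the values suggested above can be worked out like this. The `share_fractions` helper is hypothetical.

```python
def share_fractions(shares):
    """Map each project to its fraction of the total resource share."""
    total = sum(shares.values())
    return {project: share / total for project, share in shares.items()}


# Values from the post: a huge share for yoyo, 1 for gpugrid.
fractions = share_fractions({"yoyo@home": 3e38, "GPUGRID": 1.0})
for project, fraction in fractions.items():
    print(f"{project}: {fraction:.3g}")

# For comparison, two projects with equal shares split the time evenly.
print(share_fractions({"yoyo@home": 100.0, "GPUGRID": 100.0}))
```

With a share of 3E+38 against 1, the gpugrid fraction is so small that it is effectively never scheduled, which matches the behaviour described in the post.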

sam
Joined: 30 Apr 09
Posts: 15
Credit: 228,425
RAC: 0
Message 9156 - Posted: 30 Apr 2009 | 20:02:01 UTC

just joined the gpugrid; first wu is hung 0.0%

Thu 30 Apr 2009 04:48:38 PM EDT|GPUGRID|Starting task 5-TONI_CELLGA_MED_9-0-2-RND0374_0 using cellmd2 version 503

posted because the admin mentioned trying to discern a pattern. I have not tried another ps3 based project. I just setup the machine with ydl6.1, basically vanilla.



sam
Joined: 30 Apr 09
Posts: 15
Credit: 228,425
RAC: 0
Message 9160 - Posted: 30 Apr 2009 | 20:33:23 UTC

I guess I picked a bad day to join. I booted; no worky. aborted. started another, no worky. then I see a message suggesting I joined the project with the wrong link, the one from the website. so I tried again, new wu: Thu 30 Apr 2009 05:20:10 PM EDT|GPUGRID|Starting task 638000-IBUCH_GRAUS-1-100-RND2278_1 using cellmd2 version 503

no worky. 0.0% I just suspended the wu as I will my effort for today.

I am open to recommendations regarding this project and/or ydl6.1; I am new to both.

thanks.

Toni
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 9 Dec 08
Posts: 584
Credit: 4,273,184
RAC: 0
Message 9174 - Posted: 1 May 2009 | 11:37:14 UTC - in response to Message 9160.

Hi sam - the progress bar does not show the actual progression for PS3 WUs. All WUs last circa 12 hours each.

JaRski-S60R
Joined: 31 Dec 07
Posts: 27
Credit: 178,113,194
RAC: 104,261
Message 9189 - Posted: 2 May 2009 | 8:10:59 UTC - in response to Message 9174.
Last modified: 2 May 2009 | 8:14:24 UTC

In the week since, I crunched 1 WU for gpugrid and it hung: 16+ hrs at 0%.

http://www.gpugrid.net/result.php?resultid=598295

jboese
Joined: 30 Jul 08
Posts: 21
Credit: 31,229
RAC: 0
Message 9234 - Posted: 3 May 2009 | 6:03:28 UTC - in response to Message 9160.
Last modified: 3 May 2009 | 6:04:43 UTC

I have promised to be political and will simply say the PS3 portion of this project is experiencing "growing" pains and gpugrid WU often hang. The problem is not with your setup (wish it was more clear so people don't waste their time debugging a universal project problem).

JaRski-S60R
Joined: 31 Dec 07
Posts: 27
Credit: 178,113,194
RAC: 104,261
Message 9994 - Posted: 20 May 2009 | 10:55:02 UTC - in response to Message 9174.

After 7 finished WUs, this one went 39+ hrs with 0% progress, so I cancelled it.
http://www.gpugrid.net/result.php?resultid=683455

JaRski-S60R
Joined: 31 Dec 07
Posts: 27
Credit: 178,113,194
RAC: 104,261
Message 10111 - Posted: 23 May 2009 | 21:23:44 UTC - in response to Message 9174.
Last modified: 23 May 2009 | 21:26:09 UTC

Maybe the following can help with hanging WUs:

This WU, http://www.gpugrid.net/result.php?resultid=694578, crunched for the first 24 hrs with 0% progress. Then I restarted Linux from the menu, the way that shows the blue screen, but without needing to enter a password and with ROOT as the user. Within an hour of the reboot the progress was counting, and the WU finished successfully.
Another donor's system did this WU in the normal number of hours.

JaRski-S60R
Joined: 31 Dec 07
Posts: 27
Credit: 178,113,194
RAC: 104,261
Message 10691 - Posted: 19 Jun 2009 | 17:28:03 UTC - in response to Message 9174.

http://www.gpugrid.net/result.php?resultid=819492

it's getting boring that things aren't getting fixed

JaRski-S60R
Joined: 31 Dec 07
Posts: 27
Credit: 178,113,194
RAC: 104,261
Message 10904 - Posted: 28 Jun 2009 | 13:40:06 UTC - in response to Message 9174.
Last modified: 28 Jun 2009 | 13:42:07 UTC

Could it be heat related that my WUs often hang?
I notice that with a higher room temperature (above 21°C) I do get this hanging.
When it's 20°C or below, it can crunch for weeks.
I do NOT have this problem with yoyo!!!!

BTW, another one wasted 2 full days of energy... so I've stopped requesting work for the moment (26°C inside).
http://www.gpugrid.net/result.php?resultid=863967

JaRski-S60R
Joined: 31 Dec 07
Posts: 27
Credit: 178,113,194
RAC: 104,261
Message 11771 - Posted: 10 Aug 2009 | 0:42:35 UTC - in response to Message 9174.
Last modified: 10 Aug 2009 | 0:43:19 UTC

Since I last started ps3grid with yoyo suspended, I've had no problems... so far it looks good... but if it's multi-project related, it isn't nice of this project not to fix it. I'd be MORE satisfied if I could get yoyo back at 10%, so the PS3 could ask for work from that project if gpugrid runs out, the server hangs, ...
