
Message boards : Graphics cards (GPUs) : Work unit hanging... different from others reported?

Scott Brown
Message 9926 - Posted: 17 May 2009 | 20:29:57 UTC

Had to cancel a workunit that hung at about 85% complete (see here). I was curious whether this is a different error from the others reported, since 1) it is not one of the KASHIF_HIVPR units--it is an IBUCH_KID, 2) I am using BOINC 6.5.0, so no 6.6.x problems, and 3) I believe the driver is 178.24, so definitely not the 185.xx issues.

The machine in question is running an 8800GS with shaders OC'ed, but this is the first hanging unit on it so far.


Paul D. Buck
Message 9929 - Posted: 17 May 2009 | 21:19:19 UTC - in response to Message 9926.

Had to cancel a workunit that hung at about 85% complete (see here). I was curious whether this is a different error from the others reported, since 1) it is not one of the KASHIF_HIVPR units--it is an IBUCH_KID, 2) I am using BOINC 6.5.0, so no 6.6.x problems, and 3) I believe the driver is 178.24, so definitely not the 185.xx issues.

The machine in question is running an 8800GS with shaders OC'ed, but this is the first hanging unit on it so far.

Ok, we KNOW that 6.6.20 hangs work units badly. I have seen it with other versions. The problem is that we do NOT know what is causing this, so there is no way to tell for sure which version the problem was introduced in...

Or to put it another way, you could be seeing the earliest occurrence of this bug. Try a reboot; if it is the 6.6.20 problem, you will likely start to see an increase in speed. USUALLY you will start to see the time to completion drop several seconds per second if this WAS the long-run bug...

uBronan
Message 9931 - Posted: 17 May 2009 | 21:34:24 UTC

I returned home 2 hours ago and see another so-called big unit stuck at 7.480% for a long time, at least the 2 hours I have been home.
Since I am now running 6.5.0 again, I can no longer see how long it has actually been stuck at that percentage.
The only thing I see is the steady increase of the time to completion, from 25 to 32 hours now, so I guess this one is also going to error out after many hours of running.
I really don't think a unit would make no progress in more than 2 hours, even if it is a long-running one.

Scott Brown
Message 9934 - Posted: 18 May 2009 | 1:43:48 UTC - in response to Message 9929.


Or to put it another way, you could be seeing the earliest occurrence of this bug. Try a reboot; if it is the 6.6.20 problem, you will likely start to see an increase in speed. USUALLY you will start to see the time to completion drop several seconds per second if this WAS the long-run bug...


Drat... I had already aborted that one before I thought about it being potentially different from the already known problem (or potentially informative as an earlier-version example). That machine has already completed another unit--a RAUL unit--in typical time, without a restart.


Paul D. Buck
Message 9935 - Posted: 18 May 2009 | 2:02:41 UTC - in response to Message 9934.


Or to put it another way, you could be seeing the earliest occurrence of this bug. Try a reboot; if it is the 6.6.20 problem, you will likely start to see an increase in speed. USUALLY you will start to see the time to completion drop several seconds per second if this WAS the long-run bug...


Drat... I had already aborted that one before I thought about it being potentially different from the already known problem (or potentially informative as an earlier-version example). That machine has already completed another unit--a RAUL unit--in typical time, without a restart.

Scott,

that is what is making this bug so much fun. 6.6.20 was unique in that it affected nearly 50% of the tasks I ran on it. I think I have seen it on 6.6.23 or .25... not sure... but there are those odd long tasks, so sometimes it is hard to know for sure until they are done. Sadly, you cannot always tell by the names... or I can't remember the key...

At any rate, it is still on my list of things to look for... I found one more pointer today, not that it will do much good...

[AF>DoJ] supersonic
Message 9965 - Posted: 19 May 2009 | 12:21:07 UTC


Hello, just to report that after a first IBUCH_KID that hung last week,

I had a GIANNI_FB that hung this weekend.

After aborting it, my machine is now stuck on a RAUL_pY...


BOINC 6.4.7 (I can't remember the drivers), 8800 GTS 512.

I'm surprised, because my machine had been running since December without problems.

My RAC is dropping...

Paul D. Buck
Message 9968 - Posted: 19 May 2009 | 15:07:48 UTC

Have you tried just stopping and restarting BOINC?

We KNOW that there are some issues in the Resource Scheduler in relation to starting tasks (at least), and some other issues that MAY be related. The trouble is that at the moment I am chasing rumors and have no data to work with yet (I am hoping to be collecting some as we speak)...

We KNOW 6.6.20 could cause tasks to hang, or run slow, but we do ****NOT**** know that the problem is restricted to that version. So, try just stopping and restarting BOINC first; if that does not unstick it, suspend it and let another task run, and also try rebooting...
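
For anyone who wants to script those steps, here is a minimal sketch driving boinccmd from Python. It assumes boinccmd is on the PATH and that the local client allows RPC access; the project URL and task name are placeholders for the actual stuck task.

    # Sketch of the recovery steps above, driven through boinccmd.
    # Assumptions: boinccmd is on PATH, the local client allows RPC access,
    # and PROJECT_URL / TASK_NAME are placeholders for the real stuck task.
    import subprocess

    PROJECT_URL = "http://www.gpugrid.net/"      # placeholder project URL
    TASK_NAME = "name_of_the_stuck_task"         # placeholder task name

    def boinccmd(*args):
        """Run one boinccmd sub-command and return its text output."""
        result = subprocess.run(["boinccmd", *args],
                                capture_output=True, text=True)
        return result.stdout

    # 1) Confirm the task really is stuck (fraction done not moving).
    print(boinccmd("--get_tasks"))

    # 2) Suspend the stuck task so another task can run in its place.
    boinccmd("--task", PROJECT_URL, TASK_NAME, "suspend")

    # 3) If that does not unstick things, stop the client entirely
    #    (restart it, or reboot the machine, by hand afterwards).
    boinccmd("--quit")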

I know it is a lot to ask ... but, if we are ever going to get our hands around this we have to figure out what the problem is and what versions it might affect.

ExtraTerrestrial Apes (volunteer moderator, volunteer tester)
Message 9973 - Posted: 19 May 2009 | 19:44:17 UTC

You're right, Paul, BUT.. this is at least the second report of repeated hanging tasks and different WUs with 6.4.7. It looks like *something* is up, but it does not seem to be "the 6.6.20 problem".

MrS
____________
Scanning for our furry friends since Jan 2002

Paul D. Buck
Message 9978 - Posted: 19 May 2009 | 20:47:51 UTC - in response to Message 9973.
Last modified: 19 May 2009 | 20:55:40 UTC

You're right, Paul, BUT.. this is at least the second report of repeated hanging tasks and different WUs with 6.4.7. It looks like *something* is up, but it does not seem to be "the 6.6.20 problem".

I did not think that I said it was... I said we don't know enough... and that it is a possibility... in the meantime, try these things... :)

The more I dig into the Resource Scheduler and ponder the implications of the code buried therein the less sanguine I get about how this system works.

Richard Haselgrove has documented a problem on SaH where the CUDA tasks are started before they are initialized... and the task of course immediately crashes ... also not this problem, but it is a flaw in the way resources and tasks are managed.

I am trying to gather data for another error I am calling "silent restart" which may or may not be related to long running tasks.

The fundamental problem is that there is too much we do not know... and too many times this last month I have dug deep into a problem and found that the error is one that has been plaguing us for years. Meaning what? Meaning that superficial changes in some of the code may cause problems of a long-standing nature to slip in and out of view.

The part of the code that I am worried about has not changed in a long time, as far as I know... which means that what was a disaster in 6.6.20 may only be a mild annoyance in other versions... but the bug is still there...

As proof of my case, the "no heartbeat" and "too many restarts" errors have been longstanding problems where people would lose tasks, and we had been pulling our hair out trying to figure out what was causing them... well, I now know of two different potential causes. Neither is related to the tasks that were being mangled. Or to put it another way, we were looking in the wrong places...

{edit}

As an example of how this can happen, take 6.4.7 (or actually any version of BOINC), a specific task, and a specific card... the task hits a particular point in its loops and takes slightly longer to get through the loop than expected. The science application does not send its heartbeat message in time, so BOINC shoots the application and relaunches it at the prior checkpoint, which means you could see very little advancement of the task because it is being killed and restarted all the time.
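
This is not BOINC's actual code, but a minimal watchdog sketch of the kill-and-relaunch pattern described above; the application command and the heartbeat file it is assumed to touch are hypothetical.

    # Illustrative watchdog loop (NOT BOINC's real implementation): if the
    # worker stops updating its heartbeat, it is killed and relaunched from
    # its last checkpoint -- producing the "almost no progress" pattern above.
    import os
    import subprocess
    import time

    APP_CMD = ["./science_app", "--resume-from-checkpoint"]  # hypothetical app
    HEARTBEAT_FILE = "heartbeat.txt"    # assumed to be touched by the worker
    HEARTBEAT_TIMEOUT = 30.0            # assumed timeout, in seconds

    def heartbeat_age():
        """Seconds since the worker last touched its heartbeat file."""
        return time.time() - os.path.getmtime(HEARTBEAT_FILE)

    def supervise():
        proc = subprocess.Popen(APP_CMD)
        while proc.poll() is None:                 # while the app is running
            time.sleep(5)
            if heartbeat_age() > HEARTBEAT_TIMEOUT:
                proc.kill()                        # "BOINC shoots the application"
                proc.wait()
                proc = subprocess.Popen(APP_CMD)   # relaunch at prior checkpoint

    if __name__ == "__main__":
        supervise()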

One quick way to see if this is happening is to watch the PID of the processes running under BOINC: if the one for GPUGRID keeps changing, then... (you have to turn on the additional PID column via the View menu of Task Manager on Windows).
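
As a sketch of that check, something like the following logs every PID change. It assumes the psutil package is installed, and the "acemd" process-name prefix is only a guess for the GPUGRID application, so substitute whatever name Task Manager actually shows.

    # Sketch of the PID check: if the process ID of the GPUGRID application
    # keeps changing, the task is being killed and restarted behind the scenes.
    # Assumes psutil is installed; the "acemd" name prefix is a guess -- use
    # whatever process name Task Manager shows for the GPUGRID science app.
    import time
    import psutil

    last_pid = None
    while True:
        pids = [p.info["pid"] for p in psutil.process_iter(["pid", "name"])
                if (p.info["name"] or "").lower().startswith("acemd")]
        pid = pids[0] if pids else None
        if pid != last_pid:
            print(time.strftime("%H:%M:%S"), "GPUGRID app PID is now:", pid)
            last_pid = pid
        time.sleep(10)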

Again, we don't know why 6.6.20 caused many tasks to seemingly run forever, and though here we are concerned with the GPU, I also have experience with a system where it was happening to a CPU class task. And I am pretty sure I saw it on a task run with 6.6.23 ...

ExtraTerrestrial Apes (volunteer moderator, volunteer tester)
Message 10017 - Posted: 21 May 2009 | 9:44:15 UTC - in response to Message 9978.

OK, sorry, so I read too much into what you actually wrote. Apart from this.. still agreed ;)

MrS
____________
Scanning for our furry friends since Jan 2002

Paul D. Buck
Message 10036 - Posted: 21 May 2009 | 14:41:14 UTC - in response to Message 10017.

OK, sorry, so I read too much into what you actually wrote. Apart from this.. still agreed ;)

No worries ... I did not take offense or get bugged... :)

Just trying to keep us all on the same page...

Though my input is discounted, debugging software is something I have been doing for some 34 years, all the way from assembly-language programs to Ada. Even when I play computer games I rarely play to win; I usually play to kill time and to learn how the AI cheats...

What concerns me with this area of code is that I suspect that the same fundamental flaw is presenting itself under different guises... so we see what we think are 3-4 problems and it is really one flaw. The problem is that our diagnostic tools are very limited and hard to use.

Anyway, onward ...

[AF>DoJ] supersonic
Message 10182 - Posted: 26 May 2009 | 9:29:12 UTC
Last modified: 26 May 2009 | 9:29:40 UTC

Hello,

I did try a stop/start of the BOINC client, but no luck for me; I got another KASHIF_HIVPR unit that has been hanging for 4 days now... :(

As there is no way to stop a WU other than being connected to the machine (BAM manages projects only, not WUs), and as I have no remote connection established with this machine, here is my question:

Does a WU stop crunching when the deadline is reached and crossed?

Or does the crunching go on cycling and cycling until the end of time? ;)

thank you.

Paul D. Buck
Message 10191 - Posted: 26 May 2009 | 13:50:09 UTC

All tasks have a drop-dead time... the problem is that if the task or machine is hung, this may or may not be detected. Without knowing more it is hard to say what is going on. If the task is "running", it will eventually get to the drop-dead time and quit as having run too long.

[AF>DoJ] supersonic
Message 10193 - Posted: 26 May 2009 | 15:16:52 UTC

OK.

And how long is the drop-dead time?

Say, 24 hours after the deadline, or more than that?

Thanks.

Paul D. Buck
Message 10197 - Posted: 26 May 2009 | 18:40:29 UTC - in response to Message 10193.

OK.

And how long is the drop-dead time?

Say, 24 hours after the deadline, or more than that?

Thanks.

The short answer is that I am not sure. It is not a set time per se; it is the "maximum CPU time exceeded" limit, which is a function of the speed of the system. I don't know what option the project selected here, because this is about the first time we have had this issue...
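
For what it's worth, as I understand it the limit is derived from the workunit's rsc_fpops_bound divided by the estimated speed of the host, so it scales with how fast the machine is. A rough sketch of the arithmetic, with made-up numbers:

    # Rough sketch of the "drop dead" computation: the limit is not a fixed
    # wall-clock deadline but a run-time bound derived from the workunit's
    # rsc_fpops_bound and the estimated speed of the machine.  The numbers
    # below are made up purely for illustration.
    rsc_fpops_bound = 1.0e15      # operation limit set by the project (FLOPs)
    estimated_flops = 5.0e9       # estimated speed of the host/app (FLOP/s)

    max_runtime_seconds = rsc_fpops_bound / estimated_flops
    print(f"Aborted as 'maximum time exceeded' after "
          f"~{max_runtime_seconds / 3600:.1f} hours of runtime, "
          "independent of the report deadline.")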

If you have BAM access you could detach and reattach through BAM, and that would clear the task... since this is a remote machine, that is the only thing I can think of so you can get back to being productive.

I cannot remember whether you can do a project reset through BAM or not... that is the other option to try, if it is available.


