
Message boards : Number crunching : WU getting stuck after short time

chimmy
Joined: 24 Feb 09
Posts: 14
Credit: 1,261,660
RAC: 0
Message 9382 - Posted: 6 May 2009 | 16:56:38 UTC
Last modified: 6 May 2009 | 17:01:15 UTC

I have a WU, 441130, that keeps getting stuck: it runs for some time and then just stops for no obvious reason. BOINC still shows the WU as running and I can see the app in Task Manager, yet no progress is ever made.

I've paused and restarted the WU several times but after 1-2 hours it gets stuck again.

No changes to my system; this just started happening with this WU. SETI WUs are running w/o issue. I have two 9600 GTs, non-SLI.

The WU is stuck right now. The temp on that GPU is also normal, not slightly higher as it is when a WU is actually running.

Any ideas or things I can put in the cc_config.xml to help debug this?

Thanks,

Jim
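
On the cc_config.xml question above, here is a minimal sketch (not an official recipe) of generating a config file that turns on some client log flags. The flag names `task_debug`, `cpu_sched_debug` and `coproc_debug` are taken from the BOINC client configuration documentation; whether a given 6.x client honours all of them is an assumption, and the helper name `build_cc_config` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Log flags assumed to be relevant to a hung GPU task. Check the client
# configuration docs for the flags your BOINC version actually supports.
DEBUG_FLAGS = ("task_debug", "cpu_sched_debug", "coproc_debug")


def build_cc_config(flags=DEBUG_FLAGS):
    """Return cc_config.xml text with each given log flag set to 1."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        ET.SubElement(log_flags, name).text = "1"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Print the XML; save it as cc_config.xml in the BOINC data directory.
    print(build_cc_config())
```

The output would go into cc_config.xml in the BOINC data directory, after which the client needs to re-read its config file or be restarted. The extra messages should then show up in the client's message log.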

Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Message 9398 - Posted: 6 May 2009 | 20:53:12 UTC - in response to Message 9382.

What BOINC Version?

If 6.6.20, try 6.5.0 or 6.6.23 ...

dyeman
Joined: 21 Mar 09
Posts: 35
Credit: 591,434,551
RAC: 0
Message 9407 - Posted: 6 May 2009 | 23:44:14 UTC - in response to Message 9398.

I am seeing something very similar with this wu 439696.

I was running on 6.6.17. I just upgraded to 6.6.23, but it is still stopping (I need to suspend/resume to kick it back into life for another few minutes).

Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Message 9410 - Posted: 7 May 2009 | 1:29:14 UTC - in response to Message 9407.

Check your preferences for suspend when the computer is in use. Go to preferences here, change something, change it back, save, then do an Update on the client and see if that helps.

K1atOdessa
Joined: 25 Feb 08
Posts: 249
Credit: 370,320,941
RAC: 0
Message 9412 - Posted: 7 May 2009 | 1:33:58 UTC

I think there is something funny going on with these IBUCH_KID WUs. I've seen errors from various users with these. Maybe they are more "sensitive" to something?

dyeman
Joined: 21 Mar 09
Posts: 35
Credit: 591,434,551
RAC: 0
Message 9413 - Posted: 7 May 2009 | 1:58:33 UTC - in response to Message 9412.
Last modified: 7 May 2009 | 2:02:25 UTC

I'm also seeing it on this wu 434995 which is a "KASHIF" wu.

I'll try Paul's suggestion and see what happens.

Add: When I installed 6.6.23, "Use GPU while computer in use" was unchecked in local preferences, so I already had to change that to get GPUGRID to run at all.

dyeman
Joined: 21 Mar 09
Posts: 35
Credit: 591,434,551
RAC: 0
Message 9417 - Posted: 7 May 2009 | 4:27:32 UTC - in response to Message 9413.

Still happening. Suspend/resume and it takes off again for what seems to be somewhere around 30 mins to an hour.

I have a single 9800GT with Driver version 182.06.

Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Message 9441 - Posted: 7 May 2009 | 15:53:39 UTC - in response to Message 9417.

Well, I am at a loss...

Sadly, I don't have that many problems so I have less experience ...:)

dyeman
Joined: 21 Mar 09
Posts: 35
Credit: 591,434,551
RAC: 0
Message 9452 - Posted: 7 May 2009 | 22:00:35 UTC - in response to Message 9441.
Last modified: 7 May 2009 | 22:36:08 UTC

I have only been running GPUGrid for 6 weeks, but this is the first processing problem I have encountered. I will try installing the new driver (185.85) to see whether that helps.

Update: First WU errored out almost straight away. Second WU now running. Fingers crossed..

chimmy
Joined: 24 Feb 09
Posts: 14
Credit: 1,261,660
RAC: 0
Message 9453 - Posted: 7 May 2009 | 23:01:17 UTC

I ended up just aborting that WU. I picked up another and haven't had a single problem with the new WU. I'm on v182.50 drivers now.

Really have no clue what the issue was.

My .02 as someone in the software development field... It would be nice to know from the GPUGrid/BOINC devs if there are some debug options that can be enabled on the client to get useful information when this happens. That way we (royal we) wouldn't sit for days restarting and restarting a failing WU without getting any type of useful info or results from the efforts.

Thanks,

Jim
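
To avoid watching a failing WU by hand, a hang like the ones described in this thread could in principle be flagged automatically. The following is only a sketch under stated assumptions: the `(time, fraction done)` samples are assumed to be collected somehow (for example, noted from the progress column), and the function name `first_stall` and the 10-minute threshold are made up for illustration.

```python
from typing import Iterable, Optional, Tuple

# One sample: (seconds since monitoring started, fraction done in [0, 1]).
Sample = Tuple[float, float]


def first_stall(samples: Iterable[Sample], max_idle: float = 600.0) -> Optional[float]:
    """Return the time progress last advanced before a stall, or None.

    A stall is declared once the fraction done has not increased for
    more than `max_idle` seconds.
    """
    last_progress_time = None
    best_fraction = None
    for t, fraction in samples:
        if best_fraction is None or fraction > best_fraction:
            # Progress advanced: remember when and how far.
            best_fraction = fraction
            last_progress_time = t
        elif t - last_progress_time > max_idle:
            return last_progress_time
    return None


if __name__ == "__main__":
    stuck = [(0, 0.10), (300, 0.12), (600, 0.12), (900, 0.12), (1200, 0.12)]
    print(first_stall(stuck))
```

A task that stops advancing while BOINC still lists it as running would show up as a non-None result, which gives a timestamp to line up against the client's message log.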

dyeman
Joined: 21 Mar 09
Posts: 35
Credit: 591,434,551
RAC: 0
Message 9458 - Posted: 8 May 2009 | 5:46:28 UTC - in response to Message 9453.

Good news: After installing 185.85 no more tasks have hung.

Bad news: The two tasks that suffered from the hang both errored out (one pretty quickly, the second after probably an hour or two of further processing), though this may not be the new driver (see this thread).

I now have a GIANNI running. It is up to 13% with no hang so far so ...

dyeman
Joined: 21 Mar 09
Posts: 35
Credit: 591,434,551
RAC: 0
Message 9489 - Posted: 8 May 2009 | 22:18:04 UTC - in response to Message 9458.

Good news: The GIANNI finished and validated. Now one third of the way through a KASHIF (with an IBUCH up next). On a sample of 1, the 185.85 driver may also be a little faster (perhaps 5%).

I also ran a SETI WU on 185.85 and it processed and validated fine.

uBronan
Joined: 1 Feb 09
Posts: 139
Credit: 575,023
RAC: 0
Message 9514 - Posted: 9 May 2009 | 10:34:28 UTC
Last modified: 9 May 2009 | 10:36:43 UTC

Euhm, the GIANNI ones have never been a problem for me, just the newer ones like IBUCH and KASHIF.
And btw, running SETI and GPUGrid together has been an absolute no-go for me.
After I ran 3 SETI units all my GPUGrid units died, so I never run them without rebooting to switch from one to the other.

dyeman
Joined: 21 Mar 09
Posts: 35
Credit: 591,434,551
RAC: 0
Message 9527 - Posted: 9 May 2009 | 11:54:59 UTC - in response to Message 9514.

The SETI WU didn't seem to cause a problem (although I was prepared for that to be the case). The KASHIF WU finished fine and validated, and the IBUCH is now 36% done and seems to be running fine. Somewhere in between the KASHIF starting and the IBUCH starting the SETI WU processed and validated.

yorge
Joined: 24 Dec 08
Posts: 1
Credit: 8,653,364
RAC: 0
Message 9565 - Posted: 9 May 2009 | 22:21:14 UTC

Thank you all for the solution!
