Message boards : Number crunching : NATHAN_dhfr36 problems

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 0
Level
Trp
Message 29471 - Posted: 16 Apr 2013 | 0:34:40 UTC
Last modified: 16 Apr 2013 | 0:39:17 UTC

Every two or three days I catch a NATHAN_dhfr36 workunit on a random one of my hosts that is still at 0% progress after a very long time (while the GPU usage seems to be normal). I caught one after 86 hours. Some of them can be fixed by a system restart.
The latest one is this workunit (my host is the 4th). It had been hanging for 15 hours when I spotted it; a restart didn't help in this case, so I aborted it. It usually happens on my multi-GPU hosts.
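
For anyone who wants to catch these earlier than after 74 or 86 hours, here is a minimal sketch of the check described above: flag any task that still reports 0% progress after some run time. The `TaskSnapshot` type, the task names and the 2-hour threshold are assumptions made up for illustration; how the progress figures are obtained from the client is left out.

```python
from dataclasses import dataclass


@dataclass
class TaskSnapshot:
    # Hypothetical record of one running task (not a BOINC data structure).
    name: str
    fraction_done: float  # 0.0 .. 1.0
    elapsed_hours: float


def find_stalled(tasks, min_hours=2.0):
    """Return tasks that still show no progress after min_hours of run time."""
    return [
        t for t in tasks
        if t.elapsed_hours >= min_hours and t.fraction_done <= 0.0
    ]


# Made-up example data.
tasks = [
    TaskSnapshot("example-task-A", 0.0, 15.0),    # hanging for 15 hours
    TaskSnapshot("example-task-B", 0.095, 1.33),  # progressing normally
    TaskSnapshot("example-task-C", 0.0, 0.5),     # only just started
]

stalled = find_stalled(tasks)
for t in stalled:
    print(f"{t.name}: 0% after {t.elapsed_hours:.1f} h, consider a restart or abort")
```

The threshold is a judgment call: too low and freshly started tasks get flagged, too high and a stuck task wastes most of a day.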

nate
Joined: 6 Jun 11
Posts: 124
Credit: 2,928,865
RAC: 0
Level
Ala
Message 29497 - Posted: 18 Apr 2013 | 10:12:36 UTC

Well, that's strange. Thanks for letting us know. For the most part my NATHAN workunits have been very stable with low error rates, though I'm not sure whether issues like the ones you have experienced would show up in the error statistics (I'll look into that). Since crunchers have been complaining about this type of problem (stuck workunits) more often recently, I wonder if there is some subtle issue with the app/client/driver, and not just with a specific group of workunits... I'm noticing some very strange error messages in that group you linked. Hmmm...

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 0
Level
Trp
Message 29545 - Posted: 23 Apr 2013 | 20:50:28 UTC

I've got another one of these. I spotted it after 74 hours...
After a system restart it ran into an application error.

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 29546 - Posted: 23 Apr 2013 | 21:57:37 UTC - in response to Message 29545.

This is your task's stderr output:

<core_client_version>6.10.60</core_client_version>
<![CDATA[
<message>
The system cannot find the path specified. (0x3) - exit code 3 (0x3)
</message>
<stderr_txt>
MDIO: cannot open file "output.restart.coor"
MDIO: cannot open file "output.restart.coor"
Kernel not foundAssertion failed: a, file swanlibnv2.cpp, line 59

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.

</stderr_txt>
]]>

Earlier today I had a W7x64 system blue-screen, and on restarting, the app crashed. It was a NOELIA_PEPTGPRC WU; however, I also got the same kernel error message:

Kernel not foundAssertion failed: a, file swanlibnv2.cpp, line 59

Might be something in it.
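
To check whether other failed tasks share this message, a rough sketch like the following could be run over saved stderr text. The two signature strings are taken from the output above; the function name and the idea of matching on both strings together are my own assumptions, not anything the app or BOINC provides.

```python
# Substrings taken from the stderr output quoted in this thread.
KNOWN_SIGNATURES = ("Kernel not found", "swanlibnv2.cpp")


def matches_kernel_error(stderr_text):
    """True if the text contains every part of the 'Kernel not found' assertion."""
    return all(sig in stderr_text for sig in KNOWN_SIGNATURES)


# Sample text copied from the stderr output above.
sample = (
    'MDIO: cannot open file "output.restart.coor"\n'
    "Kernel not foundAssertion failed: a, file swanlibnv2.cpp, line 59\n"
)

print(matches_kernel_error(sample))
```

Matching on substrings rather than whole lines matters here, because the app prints "Kernel not found" and the assertion text on the same line without a separator.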
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 29726 - Posted: 7 May 2013 | 13:05:02 UTC - in response to Message 29546.

> Earlier today I had a W7x64 system blue screen and on restarting the app crashed. It was a NOELIA_PEPTGPRC WU, however I also got the Kernel error message,
>
> Kernel not foundAssertion failed: a, file swanlibnv2.cpp, line 59
>
> Might be something in it.

Any WU with the word NOELIA in it is a nightmare here and unfortunately forces me to move to a different project. I just wasted several more days of GPU time :-(

TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Level
Met
Message 29794 - Posted: 10 May 2013 | 13:18:03 UTC

Is it wise to abort this WU: I41R15-NATHAN_dhfr36_6-8-32-RND7935?
It errored out for five previous crunchers. On my GTX550Ti it will take another 34 hours to complete, with 1 hour and 20 minutes elapsed so far (9.5%).
Thanks for the input.
____________
Greetings from TJ

Stoneageman
Joined: 25 May 09
Posts: 224
Credit: 34,057,224,498
RAC: 0
Level
Trp
Message 29795 - Posted: 10 May 2013 | 14:34:05 UTC

I cannot see your setup, but if in doubt, abort. Five resends do not inspire confidence!

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 29797 - Posted: 10 May 2013 | 16:47:22 UTC - in response to Message 29794.

9.5% in 80 min: at that rate it would take ~14 h in total (if it doesn't crash). Best to ignore some of BOINC's estimates.
If you post the resultid then we might be able to see the WU.
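
The ~14 h figure is a simple linear extrapolation from the progress so far. As a quick sketch (the function name is made up, and it assumes the task keeps progressing at a constant rate):

```python
def estimate_total_hours(elapsed_minutes, fraction_done):
    """Linearly extrapolate the total run time from progress so far."""
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_minutes / fraction_done / 60.0


# Figures from the question: 9.5% done after 1 hour 20 minutes.
total = estimate_total_hours(80, 0.095)
remaining = total - 80 / 60
print(f"total ~{total:.1f} h, remaining ~{remaining:.1f} h")
# prints: total ~14.0 h, remaining ~12.7 h
```

This only works once the task has made some progress; a task sitting at 0% gives no basis for an estimate.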

> Is it wise to abort this WU: I41R15-NATHAN_dhfr36_6-8-32-RND7935?
> By five previous crunchers it errored out. On my GTX550Ti it will take another 34 hours to complete after 1 hour and 20 minutes gone (9.5%).
> Thanks for the input.

TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Level
Met
Message 29799 - Posted: 10 May 2013 | 19:21:24 UTC
Last modified: 10 May 2013 | 19:21:44 UTC

This is the WU: http://www.gpugrid.net/workunit.php?wuid=4427345
It is 51% done after 7 hours.
____________
Greetings from TJ

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 29800 - Posted: 10 May 2013 | 19:53:45 UTC - in response to Message 29799.

These are the recent results from the computers that your task has previously run on:

http://www.gpugrid.net/results.php?hostid=151016 - GTX570, 314.22, W7
http://www.gpugrid.net/results.php?hostid=150071 - GTX 470, ? , Linux
http://www.gpugrid.net/results.php?hostid=138340 - GTX560Ti, 306.97, W7
http://www.gpugrid.net/results.php?hostid=151494 - GTX 470, 310.90, W7
http://www.gpugrid.net/results.php?hostid=151214 - GTX 560, 314.22, W7

If you click the links you will see that most of these systems have had a lot of errors recently, suggesting that their BOINC settings are bad. They might, for example, have configured their systems not to use the GPU while the system is in use, and are getting lots of driver restarts and app crashes as a result. The systems differ in their drivers and GPUs, and there is even a Linux rig.

Their error messages were also different. Again this indicates setup problems. If the errors were all the same it would indicate a bug/WU problem.
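
That rule of thumb can be written down as a small sketch. The host labels and error strings below are placeholders made up for illustration (they are not the actual messages from these hosts), and the rule itself is only a heuristic:

```python
from collections import Counter


def diagnose(errors_by_host):
    """Classify failures by whether every host reported the same error."""
    if len(errors_by_host) < 2:
        return "not enough failures to compare"
    distinct = Counter(errors_by_host.values())
    if len(distinct) == 1:
        return "identical errors: suspect the WU or the app"
    return "different errors: suspect individual host setups"


# Placeholder data, not real results from the thread.
failures = {
    "host-a": "placeholder error 1",
    "host-b": "placeholder error 2",
    "host-c": "placeholder error 1",
}

print(diagnose(failures))
```

In practice the messages would need normalising first (paths, line numbers, timestamps), otherwise the same underlying fault can look like several different errors.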

It's a pity moderators/testers can't actually see the BOINC setups being used, as this prevents proactive advice and limits what we have to go on when advising those who ask for it. Also, still no Linux driver in the list!

TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Level
Met
Message 29809 - Posted: 10 May 2013 | 22:02:45 UTC

Thanks for your explanation skgiven!
I will let the WU run; the estimate is still 5 hours, so I will let you know tomorrow how it ended.
Indeed, I have set BOINC to always use the GPU, and I have no driver or app problems.

The only error I've had in the last few weeks was when Microsoft was updating the nVidia driver, which I terminated. I have since set things up so that Microsoft is only allowed to update the Windows components, and I decide when to install and reboot. I usually set BOINC to no new work first, to be sure.

____________
Greetings from TJ

TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Level
Met
Message 29814 - Posted: 11 May 2013 | 9:21:39 UTC - in response to Message 29809.

> Thanks for your explanation skgiven!
> I will let the WU run, still 5 hours estimate, so I let it know tomorrow how it ended.

Well, it finished in 14.3 hours without error and I got 70,800 credits for it.
So I'm happy that I let it run.
____________
Greetings from TJ

nanoprobe
Joined: 26 Feb 12
Posts: 184
Credit: 222,376,233
RAC: 0
Level
Leu
Message 29820 - Posted: 11 May 2013 | 12:51:27 UTC

First NATHAN_dhfr36 task finished in 4.96 hours on 660TI running Linux Mint. No problems and 70,800 credits. Will be rebooting to XP soon to see how they run there.

nanoprobe
Joined: 26 Feb 12
Posts: 184
Credit: 222,376,233
RAC: 0
Level
Leu
Message 29860 - Posted: 12 May 2013 | 13:06:22 UTC

First NATHAN_dhfr36 task completed without errors on XP with 660TI. Took about 13 minutes longer than the same card on Linux Mint.

Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 29871 - Posted: 12 May 2013 | 15:32:05 UTC

Thankfully the NATHANs are back, so all 7 of my GPUs can run GPUGrid again. So far I have run hundreds of the NATHAN_dhfr36 WUs without any errors that I know of (TONIs too).
