Message boards : Number crunching : 'Energies have become nan' error

Stoneageman
Message 19879 - Posted: 12 Dec 2010 | 22:40:01 UTC

This error affects most of us here at least once. It happens regardless of card type, OS, driver or clocking. Certain types of WU are more prone than others.
However, one of my 580s almost always fails with this error. The card works fine on other projects.
Could the code that triggers this error be tweaked to make it more card tolerant? No rush though, as on PrimeGrid I'm getting 5X the points I'd get here, lol

Dave_In_Oz
Message 19947 - Posted: 16 Dec 2010 | 10:35:11 UTC - in response to Message 19879.

I've also had this error on a GTX 295 based system. I don't do anything like overclocking, and I also use TThrottle to manage the heat on my CPUs and GPUs. So I suspect it is not a hardware problem. It is also accompanied by the message "MDIO ERROR: cannot open file "restart.coor"".

If this is due to a data problem, then computing a WU for many hours for no credit is pretty frustrating. Good programming technique should recognise the fact that people are running a project WU for many hours. I am sure the project actually does things with these failed units, whether that is reviewing the failure, the problem with the particular simulation data, or other bits.

skgiven
Volunteer moderator
Volunteer tester
Message 19950 - Posted: 16 Dec 2010 | 11:09:36 UTC - in response to Message 19947.

Yeah, the scientists continuously review their applications and when they see errors they change things.
Don't be too concerned about one nan error, errors happen. Your GTX295 is running well at the minute.

Dave_In_Oz
Message 19955 - Posted: 16 Dec 2010 | 13:08:31 UTC

I'm not that concerned about the odd error or a single nan error. I'm more frustrated at seeing the GPUs work for hours only to fail at the end of a WU. Some of my failures have similar run times to other users' runs, yet theirs succeed. The "cannot open file "restart.coor"" error is a pretty common message in the failures I have looked at.

Tying up a GPU for that long only for the task to fail could be handled better by GPUGrid (or at least by the app code). Running an 8 hr WU for nothing is a bigger risk than running many smaller WUs from other projects, which lose fewer GPU hours per failure.

skgiven
Volunteer moderator
Volunteer tester
Message 19957 - Posted: 16 Dec 2010 | 15:03:08 UTC - in response to Message 19955.

The restart.coor message is not an error; it is also reported for successful tasks. I think this is just a file that is continuously opened during task runs, and the app then tries to open it again after task completion, although it does not need to.

I and others understand your concerns and have made many suggestions regarding such losses. I don't think there is a high suggestion implementation rate here, but I’m sure it’s for a good reason. No doubt the scientists have a very different ‘research-orientated’ perspective than us; it's their project and they have to allocate their time and resources to best suit their research. They do see the errors, and more vividly than the cruncher, but I expect they are reluctant to spend time implementing difficult systemic changes that could make things go pear shaped. It’s worth noting that this is a research group with several different projects being run under the one roof, so what might suit one project might mess with another, or a potentially beneficial change might only last for the duration of one batch of tasks (a few days).

Beyond
Message 20020 - Posted: 24 Dec 2010 | 15:40:04 UTC - in response to Message 19957.

"I don't think there is a high suggestion implementation rate here"

No kidding...

Just stopped by to say MERRY CHRISTMAS to all!
