Message boards : Number crunching : Energies have become nan

Author Message
Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 20538 - Posted: 28 Feb 2011 | 8:50:11 UTC
Last modified: 28 Feb 2011 | 8:52:46 UTC

Skip Da Shu is getting "Energies have become nan" task errors on this system.

All tasks error out with,
ERROR: file deven.cpp line 879: # Energies have become nan
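
For anyone sifting through many failed results, a line like the one above can be picked out of the stderr text mechanically. This is only a minimal sketch: the regular expression is inferred from the error lines quoted in this thread, and the names `ERROR_LINE` and `parse_error` are made up for illustration, not part of the application.

```python
import re

# Pattern inferred from lines quoted in this thread, for example:
#   ERROR: file deven.cpp line 879: # Energies have become nan
# (assumption: other error lines follow the same "file ... line ...:" layout)
ERROR_LINE = re.compile(
    r"^ERROR: file (?P<file>\S+) line (?P<line>\d+): #?\s*(?P<msg>.*)$"
)

def parse_error(line):
    """Return (source_file, line_number, message), or None if the line does not match."""
    m = ERROR_LINE.match(line.strip())
    if m is None:
        return None
    return m.group("file"), int(m.group("line")), m.group("msg").strip()

print(parse_error("ERROR: file deven.cpp line 879: # Energies have become nan"))
# ('deven.cpp', 879, 'Energies have become nan')
```

Grouping results by the returned message makes it easier to see whether a host fails with the same error every time or with a mix of errors.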

System is Linux, Boinc Version 6.12.15.

Failed tasks are TONI_KKAL2 and GIANNI_DHFR1000 tasks. No other tasks ran.

The tasks have mostly failed on the GTS250, but also on the GT340.
Some tasks have started running on one card and later ran on the other (after a Boinc or system restart).

I would suggest you remove the GTS250 and reinstall the drivers if need be.

MarkJ
Volunteer moderator
Volunteer tester
Joined: 24 Dec 08
Posts: 738
Credit: 200,909,904
RAC: 0
Message 20841 - Posted: 2 Apr 2011 | 4:55:51 UTC
Last modified: 2 Apr 2011 | 4:58:21 UTC

I got one of these just now. Machine has a GTX570 in it. Running 266.58 drivers under Win 7 x64. It was a KASHIF_HIVPR work unit this time. It ran for 3 hours 46 mins before it died.

Link to wu here
____________
BOINC blog

Profile Stoneageman
Joined: 25 May 09
Posts: 224
Credit: 34,057,224,498
RAC: 190
Message 20842 - Posted: 2 Apr 2011 | 5:57:24 UTC - in response to Message 20841.

Sorted my old nan problem by underclocking the card memory by 10% :-)

MarkJ
Volunteer moderator
Volunteer tester
Joined: 24 Dec 08
Posts: 738
Credit: 200,909,904
RAC: 0
Message 20861 - Posted: 6 Apr 2011 | 14:32:52 UTC
Last modified: 6 Apr 2011 | 14:34:44 UTC

And another one tonight here

I don't have the memory overclocked, but I do have the processor clock cranked up a little (1675 MHz for this run). For the previous WU I had it at 1700 MHz. I might try dropping it some more.
____________
BOINC blog

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 6,169
Message 20867 - Posted: 6 Apr 2011 | 21:28:04 UTC - in response to Message 20861.

And another one tonight here

I don't have the memory overclocked, but I do have the processor clock cranked up a little (1675 MHz for this run). For the previous WU I had it at 1700 MHz. I might try dropping it some more.

I have observed that tasks with high GPU utilization (above 93%), typically GIANNI_DHFRs, KASHIF_HIVPRs and TONI_KKALs, are more prone to error out this way than others. The solution is either to raise the GPU voltage by 0.025 V (and the fan speed too) or to lower the shader clock (or the GPU's memory clock) until these errors stop appearing.

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 20875 - Posted: 7 Apr 2011 | 13:43:01 UTC - in response to Message 20867.

The GIANNI tasks sometimes failed when I was overclocking my GTX470s. Just reducing the clocks back to normal was enough, but some people had to raise the voltage and some had to reduce the memory frequency. Every GPU is different.
The temperatures rose a bit with these tasks as well, but I'm now using MSI Afterburner, which is configured to adjust fan speed automatically in response to temperature changes.

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 20904 - Posted: 12 Apr 2011 | 15:55:47 UTC - in response to Message 20875.

Ton (ftpd) is seeing "Energies have become nan" failure messages and also "SWAN : FATAL : Cuda driver error 2bc in file 'swanlib_nv.c' in line 244" failures on his GTX295. They occur mostly on Toni's tasks, but they seem to affect both long and short tasks, and there are similar errors for Ignasi's work.
(XP x86, driver: 27051, Boinc 6.10.60)

Is this a known issue that is being looked at?

Betting Slip
Joined: 5 Jan 09
Posts: 670
Credit: 2,498,095,550
RAC: 0
Message 20905 - Posted: 12 Apr 2011 | 16:57:58 UTC - in response to Message 20904.

I've got one of the SWAN Cuda errors as well.

http://www.gpugrid.net/result.php?resultid=3882394
____________
Radio Caroline, the world's most famous offshore pirate radio station.
Great music since April 1964. Support Radio Caroline Team -
Radio Caroline

ftpd
Joined: 6 Jun 08
Posts: 152
Credit: 328,250,382
RAC: 0
Message 20915 - Posted: 13 Apr 2011 | 6:42:07 UTC - in response to Message 20904.
Last modified: 13 Apr 2011 | 6:42:21 UTC

And again, after more than 7 hours of processing, 2 (two) WUs were cancelled on the GTX295.

And again I have to wait a long time for a new download.

A waste of valuable time/money/power etc!

Please look at it!
____________
Ton (ftpd) Netherlands

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 20917 - Posted: 13 Apr 2011 | 9:32:04 UTC - in response to Message 20915.

I had 4 "ERROR: file tclutil.cpp line 31: get_Dvec() element 0 (b)
called boinc_finish
" errors, but they all errored out inside 10sec. Using driver: 26724 on both systems, one with a GTX260 and the other with GTX470's.

The one task I recently had fail after some time (5h) was triggered by the system, again.

Ton, your problems might be exacerbated by the more recent Beta driver, but others have not reported problems with it, and the problems are there with the earlier driver too. So I think the problem is more likely to do with the tasks themselves; basically it is down to Toni, Ignasi and the rest of the team to sort out.

Good luck guys,

Ross*
Joined: 6 May 09
Posts: 34
Credit: 443,507,669
RAC: 0
Message 21036 - Posted: 22 Apr 2011 | 4:22:38 UTC
Last modified: 22 Apr 2011 | 4:29:08 UTC

The following happened to several Wus with this driver on 2 boxes
A198-TONI_AGG1-8-100-RND4916_0
Workunit 2448115
Created 21 Apr 2011 19:47:36 UTC
Sent 21 Apr 2011 19:52:50 UTC
Received 21 Apr 2011 23:20:56 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status 98 (0x62)
Computer ID 95964
Report deadline 26 Apr 2011 19:52:50 UTC
Run time 8796.666317
CPU time 1981.4
stderr out <core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
- exit code 98 (0x62)
</message>
<stderr_txt>
# Using device 0
# There are 2 devices supporting CUDA
# Device 0: "GeForce GTX 570"
# Clock rate: 1.56 GHz
# Total amount of global memory: 1275658240 bytes
# Number of multiprocessors: 15
# Number of cores: 120
# Device 1: "GeForce GTX 570"
# Clock rate: 1.56 GHz
# Total amount of global memory: 1275789312 bytes
# Number of multiprocessors: 15
# Number of cores: 120
MDIO ERROR: cannot open file "restart.coor"
ERROR: file deven.cpp line 879: # Energies have become nan

called boinc_finish

</stderr_txt>
]]>


Validate state Invalid
Claimed credit 35140.0810185185
Granted credit 0
application version Long runs (8-12 hours on fastest card) v6.13 (cuda31)

Also these


A560r5-TONI_AB1-21-100-RND5976_1
Workunit 2447770
Created 21 Apr 2011 15:12:33 UTC
Sent 21 Apr 2011 22:50:47 UTC
Received 22 Apr 2011 0:06:48 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status 98 (0x62)
Computer ID 96625
Report deadline 26 Apr 2011 22:50:47 UTC
Run time 1046.854
CPU time 272.0657
stderr out <core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
- exit code 98 (0x62)
</message>
<stderr_txt>
# Using device 0
# There are 2 devices supporting CUDA
# Device 0: "GeForce GTX 580"
# Clock rate: 1.59 GHz
# Total amount of global memory: 1543045120 bytes
# Number of multiprocessors: 16
# Number of cores: 128
# Device 1: "GeForce GTX 580"
# Clock rate: 1.59 GHz
# Total amount of global memory: 1543176192 bytes
# Number of multiprocessors: 16
# Number of cores: 128
MDIO ERROR: warning: redefined atom parameters for ht
MDIO ERROR: warning: redefined atom parameters for ot
MDIO ERROR: warning: redefined atom parameters for cph1
MDIO ERROR: warning: redefined atom parameters for cph2
MDIO ERROR: warning: redefined atom parameters for nr1
MDIO ERROR: warning: redefined atom parameters for nr2
MDIO ERROR: warning: redefined atom parameters for hr3
MDIO ERROR: warning: redefined atom parameters for hr1
MDIO ERROR: warning: redefined bond parameters for ht ht
MDIO ERROR: warning: redefined bond parameters for ht ot
MDIO ERROR: warning: redefined bond parameters for nr1 cph1
MDIO ERROR: warning: redefined bond parameters for nr1 cph2
MDIO ERROR: warning: redefined bond parameters for nr2 cph1
MDIO ERROR: warning: redefined bond parameters for nr2 cph2
MDIO ERROR: warning: redefined bond parameters for cph1 cph1
MDIO ERROR: warning: redefined bond parameters for hr1 cph2
MDIO ERROR: warning: redefined bond parameters for hr3 cph1
MDIO ERROR: warning: redefined angle parameters for cph2 nr1 cph1
MDIO ERROR: warning: redefined angle parameters for cph2 nr2 cph1
MDIO ERROR: warning: redefined angle parameters for nr1 cph1 cph1
MDIO ERROR: warning: redefined angle parameters for nr1 cph2 nr2
MDIO ERROR: warning: redefined angle parameters for nr2 cph1 cph1
MDIO ERROR: warning: redefined angle parameters for nr1 cph2 hr1
MDIO ERROR: warning: redefined angle parameters for nr2 cph2 hr1
MDIO ERROR: warning: redefined angle parameters for hr3 cph1 cph1
MDIO ERROR: warning: redefined angle parameters for nr1 cph1 hr3
MDIO ERROR: warning: redefined angle parameters for nr2 cph1 hr3
MDIO ERROR: warning: redefined angle parameters for ht ot ht
MDIO ERROR: warning: redefined dihedral parameters for d cph2 nr1 cph1 cph1
MDIO ERROR: warning: redefined dihedral parameters for d cph2 nr2 cph1 cph1
MDIO ERROR: warning: redefined dihedral parameters for d nr1 cph1 cph1 hr3
MDIO ERROR: warning: redefined dihedral parameters for d nr1 cph2 nr2 cph1
MDIO ERROR: warning: redefined dihedral parameters for d nr2 cph1 cph1 nr1
MDIO ERROR: warning: redefined dihedral parameters for d nr2 cph2 nr1 cph1
MDIO ERROR: warning: redefined dihedral parameters for d hr1 cph2 nr1 cph1
MDIO ERROR: warning: redefined dihedral parameters for d hr1 cph2 nr2 cph1
MDIO ERROR: warning: redefined dihedral parameters for d hr3 cph1 cph1 hr3
MDIO ERROR: warning: redefined dihedral parameters for d hr3 cph1 nr1 cph2
MDIO ERROR: warning: redefined dihedral parameters for d hr3 cph1 nr2 cph2
MDIO ERROR: warning: redefined dihedral parameters for d nr2 cph1 cph1 hr3
MDIO ERROR: warning: redefined improper parameters for i hr1 nr1 nr2 cph2
MDIO ERROR: warning: redefined improper parameters for i hr1 nr2 nr1 cph2
MDIO ERROR: warning: redefined improper parameters for i hr3 cph1 nr1 cph1
MDIO ERROR: warning: redefined improper parameters for i hr3 cph1 nr2 cph1
MDIO ERROR: warning: redefined improper parameters for i hr3 nr1 cph1 cph1
MDIO ERROR: warning: redefined improper parameters for i hr3 nr2 cph1 cph1
MDIO ERROR: cannot open file "restart.coor"
# Using device 0
# There are 2 devices supporting CUDA
# Device 0: "GeForce GTX 580"
# Clock rate: 1.59 GHz
# Total amount of global memory: 1543045120 bytes
# Number of multiprocessors: 16
# Number of cores: 128
# Device 1: "GeForce GTX 580"
# Clock rate: 1.59 GHz
# Total amount of global memory: 1543176192 bytes
# Number of multiprocessors: 16
# Number of cores: 128
MDIO ERROR: warning: redefined atom parameters for ht
MDIO ERROR: warning: redefined atom parameters for ot
MDIO ERROR: warning: redefined atom parameters for cph1
MDIO ERROR: warning: redefined atom parameters for cph2
MDIO ERROR: warning: redefined atom parameters for nr1
MDIO ERROR: warning: redefined atom parameters for nr2
MDIO ERROR: warning: redefined atom parameters for hr3
MDIO ERROR: warning: redefined atom parameters for hr1
MDIO ERROR: warning: redefined bond parameters for ht ht
MDIO ERROR: warning: redefined bond parameters for ht ot
MDIO ERROR: warning: redefined bond parameters for nr1 cph1
MDIO ERROR: warning: redefined bond parameters for nr1 cph2
MDIO ERROR: warning: redefined bond parameters for nr2 cph1
MDIO ERROR: warning: redefined bond parameters for nr2 cph2
MDIO ERROR: warning: redefined bond parameters for cph1 cph1
MDIO ERROR: warning: redefined bond parameters for hr1 cph2
MDIO ERROR: warning: redefined bond parameters for hr3 cph1
MDIO ERROR: warning: redefined angle parameters for cph2 nr1 cph1
MDIO ERROR: warning: redefined angle parameters for cph2 nr2 cph1
MDIO ERROR: warning: redefined angle parameters for nr1 cph1 cph1
MDIO ERROR: warning: redefined angle parameters for nr1 cph2 nr2
MDIO ERROR: warning: redefined angle parameters for nr2 cph1 cph1
MDIO ERROR: warning: redefined angle parameters for nr1 cph2 hr1
MDIO ERROR: warning: redefined angle parameters for nr2 cph2 hr1
MDIO ERROR: warning: redefined angle parameters for hr3 cph1 cph1
MDIO ERROR: warning: redefined angle parameters for nr1 cph1 hr3
MDIO ERROR: warning: redefined angle parameters for nr2 cph1 hr3
MDIO ERROR: warning: redefined angle parameters for ht ot ht
MDIO ERROR: warning: redefined dihedral parameters for d cph2 nr1 cph1 cph1
MDIO ERROR: warning: redefined dihedral parameters for d cph2 nr2 cph1 cph1
MDIO ERROR: warning: redefined dihedral parameters for d nr1 cph1 cph1 hr3
MDIO ERROR: warning: redefined dihedral parameters for d nr1 cph2 nr2 cph1
MDIO ERROR: warning: redefined dihedral parameters for d nr2 cph1 cph1 nr1
MDIO ERROR: warning: redefined dihedral parameters for d nr2 cph2 nr1 cph1
MDIO ERROR: warning: redefined dihedral parameters for d hr1 cph2 nr1 cph1
MDIO ERROR: warning: redefined dihedral parameters for d hr1 cph2 nr2 cph1
MDIO ERROR: warning: redefined dihedral parameters for d hr3 cph1 cph1 hr3
MDIO ERROR: warning: redefined dihedral parameters for d hr3 cph1 nr1 cph2
MDIO ERROR: warning: redefined dihedral parameters for d hr3 cph1 nr2 cph2
MDIO ERROR: warning: redefined dihedral parameters for d nr2 cph1 cph1 hr3
MDIO ERROR: warning: redefined improper parameters for i hr1 nr1 nr2 cph2
MDIO ERROR: warning: redefined improper parameters for i hr1 nr2 nr1 cph2
MDIO ERROR: warning: redefined improper parameters for i hr3 cph1 nr1 cph1
MDIO ERROR: warning: redefined improper parameters for i hr3 cph1 nr2 cph1
MDIO ERROR: warning: redefined improper parameters for i hr3 nr1 cph1 cph1
MDIO ERROR: warning: redefined improper parameters for i hr3 nr2 cph1 cph1
MDIO ERROR: cannot open file "restart.coor"
ERROR: file deven.cpp line 879: # Energies have become nan

called boinc_finish

</stderr_txt>
]]>


Validate state Invalid
Claimed credit 38584.7222222222
Granted credit 0
application version Long runs (8-12 hours on fastest card) v6.13 (cuda31)

I have since gone back to the 266.58 driver and had no issues.
Cheers
Ross*
____________

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 21037 - Posted: 22 Apr 2011 | 8:03:49 UTC - in response to Message 21036.

The error is "Energies have become nan".
Not sure I would attribute this to the driver though; this error has been seen many times before under many drivers.

MarkJ
Volunteer moderator
Volunteer tester
Joined: 24 Dec 08
Posts: 738
Credit: 200,909,904
RAC: 0
Message 21055 - Posted: 25 Apr 2011 | 14:49:10 UTC
Last modified: 25 Apr 2011 | 14:51:49 UTC

I would add that it's typically caused by overclocking. I had similar issues on my reference design GTX570, which seemed to go away after dropping back to stock.

See the Energies have become nan thread
____________
BOINC blog

Ken Florian
Joined: 4 May 12
Posts: 56
Credit: 1,832,989,878
RAC: 0
Message 25835 - Posted: 22 Jun 2012 | 22:26:07 UTC - in response to Message 21055.

I have work units failing, sporadically, with "energies have become nan". The programmer in me loves the mysterious nature of the message. The cruncher in me wonders, "What am I doing wrong?"

EVGA GTX-690 Signature, not OC'ed and no manual adjustments made to any settings.
1 GPU at about 82C, utilization about 87%
1 GPU at about 60C, utilization about 60%

Intel E8200, not OC'ed
8G ram
Antec Earth Watts 650
Single 7200RPM sata drive
Nothing else in the machine
Boinc 7.0.25
Win7, X64

Since I've never made manual adjustments to a gpu, if changes are necessary, please give me specific recommendations. I will be using EVGA's PrecisionX tool.

Thanks,

Ken

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 25837 - Posted: 23 Jun 2012 | 7:53:57 UTC - in response to Message 25835.
Last modified: 23 Jun 2012 | 21:14:28 UTC

nan means "not a number".
I think this error occurs when the GPU has an unrecoverable failure, and as described in this thread is often a result of overclocking, temps being too high, or voltage being insufficient. Not all tasks are the same here; some utilize the GPU to a greater extent, demanding more from it. These tasks are more prone to failures. The lack of thread-safe code may also be an issue, but to some extent that would just hide a bad setup.
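
To illustrate why a single bad value is fatal for a task, here is a small sketch of how NaN spreads through a sum. It is not the project's code: the function names and the sample energy terms are invented, and the check only mimics the spirit of the error message seen in this thread.

```python
import math

def total_energy(terms):
    """Sum per-term energies; a single NaN term makes the whole sum NaN."""
    return sum(terms)

def check_energy(energy):
    """Raise if the energy is NaN (hypothetical check, loosely modelled on the error text)."""
    if math.isnan(energy):
        raise FloatingPointError("Energies have become nan")
    return energy

good = total_energy([-1250.5, 310.25, 42.0])
bad = total_energy([-1250.5, float("nan"), 42.0])  # one corrupted term

print(check_energy(good))  # -898.25
print(math.isnan(bad))     # True
print(bad == bad)          # False: NaN never compares equal, not even to itself
```

Once one term is corrupted, every later result that depends on it is NaN as well, so the only sensible options are to stop or to go back to a state saved before the fault.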

82°C is pushing it, slightly.

Try the generic recommendations for this situation:
Increase the GPU Fan speed so that it keeps the GPU below 70°C,
Reduce the GDDR5 frequency by 10% or 20% should that fail,
Increase the Voltage (if you can on that card), but only by the least amount (typically ~0.025V)
Also consider better cooling - adding a case fan or two, or just leaving the door off, slightly. Removing back plates might also be useful, as might a better CPU fan; one that radiates or blows less heat onto the GPU. If you don't get anywhere with the above, try downclocking the CPU in W7 Power management (not ideal but substantially reduces heat from the CPU).
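
If you prefer a script to a GUI tool, the temperature check can be sketched as below. This is an assumption-laden example: nobody in this thread used `nvidia-smi` (the tools mentioned are MSI Afterburner and PrecisionX), the query only works on drivers whose `nvidia-smi` supports `--query-gpu`, and the function names and the sample readings are made up. The 70°C limit is the target suggested above.

```python
import subprocess

TEMP_LIMIT_C = 70  # target temperature suggested in the post above

def parse_gpu_csv(text):
    """Parse rows of 'index, temperature, utilization' into a list of dicts."""
    gpus = []
    for row in text.strip().splitlines():
        index, temp, util = (field.strip() for field in row.split(","))
        gpus.append({"index": int(index), "temp_c": int(temp), "util_pct": int(util)})
    return gpus

def too_hot(gpus, limit=TEMP_LIMIT_C):
    """Return the indices of GPUs running above the limit."""
    return [g["index"] for g in gpus if g["temp_c"] > limit]

def query_nvidia_smi():
    """Read live values; requires an NVIDIA driver with nvidia-smi on the PATH."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,temperature.gpu,utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)

# Sample text shaped like the CSV output, using the readings quoted earlier in the thread.
sample = "0, 82, 87\n1, 60, 60\n"
print(too_hot(parse_gpu_csv(sample)))  # [0]
```

On a real machine, `too_hot(query_nvidia_smi())` would give the same kind of list for the installed cards.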
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 6,169
Message 25841 - Posted: 23 Jun 2012 | 18:35:35 UTC

The definition of NaN from Wikipedia explains a lot.

Profile Raptures Riot
Joined: 30 Apr 11
Posts: 6
Credit: 220,588,795
RAC: 0
Message 26639 - Posted: 17 Aug 2012 | 21:45:36 UTC - in response to Message 20538.

My candid feeling is that the 6.16 (Cuda 42) is an aggressive routine. My 470s are running at stock voltages and frequencies. Some potential adjustments (especially for the inexperienced among us) could shorten the life of these cards and reduce our contributions to this project. The previous coding ran nearly flawlessly for me. I take these suggestions to heart, but also with a grain of hesitation. I, for one, am not experienced enough to achieve high success on this Grid while compromising future potential. It is, as it always is, to each their own. Caution is always best. I may be wrong, but I am always willing to learn.

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,201,255,749
RAC: 6,169
Message 26641 - Posted: 17 Aug 2012 | 23:23:13 UTC - in response to Message 26639.
Last modified: 17 Aug 2012 | 23:25:21 UTC

My candid feeling is that the 6.16 (Cuda 42) is an aggressive routine.

I came to the same conclusion when I had to set my GTX 590 to 625MHz for the CUDA 4.2 client. It was running fine at 725MHz with the CUDA 3.1 client. However, the CUDA 4.2 client is 40% faster than the CUDA 3.1 client, so it can do 20% more work at the lower frequency.
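
The arithmetic behind that figure can be checked in a few lines. Two assumptions are made here that the post does not state outright: that throughput scales linearly with the GPU clock, and that the 40% speedup refers to both clients running at the same clock. The variable names are only for illustration.

```python
old_clock_mhz = 725        # CUDA 3.1 client ran fine at this clock
new_clock_mhz = 625        # clock needed for the CUDA 4.2 client
speedup_same_clock = 1.40  # CUDA 4.2 client is 40% faster (assumed: at equal clocks)

# Assumption: throughput is proportional to the clock frequency.
relative_throughput = (new_clock_mhz / old_clock_mhz) * speedup_same_clock
print(f"{relative_throughput:.2f}")  # 1.21
```

So under these assumptions the downclocked card with the newer client does roughly 20% more work than before, which matches the number in the post.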

My 470's are running at stock voltages and frequencies.

These cards were made for gaming, not for crunching. When a card fails once in a 4 hour gaming session, the player hardly notices the glitch caused by that failure at all. So these cards may be undervolted by their factory settings, especially the GTX 470 and the GTX 480. When a 4 hour workunit experiences the same failure, it will run into an error message. It is debatable whether the client should try to go on from the last checkpoint in this case, instead of aborting the task immediately. But if a card is unreliable, it's safer to discard the entire workunit.
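
The two policies (abort at once, or go on from the last checkpoint) can be compared with a toy loop. This is purely a sketch of the idea, not how the client works: `run`, `flaky_step`, the checkpoint interval and the single transient fault at step 25 are all invented for illustration.

```python
import math

def run(steps, step_fn, state, checkpoint_every=10, max_rollbacks=0):
    """Advance `state` for `steps` steps; on NaN either roll back or abort.

    max_rollbacks=0 gives the abort-on-first-NaN behaviour.
    """
    checkpoint = (0, state)
    rollbacks = 0
    i = 0
    while i < steps:
        new_state = step_fn(i, state)
        if math.isnan(new_state):
            if rollbacks >= max_rollbacks:
                raise FloatingPointError("Energies have become nan")
            rollbacks += 1
            i, state = checkpoint  # resume from the last saved state
            continue
        state = new_state
        i += 1
        if i % checkpoint_every == 0:
            checkpoint = (i, state)
    return state, rollbacks

faults = {25}  # hypothetical transient fault: happens once, at step 25

def flaky_step(i, x):
    if i in faults:
        faults.discard(i)  # the fault does not repeat on the retry
        return float("nan")
    return x + 1.0

print(run(40, flaky_step, 0.0, max_rollbacks=1))  # (40.0, 1)
```

With `max_rollbacks=0` the same fault ends the run immediately. The rollback variant only helps if the fault really is transient; on an unreliable card it would keep retrying and could still pass silently corrupted results that never become NaN, which is the argument for discarding the workunit.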

Some potential adjustments (especially for the inexperienced of us) could shorten the life of these cards and reduce our contributions to this project.

Errors reduce the contributions to this project. Errors cost a lot of electricity for the cruncher in vain. If you lower your GPU frequency, it won't shorten your card's lifespan, but it will make it more reliable at the same voltage.

The previous coding ran nearly flawlessly for me.

I have a Gigabyte GTX 480 with 1000mV GPU voltage by default. It was nearly flawless, while my ASUS GTX 480 at 1025mV was really flawless. I raised the Gigabyte's voltage to 1025mV, and it also became really flawless. That was more than 2 years ago, and these cards are still crunching 24/7 at an even higher voltage and frequency (equipped with better than factory cooling).

Profile Raptures Riot
Joined: 30 Apr 11
Posts: 6
Credit: 220,588,795
RAC: 0
Message 26643 - Posted: 18 Aug 2012 | 3:20:55 UTC - in response to Message 26641.

Thank you Retvari for your reply. I'm making small changes to try and improve my reliability. The card's <less than ideal> gaming configuration is a valuable point and I have learned something.

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 28318 - Posted: 31 Jan 2013 | 18:32:30 UTC - in response to Message 26643.
Last modified: 31 Jan 2013 | 18:39:43 UTC

We could do with some feedback regarding these recent "Energies have become nan" errors:

Name 48_14-NOELIA_hfXA_long_30-0-2-RND7978_2
Workunit 4080964
Created 30 Jan 2013 | 10:26:34 UTC
Sent 30 Jan 2013 | 14:18:32 UTC
Received 30 Jan 2013 | 22:24:51 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status 98 (0x62)
Computer ID 139265
Report deadline 4 Feb 2013 | 14:18:32 UTC
Run time 28,221.08
CPU time 28,208.29
Validate state Invalid
Credit 0.00
Application version Long runs (8-12 hours on fastest card) v6.17 (cuda42)

    Stderr output

    <core_client_version>7.0.44</core_client_version>
    <![CDATA[
    <message>
    - exit code 98 (0x62)
    </message>
    <stderr_txt>
    MDIO: cannot open file "restart.coor"
    SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1574.
    Assertion failed: a, file swanlibnv2.cpp, line 59

    This application has requested the Runtime to terminate it in an unusual way.
    Please contact the application's support team for more information.
    MDIO: cannot open file "restart.coor"
    SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1574.
    Assertion failed: a, file swanlibnv2.cpp, line 59

    This application has requested the Runtime to terminate it in an unusual way.
    Please contact the application's support team for more information.
    MDIO: cannot open file "restart.coor"
    MDIO: cannot open file "restart.coor"
    ERROR: file deven.cpp line 1106: # Energies have become nan

    called boinc_finish

    </stderr_txt>
    ]]>



Thanks,
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help
