Too many errors (may have bug)

Author	Message
skgiven Volunteer moderator Volunteer tester Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0	Message 18705 - Posted: 16 Sep 2010 \| 10:36:33 UTC Last modified: 16 Sep 2010 \| 11:29:07 UTC
	My task, g240-TONI_CAPBIND99SB-48-200-RND7652, and other TONI WUs have the same bug. If anyone gets this TONI failure, restart your computer. It looks like the error is influencing other tasks!
	My system: AMD 64 X2 Dual Core Processor 5200+ [Family 15 Model 67 Stepping 2] (2 processors), NVIDIA GeForce GTX 470 (1279MB), driver: 25896, Microsoft Windows XP Professional x86 Edition, Service Pack 3, (05.01.2600.00).
	The TONI task failed after 3 sec (similar failures from other crunchers). My next work unit behaved strangely. I checked in on it when it was about ¾ of the way through its run: the GPU temperature was 53 °C and the task was running at 83%. This just does not add up. On a Fermi, a task running at 83% would have the GPU over 70 °C. I rebooted the system, and the task is now running at 98% and the temperature is just over 70 °C (with the fan turned up).
	Ton, thanks for the warning.

Bill Deilke* Joined: 4 Jun 08 Posts: 4 Credit: 5,174,815 RAC: 0	Message 18716 - Posted: 17 Sep 2010 \| 20:04:39 UTC
	I aborted two tasks in the last week that seemed to run forever.

BarryAZ Joined: 16 Apr 09 Posts: 163 Credit: 920,875,294 RAC: 2	Message 18728 - Posted: 21 Sep 2010 \| 1:16:42 UTC
	I have been seeing a raft of computation errors on several systems, mostly running Windows XP and 9800GT cards. This seems to be a revisit of a problem which plagued the GPUGrid applications a year ago for me. It has affected ALL of my systems running that combination, and I have NOT encountered similar problems with three other BOINC projects which utilize the same GPU on the same systems (SETI, Dnetc, Collatz). So rather than simply 'shoulder shrug' and try, try again, I am backing off of GPUGrid for now.
	I would hope that there would be some responsive comment on this out here, but historically, what I have seen here is either denial (it must be my hardware or software, even though other BOINC GPU apps complete properly) or non-response. Then, perhaps in a week or two, I will try again, and some unacknowledged change will have been made and all will be well again.

Skip Da Shu Joined: 13 Jul 09 Posts: 63 Credit: 2,481,426,345 RAC: 11,302,877	Message 18731 - Posted: 21 Sep 2010 \| 4:18:58 UTC Last modified: 21 Sep 2010 \| 4:34:14 UTC
	I'm getting over a 50% error rate across several quads. The majority of the cards are GTS-250s, but there is a GTX-275 and a couple of 8800GTs thrown in also. The same machines also do either DNETC or Collatz w/o issues. One of the worst is crunch30. crunch35 has both a GTS-250 and a GTX-275. Can anyone provide any insight into this?
	UPDATE: Worse than I thought... 7 pages of valid results, 18 pages of errors.
	____________
	- da shu @ HeliOS, "A child's exposure to technology should never be predicated on an ability to afford it."
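For a rough sense of scale, the page counts in the update above can be turned into an error rate. This is only a sketch: it assumes every results page lists the same number of tasks, which the post does not state, and `error_rate` is a name made up for illustration.

```python
def error_rate(valid_pages: int, error_pages: int) -> float:
    """Fraction of results that errored, assuming equal-length pages."""
    total = valid_pages + error_pages
    if total == 0:
        raise ValueError("no results to compare")
    return error_pages / total


# Page counts quoted in the post: 7 pages valid, 18 pages of errors.
rate = error_rate(valid_pages=7, error_pages=18)
print(f"{rate:.0%}")  # 72%
```

Under that assumption the figure is well above the "over 50%" first reported, which fits the "worse than I thought" remark.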

BarryAZ Joined: 16 Apr 09 Posts: 163 Credit: 920,875,294 RAC: 2	Message 18742 - Posted: 22 Sep 2010 \| 2:22:50 UTC - in response to Message 18731.
	As I noted before, the normal scenario here when problems like this crop up is that there is a real limit to the amount of response we can expect...

Bill Deilke* Joined: 4 Jun 08 Posts: 4 Credit: 5,174,815 RAC: 0	Message 18744 - Posted: 22 Sep 2010 \| 3:49:09 UTC
	I just aborted another one that ran on too long. Between the problems others are having and the ones I am experiencing, I am pulling out until the bad WUs clear the catch.

skgiven Volunteer moderator Volunteer tester Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0	Message 18745 - Posted: 22 Sep 2010 \| 13:20:53 UTC - in response to Message 18744.
	Bill, you aborted a long task about 60% through its run. All your tasks are running well on both systems.

Bill Deilke* Joined: 4 Jun 08 Posts: 4 Credit: 5,174,815 RAC: 0	Message 18747 - Posted: 22 Sep 2010 \| 14:21:45 UTC - in response to Message 18745.
	Sorry, I already moved off; I'll be back later. Trying to run SETI since 1999 has made me gun shy (crazy). So many projects, so little time. Thanks for the response. I generally get criticism on other sites, so I don't post.

Saenger Joined: 20 Jul 08 Posts: 134 Credit: 23,657,183 RAC: 0	Message 18818 - Posted: 4 Oct 2010 \| 13:50:03 UTC Last modified: 4 Oct 2010 \| 13:50:32 UTC
	I got an error on a TONI_CAPBIND as well, like many others. I can't see the original error in this thread any longer, as the WU is already purged, but here is my stderr:
	<core_client_version>6.10.17</core_client_version>
	<![CDATA[
	<message>
	process exited with code 98 (0x62, -158)
	</message>
	<stderr_txt>
	# There is 1 device supporting CUDA
	# Device 0: "GeForce GT 240"
	# Clock rate: 1.34 GHz
	# Total amount of global memory: 536150016 bytes
	# Number of multiprocessors: 12
	# Number of cores: 96
	MDIO ERROR: read error for file "input.coor", byte number 0: expected to read number of atoms
	ERROR: file mdioload.cpp line 80: Unable to read bincoordfile
	14:58:03 (10049): called boinc_finish
	</stderr_txt>
	]]>
	It failed after 2 seconds, so no real harm done, except that now a CASHIF_HIVPR is running, so far without problems. I hope it won't be affected, as I can't restart the computer without trashing a checkpointless RNA-world WU after 24h, and I don't want to do that ;)
	If this is a problem with the work generator (some input file not generated properly), perhaps it should be looked into somehow.
	____________
	Greetings from Saenger
	For questions about Boinc look in the BOINC-Wiki
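The "byte number 0" message in the report above suggests the reader could not even get the atom count from the start of input.coor, which would fit an empty or truncated file. Here is a minimal pre-flight check, as a sketch only. It assumes a layout of one 4-byte little-endian integer (the atom count) followed by three 8-byte floats per atom, as in NAMD-style binary coordinate files. Whether the GPUGrid application uses exactly this layout is not confirmed anywhere in this thread, and `check_bincoor` is a hypothetical helper, not part of BOINC or the application.

```python
import os
import struct
import tempfile


def check_bincoor(path: str) -> str:
    """Return 'ok' or a short description of what looks wrong with the file.

    Assumed layout (not confirmed for GPUGrid): int32 atom count,
    then 3 float64 values (x, y, z) per atom, little-endian.
    """
    size = os.path.getsize(path)
    if size < 4:
        return f"too short to hold an atom count ({size} bytes)"
    with open(path, "rb") as fh:
        (natoms,) = struct.unpack("<i", fh.read(4))
    if natoms <= 0:
        return f"implausible atom count {natoms}"
    expected = 4 + natoms * 3 * 8
    if size != expected:
        return f"size {size} does not match {expected} bytes for {natoms} atoms"
    return "ok"


# Demo on two throwaway files: an empty one and a well-formed 2-atom one.
with tempfile.TemporaryDirectory() as tmp:
    empty = os.path.join(tmp, "empty.coor")
    open(empty, "wb").close()

    good = os.path.join(tmp, "good.coor")
    with open(good, "wb") as fh:
        fh.write(struct.pack("<i", 2))
        fh.write(struct.pack("<6d", 0.0, 0.0, 0.0, 1.0, 1.0, 1.0))

    print(check_bincoor(empty))
    print(check_bincoor(good))
```

A check like this on the server side, before a work unit is sent out, would catch an input file that was never written, which is the work generator scenario raised in the post.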

Fred J. Verster Joined: 1 Apr 09 Posts: 58 Credit: 35,833,978 RAC: 0	Message 18919 - Posted: 12 Oct 2010 \| 15:56:19 UTC - in response to Message 18818. Last modified: 12 Oct 2010 \| 16:00:19 UTC
	Found some errors:
	<core_client_version>6.10.58</core_client_version>
	<![CDATA[
	<message>
	The system cannot find the path specified. (0x3) - exit code 3 (0x3)
	</message>
	<stderr_txt>
	# Using device 0
	# There is 1 device supporting CUDA
	# Device 0: "GeForce 9800 GT"
	# Clock rate: 1.57 GHz
	# Total amount of global memory: 523829248 bytes
	# Number of multiprocessors: 14
	# Number of cores: 112
	MDIO ERROR: cannot open file "restart.coor"
	SWAN : FATAL : Failure executing kernel [mshake_position_kernel_1] [2] [10,1,1][64,1,1]
	Assertion failed: 0, file swanlib_nv.cpp, line 281
	This application has requested the Runtime to terminate it in an unusual way.
	Please contact the application's support team for more information.
	</stderr_txt>
	]]>
	g199r2-TONI_KKi4-2-200-RND6081_0 . x199y2-TONI_KKi4- 2-200-RND6081_0
	____________
	Knight Who Says Ni N!
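When comparing failures across many tasks, it can help to pull just the error lines out of each stderr report. This is a rough sketch of one way to do that. `summarize_stderr` and the marker list are made up for illustration, and the line breaks in the sample are an assumption about how the original log was laid out.

```python
import re

# Substrings treated as signs of a failure; chosen from the logs in this thread.
ERROR_MARKERS = ("ERROR", "FATAL", "Assertion failed")


def summarize_stderr(report: str) -> dict:
    """Split a stderr report into device-info lines and error lines."""
    match = re.search(r"<stderr_txt>(.*?)</stderr_txt>", report, re.DOTALL)
    body = match.group(1) if match else report
    lines = [ln.strip() for ln in body.splitlines() if ln.strip()]
    return {
        # Lines starting with '#' carry device information in these logs.
        "device_info": [ln[1:].strip() for ln in lines if ln.startswith("#")],
        "errors": [ln for ln in lines if any(m in ln for m in ERROR_MARKERS)],
    }


sample = """<stderr_txt>
# Using device 0
# Device 0: "GeForce 9800 GT"
MDIO ERROR: cannot open file "restart.coor"
SWAN : FATAL : Failure executing kernel [mshake_position_kernel_1] [2] [10,1,1][64,1,1]
Assertion failed: 0, file swanlib_nv.cpp, line 281
</stderr_txt>"""

summary = summarize_stderr(sample)
for line in summary["errors"]:
    print(line)
```

Run over the two reports in this thread, this would separate the early input-file failure (an MDIO error only) from the later one, where an MDIO error is followed by a kernel failure.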
