Message boards : Number crunching : Too many errors (may have bug)

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 18705 - Posted: 16 Sep 2010 | 10:36:33 UTC
Last modified: 16 Sep 2010 | 11:29:07 UTC

My task, g240-TONI_CAPBIND99SB-48-200-RND7652, and other TONI WUs have the same bug.

If anyone gets this TONI failure, restart your computer.

It looks like the error is influencing other tasks!

My system:
AMD 64 X2 Dual Core Processor 5200+ [Family 15 Model 67 Stepping 2] (2 processors)
NVIDIA GeForce GTX 470 (1279MB) driver: 25896
Microsoft Windows XP Professional x86 Edition, Service Pack 3, (05.01.2600.00)

The TONI task failed after 3 seconds (other crunchers report similar failures).
My next work unit behaved strangely. I checked in on it when it was about ¾ of the way through its run: the GPU temperature was 53 °C and the task was running at 83%. This just does not add up; on a Fermi, a task running at 83% would have the GPU over 70 °C.
I rebooted the system, and the task is now running at 98% with the temperature just over 70 °C (with the fan turned up).

Ton, thanks for the warning.

Bill Deilke*
Joined: 4 Jun 08
Posts: 4
Credit: 5,174,815
RAC: 0
Message 18716 - Posted: 17 Sep 2010 | 20:04:39 UTC

I aborted 2 in the last week that seemed to run forever.

BarryAZ
Joined: 16 Apr 09
Posts: 163
Credit: 920,275,294
RAC: 0
Message 18728 - Posted: 21 Sep 2010 | 1:16:42 UTC

I have been seeing a raft of computation errors on several systems, mostly running Windows XP with 9800GT cards. This seems to be a revisit of a problem that plagued the GPUGrid applications for me a year ago. It has affected ALL of my systems running that combination, and I have NOT encountered similar problems with three other BOINC projects that use the same GPU on the same systems (SETI, Dnetc, Collatz). So rather than simply 'shoulder shrug' and try, try again, I am backing off of GPUGrid for now.

I would hope that there would be some responsive comment on this out here, but historically what I have seen here is either denial (it must be my hardware or software, even though other BOINC GPU apps complete properly) or non-response. Then, perhaps in a week or two, I will try again, some unacknowledged change will have been made, and all will be well again.



Skip Da Shu
Joined: 13 Jul 09
Posts: 63
Credit: 1,338,595,165
RAC: 2,451,862
Message 18731 - Posted: 21 Sep 2010 | 4:18:58 UTC
Last modified: 21 Sep 2010 | 4:34:14 UTC

I'm getting over a 50% error rate across several quads. The majority of the cards are GTS-250s, but there is a GTX-275 and a couple of 8800GTs thrown in as well. The same machines also do either DNETC or Collatz w/o issues.

One of the worst is crunch30.

crunch35 has both a GTS-250 and a GTX-275.

Can anyone provide any insight into this?

UPDATE: Worse than I thought... 7 pages of valid results, 18 pages of errors.
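
For a rough sense of scale, the page counts above can be turned into an error rate. This is only a back-of-the-envelope sketch: it assumes every results page holds the same number of tasks, and the function name is made up for illustration.

```python
def error_rate(valid_pages: int, error_pages: int) -> float:
    """Fraction of result pages that list errors (assumes equally full pages)."""
    total = valid_pages + error_pages
    if total == 0:
        raise ValueError("no results to summarise")
    return error_pages / total


# 7 pages of valid results, 18 pages of errors
rate = error_rate(valid_pages=7, error_pages=18)
print(f"{rate:.0%}")  # 72%
```

On those numbers the error rate is 18 / 25, i.e. about 72%, which is well above the "over 50%" first reported.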
____________
- da shu @ HeliOS,
"A child's exposure to technology should never be predicated on an ability to afford it."

BarryAZ
Joined: 16 Apr 09
Posts: 163
Credit: 920,275,294
RAC: 0
Message 18742 - Posted: 22 Sep 2010 | 2:22:50 UTC - in response to Message 18731.

As I noted before, the normal scenario here when problems like this crop up is that there is a real limit to the amount of response we can expect...


Bill Deilke*
Joined: 4 Jun 08
Posts: 4
Credit: 5,174,815
RAC: 0
Message 18744 - Posted: 22 Sep 2010 | 3:49:09 UTC

I just aborted another one that ran on too long. Between the problems others are having and the ones I am experiencing, I am pulling out until the bad WUs clear the catch.

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 18745 - Posted: 22 Sep 2010 | 13:20:53 UTC - in response to Message 18744.

Bill, you aborted a long task about 60% through its run.

All your tasks are running well on both systems.

Bill Deilke*
Joined: 4 Jun 08
Posts: 4
Credit: 5,174,815
RAC: 0
Message 18747 - Posted: 22 Sep 2010 | 14:21:45 UTC - in response to Message 18745.

Sorry, I already moved off; I'll be back later. Trying to run SETI since 1999 has made me gun-shy (crazy). So many projects, so little time. Thanks for the response. I generally get criticism on other sites, so I don't post.

Saenger
Joined: 20 Jul 08
Posts: 134
Credit: 23,657,183
RAC: 0
Message 18818 - Posted: 4 Oct 2010 | 13:50:03 UTC
Last modified: 4 Oct 2010 | 13:50:32 UTC

I got an error on a TONI_CAPBIND as well, like many others. I can't see the original error in this thread any longer, as the WU is already purged, but here is my stderr:

<core_client_version>6.10.17</core_client_version>
<![CDATA[
<message>
process exited with code 98 (0x62, -158)
</message>
<stderr_txt>
# There is 1 device supporting CUDA
# Device 0: "GeForce GT 240"
# Clock rate: 1.34 GHz
# Total amount of global memory: 536150016 bytes
# Number of multiprocessors: 12
# Number of cores: 96
MDIO ERROR: read error for file "input.coor", byte number 0: expected to read number of atoms
ERROR: file mdioload.cpp line 80: Unable to read bincoordfile

14:58:03 (10049): called boinc_finish

</stderr_txt>
]]>

It failed after 2 seconds, so no real harm done, except that a CASHIF_HIVPR is now running, so far without problems. I hope it won't be affected, as I can't restart the computer without trashing a checkpointless RNA World WU after 24h, and I don't want to do that ;)

If this is a problem with the work generator (some input file not generated properly), perhaps it should be looked into somehow.
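
If the input file really was generated empty or truncated, a quick sanity check on the file would catch it before the task is sent out. The sketch below is only a guess at what such a check could look like. It assumes `input.coor` uses the NAMD-style binary coordinate layout (a 4-byte atom count followed by three 8-byte doubles per atom, little-endian); the thread does not confirm the format, and `check_bincoor` is a hypothetical helper, not part of the GPUGrid application.

```python
import os
import struct


def check_bincoor(path: str) -> int:
    """Return the atom count from the header, or raise ValueError if the file looks broken.

    Assumed layout (not confirmed by the project): a little-endian 4-byte
    integer atom count, followed by 3 * natoms 8-byte doubles.
    """
    size = os.path.getsize(path)  # raises OSError if the file is missing
    if size < 4:
        # Matches the symptom in the log: nothing to read at byte number 0.
        raise ValueError(f"{path}: only {size} bytes, no atom count to read")
    with open(path, "rb") as f:
        (natoms,) = struct.unpack("<i", f.read(4))
    expected = 4 + natoms * 3 * 8
    if natoms <= 0 or size != expected:
        raise ValueError(
            f"{path}: header says {natoms} atoms, "
            f"expected {expected} bytes, found {size}"
        )
    return natoms
```

A zero-byte `input.coor` would fail the first test, which is consistent with the "byte number 0: expected to read number of atoms" message, though a download or disk problem on the client could produce the same symptom.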
____________
Greetings from Saenger

For questions about Boinc look in the BOINC-Wiki

Fred J. Verster
Joined: 1 Apr 09
Posts: 58
Credit: 35,833,978
RAC: 0
Message 18919 - Posted: 12 Oct 2010 | 15:56:19 UTC - in response to Message 18818.
Last modified: 12 Oct 2010 | 16:00:19 UTC

Found some errors:
<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
The system cannot find the path specified. (0x3) - exit code 3 (0x3)
</message>
<stderr_txt>
# Using device 0
# There is 1 device supporting CUDA
# Device 0: "GeForce 9800 GT"
# Clock rate: 1.57 GHz
# Total amount of global memory: 523829248 bytes
# Number of multiprocessors: 14
# Number of cores: 112
MDIO ERROR: cannot open file "restart.coor"
SWAN : FATAL : Failure executing kernel [mshake_position_kernel_1] [2] [10,1,1][64,1,1]
Assertion failed: 0, file swanlib_nv.cpp, line 281

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.

</stderr_txt>
]]>
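
When comparing failures like the two logs in this thread, it can help to pull out just the exit code and the fatal lines. The following is a minimal sketch, assuming the stderr text has been copied from the task's result page into a string. `summarise_stderr` and the list of error prefixes are my own choices, based only on the two logs posted here, and are not anything provided by the BOINC client.

```python
import re

# Matches both "process exited with code 98" and "exit code 3".
EXIT_RE = re.compile(r"exit(?:ed with)? code (\d+)")

# Line prefixes seen in the two logs in this thread (an assumption, not a full list).
ERROR_RE = re.compile(
    r"^(?:MDIO ERROR|ERROR|SWAN : FATAL|Assertion failed)\b.*$",
    re.MULTILINE,
)


def summarise_stderr(text: str):
    """Return (exit_code or None, list of error lines) for a pasted stderr block."""
    match = EXIT_RE.search(text)
    exit_code = int(match.group(1)) if match else None
    errors = [line.strip() for line in ERROR_RE.findall(text)]
    return exit_code, errors


log = """<message>
The system cannot find the path specified. (0x3) - exit code 3 (0x3)
</message>
<stderr_txt>
MDIO ERROR: cannot open file "restart.coor"
SWAN : FATAL : Failure executing kernel [mshake_position_kernel_1] [2] [10,1,1][64,1,1]
Assertion failed: 0, file swanlib_nv.cpp, line 281
</stderr_txt>"""

code, errors = summarise_stderr(log)
print(code)
for line in errors:
    print(line)
```

Run on the log above, this reports exit code 3 and the three error lines. The earlier GT 240 log gives exit code 98 and the two MDIO lines, which makes it easier to see that the two failures are not the same error.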


g199r2-TONI_KKi4-2-200-RND6081_0
x199y2-TONI_KKi4-2-200-RND6081_0

____________

Knight Who Says Ni N!
