
Message boards : Graphics cards (GPUs) : Problems with WU's

Profile darkstarz1
Joined: 18 Sep 09
Posts: 2
Credit: 10,104,596
RAC: 127
Message 16024 - Posted: 28 Mar 2010 | 11:12:49 UTC
Last modified: 28 Mar 2010 | 11:13:16 UTC

I had been running GPUGRID without any problems until the last few days. Now I have over a dozen failed WUs, with these messages:

28/03/2010 11:56:29 GPUGRID Computation for task a62-TONI_HERG79a-15-100-RND2348_0 finished
28/03/2010 11:56:29 GPUGRID Output file a62-TONI_HERG79a-15-100-RND2348_0_1 for task a62-TONI_HERG79a-15-100-RND2348_0 absent
28/03/2010 11:56:29 GPUGRID Output file a62-TONI_HERG79a-15-100-RND2348_0_2 for task a62-TONI_HERG79a-15-100-RND2348_0 absent
28/03/2010 11:56:29 GPUGRID Output file a62-TONI_HERG79a-15-100-RND2348_0_3 for task a62-TONI_HERG79a-15-100-RND2348_0 absent
28/03/2010 11:56:30 GPUGRID Started upload of a62-TONI_HERG79a-15-100-RND2348_0_0
28/03/2010 11:56:30 GPUGRID Started upload of a62-TONI_HERG79a-15-100-RND2348_0_4
28/03/2010 11:56:31 GPUGRID Finished upload of a62-TONI_HERG79a-15-100-RND2348_0_0
28/03/2010 11:56:31 GPUGRID Finished upload of a62-TONI_HERG79a-15-100-RND2348_0_4
28/03/2010 11:56:31 GPUGRID Started upload of a62-TONI_HERG79a-15-100-RND2348_0_7
28/03/2010 11:56:32 GPUGRID Finished upload of a62-TONI_HERG79a-15-100-RND2348_0_7
28/03/2010 11:57:20 GPUGRID Sending scheduler request: To fetch work.
28/03/2010 11:57:20 GPUGRID Reporting 1 completed tasks, requesting new tasks for GPU
28/03/2010 11:57:25 GPUGRID Scheduler request completed: got 1 new tasks
28/03/2010 11:57:27 GPUGRID Started download of a449-TONI_HERG79a-15-LICENSE
28/03/2010 11:57:27 GPUGRID Started download of a449-TONI_HERG79a-15-COPYRIGHT
28/03/2010 11:57:29 GPUGRID Finished download of a449-TONI_HERG79a-15-LICENSE
28/03/2010 11:57:29 GPUGRID Finished download of a449-TONI_HERG79a-15-COPYRIGHT
28/03/2010 11:57:29 GPUGRID Started download of a449-TONI_HERG79a-15-a449-TONI_HERG79a-14-100-RND5529_1
28/03/2010 11:57:29 GPUGRID Started download of a449-TONI_HERG79a-15-a449-TONI_HERG79a-14-100-RND5529_2
28/03/2010 11:57:33 GPUGRID Finished download of a449-TONI_HERG79a-15-a449-TONI_HERG79a-14-100-RND5529_1
28/03/2010 11:57:33 GPUGRID Finished download of a449-TONI_HERG79a-15-a449-TONI_HERG79a-14-100-RND5529_2
28/03/2010 11:57:33 GPUGRID Started download of a449-TONI_HERG79a-15-a449-TONI_HERG79a-14-100-RND5529_3
28/03/2010 11:57:33 GPUGRID Started download of a449-TONI_HERG79a-15-pdb_file
28/03/2010 11:57:36 GPUGRID Finished download of a449-TONI_HERG79a-15-a449-TONI_HERG79a-14-100-RND5529_3
28/03/2010 11:57:36 GPUGRID Started download of a449-TONI_HERG79a-15-psf_file
28/03/2010 11:57:37 GPUGRID Finished download of a449-TONI_HERG79a-15-psf_file
28/03/2010 11:57:37 GPUGRID Started download of a449-TONI_HERG79a-15-par_file
28/03/2010 11:57:40 GPUGRID Finished download of a449-TONI_HERG79a-15-pdb_file
28/03/2010 11:57:40 GPUGRID Started download of a449-TONI_HERG79a-15-conf_file_enc
28/03/2010 11:57:41 GPUGRID Finished download of a449-TONI_HERG79a-15-conf_file_enc
28/03/2010 11:57:41 GPUGRID Started download of a449-TONI_HERG79a-15-metainp_file
28/03/2010 11:57:42 GPUGRID Finished download of a449-TONI_HERG79a-15-metainp_file
28/03/2010 11:57:42 GPUGRID Started download of a449-TONI_HERG79a-15-a449-TONI_HERG79a-14-100-RND5529_7
28/03/2010 11:57:43 GPUGRID Finished download of a449-TONI_HERG79a-15-a449-TONI_HERG79a-14-100-RND5529_7
28/03/2010 11:57:52 GPUGRID Finished download of a449-TONI_HERG79a-15-par_file
28/03/2010 11:57:52 GPUGRID Starting a449-TONI_HERG79a-15-100-RND5529_0
28/03/2010 11:57:52 GPUGRID Starting task a449-TONI_HERG79a-15-100-RND5529_0 using acemd2 version 603
28/03/2010 11:58:30 GPUGRID Computation for task a449-TONI_HERG79a-15-100-RND5529_0 finished
28/03/2010 11:58:30 GPUGRID Output file a449-TONI_HERG79a-15-100-RND5529_0_1 for task a449-TONI_HERG79a-15-100-RND5529_0 absent
28/03/2010 11:58:30 GPUGRID Output file a449-TONI_HERG79a-15-100-RND5529_0_2 for task a449-TONI_HERG79a-15-100-RND5529_0 absent
28/03/2010 11:58:30 GPUGRID Output file a449-TONI_HERG79a-15-100-RND5529_0_3 for task a449-TONI_HERG79a-15-100-RND5529_0 absent
28/03/2010 11:58:31 GPUGRID Started upload of a449-TONI_HERG79a-15-100-RND5529_0_0
28/03/2010 11:58:31 GPUGRID Started upload of a449-TONI_HERG79a-15-100-RND5529_0_4
28/03/2010 11:58:33 GPUGRID Finished upload of a449-TONI_HERG79a-15-100-RND5529_0_0
28/03/2010 11:58:33 GPUGRID Finished upload of a449-TONI_HERG79a-15-100-RND5529_0_4
28/03/2010 11:58:33 GPUGRID Started upload of a449-TONI_HERG79a-15-100-RND5529_0_7
28/03/2010 11:58:34 GPUGRID Finished upload of a449-TONI_HERG79a-15-100-RND5529_0_7

Any ideas, anyone?
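For anyone who wants to pull the failures out of a longer log, here is a minimal Python sketch. It assumes the line shape seen in the excerpt above ("Output file <file> for task <task> absent"); the function name `absent_outputs` is made up for this example and is not part of BOINC.

```python
import re
from collections import defaultdict

# Assumed line shape, copied from the log excerpt in this post:
#   <date> <time> GPUGRID Output file <file> for task <task> absent
ABSENT_RE = re.compile(r"Output file (\S+) for task (\S+) absent")


def absent_outputs(log_lines):
    """Map each task name to the output files the client reported as absent."""
    missing = defaultdict(list)
    for line in log_lines:
        match = ABSENT_RE.search(line)
        if match:
            output_file, task = match.groups()
            missing[task].append(output_file)
    return dict(missing)


# A few lines taken from the log above.
sample = [
    "28/03/2010 11:56:29 GPUGRID Computation for task a62-TONI_HERG79a-15-100-RND2348_0 finished",
    "28/03/2010 11:56:29 GPUGRID Output file a62-TONI_HERG79a-15-100-RND2348_0_1 for task a62-TONI_HERG79a-15-100-RND2348_0 absent",
    "28/03/2010 11:56:29 GPUGRID Output file a62-TONI_HERG79a-15-100-RND2348_0_2 for task a62-TONI_HERG79a-15-100-RND2348_0 absent",
    "28/03/2010 11:56:29 GPUGRID Output file a62-TONI_HERG79a-15-100-RND2348_0_3 for task a62-TONI_HERG79a-15-100-RND2348_0 absent",
    "28/03/2010 11:56:30 GPUGRID Started upload of a62-TONI_HERG79a-15-100-RND2348_0_0",
]

for task, files in absent_outputs(sample).items():
    print(task, "is missing", len(files), "output file(s)")
```

A task that finishes within a minute and then shows several absent output files, as both tasks in the log do, is a likely candidate for an early crash rather than a completed run.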
____________

Profile darkstarz1
Message 16036 - Posted: 28 Mar 2010 | 18:36:06 UTC - in response to Message 16024.

...and this, an example from one invalid WU:

<core_client_version>6.10.18</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# There is 1 device supporting CUDA
# Device 0: "GeForce GTX 260"
# Clock rate: 1.51 GHz
# Total amount of global memory: 939524096 bytes
# Number of multiprocessors: 27
# Number of cores: 216
MDIO ERROR: cannot open file "restart.coor"

</stderr_txt>
]]>


Validate state Invalid

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Message 16077 - Posted: 30 Mar 2010 | 9:18:55 UTC - in response to Message 16036.
Last modified: 30 Mar 2010 | 9:19:56 UTC

Does the problem persist after rebooting?
P.S. Moving to the "GPU" thread.

Toni
Message 16078 - Posted: 30 Mar 2010 | 9:23:03 UTC - in response to Message 16077.

Before (working)

# There is 1 device supporting CUDA
# Device 0: "GeForce GTX 260"
# Clock rate: 1.35 GHz

After (not working)

# There is 1 device supporting CUDA
# Device 0: "GeForce GTX 260"
# Clock rate: 1.51 GHz <--------
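The same before/after check can be scripted. This is a minimal Python sketch that assumes the "# Clock rate: X GHz" line format shown in the stderr output in this thread; the helper names and the 0.01 GHz tolerance are arbitrary choices for the example, not anything defined by the application.

```python
import re

# Assumed format, as printed in the stderr_txt excerpts in this thread:
#   # Clock rate: 1.51 GHz
CLOCK_RE = re.compile(r"#\s*Clock rate:\s*([0-9.]+)\s*GHz")


def clock_rate_ghz(stderr_text):
    """Return the reported clock rate in GHz, or None if no such line exists."""
    match = CLOCK_RE.search(stderr_text)
    return float(match.group(1)) if match else None


def clock_changed(before_text, after_text, tolerance=0.01):
    """True if both texts report a clock rate and the two differ by more than tolerance."""
    before = clock_rate_ghz(before_text)
    after = clock_rate_ghz(after_text)
    if before is None or after is None:
        return False
    return abs(after - before) > tolerance


working = '# Device 0: "GeForce GTX 260"\n# Clock rate: 1.35 GHz\n'
failing = '# Device 0: "GeForce GTX 260"\n# Clock rate: 1.51 GHz\n'

if clock_changed(working, failing):
    print("Clock rate changed:", clock_rate_ghz(working), "->", clock_rate_ghz(failing))
```

Running it over the stderr of an older valid result and a recent failed one shows whether the card's reported clock changed between the two.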

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1451
Credit: 3,575,929,351
RAC: 310,964
Message 16080 - Posted: 30 Mar 2010 | 9:29:32 UTC

The TONI_HERG tasks seem to be particularly problematic - see the "hERG: information and issues" thread.

[But since the run of errors I posted there, I have more recently had some successful runs - with no change in the host configuration]

Toni
Message 16081 - Posted: 30 Mar 2010 | 9:32:22 UTC - in response to Message 16080.

He gets errors on all types of WUs. Please check clock rate.

Richard Haselgrove
Message 16083 - Posted: 30 Mar 2010 | 11:15:18 UTC - in response to Message 16081.

"He gets errors on all types of WUs. Please check clock rate."

The clock rate is clearly a problem. But the message log in the OP, and hence the issue which prompted him to post in the first place, is exclusively about TONI_HERG.

I deliberately didn't speculate on the cause of the problem, just pointed out the correlation. From my POV, the jury's still out on whether T_H stresses GPUs more than other tasks, and hence selectively culls the weaker/hotter/badly configured specimens, or whether there's a bug in the application (a code-path which is only followed by particular parameter sets, for example).

Toni
Message 16085 - Posted: 30 Mar 2010 | 12:08:13 UTC - in response to Message 16083.

I didn't mean to be rude. Certain types of WUs may indeed turn out to be more sensitive to a variety of factors (including exposing rare bugs in drivers/hardware combinations, which are close to impossible to spot).

My impression is that, at least since the new application, the global error rate of HERGs is in line with the others.
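To compare error rates by task type on a single host, something along these lines could be used. It is only a sketch: it assumes the task type is the second dash-separated field of the task name with its trailing run tag stripped (so "TONI_HERG79a" becomes "TONI_HERG"), and the records at the bottom are hypothetical placeholders, not real results from this project.

```python
import re
from collections import defaultdict


def task_type(task_name):
    """Guess the task type from a name like a62-TONI_HERG79a-15-100-RND2348_0.

    Assumption: the type is the second dash-separated field, minus a trailing
    number and optional lowercase letter (79a, 77a, ...).
    """
    fields = task_name.split("-")
    if len(fields) < 2:
        return "UNKNOWN"
    return re.sub(r"\d+[a-z]?$", "", fields[1])


def error_rates(records):
    """records: iterable of (task_name, succeeded). Returns {type: failure fraction}."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for name, succeeded in records:
        kind = task_type(name)
        totals[kind] += 1
        if not succeeded:
            failures[kind] += 1
    return {kind: failures[kind] / totals[kind] for kind in totals}


# Hypothetical records for illustration only. The EXAMPLE_OTHER names and all
# of the outcomes are made up; they are not measurements.
records = [
    ("a62-TONI_HERG79a-15-100-RND2348_0", False),
    ("a68-TONI_HERG77a-17-100-RND2481_1", False),
    ("x1-EXAMPLE_OTHER1-0-10-RND0001_0", True),
    ("x2-EXAMPLE_OTHER1-0-10-RND0002_0", False),
]

for kind, rate in error_rates(records).items():
    print(kind, "failure rate:", rate)
```

A handful of results from one host cannot settle the question either way; the project-wide figures are what matter here.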

Richard Haselgrove
Message 16095 - Posted: 31 Mar 2010 | 0:29:25 UTC

I had another task crash on me tonight. Guess which type it was...

a68-TONI_HERG77a-17-100-RND2481_1

In this case, very many thanks (and that's genuine, not sarcastic). The aftermath solved a SETI Beta problem which has been bugging me, and the BOINC Alpha bug-report mailing list, for the last three weeks. I learned something new to me, and I think largely forgotten by the BOINC developers. It's in an area of code which is about to undergo major change: hopefully the write-up I've been able to submit as a result of this crash will enable safeguards to be built into the new code to replace the old ones which will no longer function.
