Message boards : Number crunching : New batch KKi4

Author Message
Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Message 18862 - Posted: 7 Oct 2010 | 17:21:36 UTC
Last modified: 7 Oct 2010 | 17:21:56 UTC

Dear all, this is the continuation of an experiment we'd like to publish soon. The WUs are twice as large as those in the old "CAPBIND*" series.

ftpd
Joined: 6 Jun 08
Posts: 152
Credit: 328,250,382
RAC: 0
Message 18868 - Posted: 8 Oct 2010 | 11:01:51 UTC - in response to Message 18862.

Dear Toni,

I have already downloaded and processed a few of these WUs.
A few were also cancelled within 1 minute.

Is this already known?

Good luck and have a good weekend,


____________
Ton (ftpd) Netherlands

Toni
Message 18870 - Posted: 8 Oct 2010 | 12:58:33 UTC - in response to Message 18868.

There should be nothing new with these WUs (except their length). By "cancelled", do you mean that they failed?

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1620
Credit: 8,893,911,970
RAC: 19,817,833
Message 18871 - Posted: 8 Oct 2010 | 13:18:16 UTC - in response to Message 18870.
Last modified: 8 Oct 2010 | 13:20:21 UTC

I had one this morning which has failed on two different machines so far: http://www.gpugrid.net/workunit.php?wuid=1966290

(Edit: but I've had one successful run, on the same machine, and another is currently at about 60%)

Saenger
Joined: 20 Jul 08
Posts: 134
Credit: 23,657,183
RAC: 0
Message 18872 - Posted: 8 Oct 2010 | 13:29:53 UTC - in response to Message 18827.

A TONI_KK WU is broken as well.
The stderr output is this (on my Linux machine):

stderr out

<core_client_version>6.10.17</core_client_version>
<![CDATA[
<message>
process exited with code 98 (0x62, -158)
</message>
<stderr_txt>
# There is 1 device supporting CUDA
# Device 0: "GeForce GT 240"
# Clock rate: 1.34 GHz
# Total amount of global memory: 536150016 bytes
# Number of multiprocessors: 12
# Number of cores: 96
MDIO ERROR: read error for file "input.coor", byte number 0: expected to read number of atoms
ERROR: file mdioload.cpp line 80: Unable to read bincoordfile

11:16:36 (3686): called boinc_finish

</stderr_txt>
]]>


and this (on the other machine, running Windows):
stderr out

<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
- exit code 98 (0x62)
</message>
<stderr_txt>
# Using device 0
# There is 1 device supporting CUDA
# Device 0: "GeForce GTX 260"
# Clock rate: 1.35 GHz
# Total amount of global memory: 919994368 bytes
# Number of multiprocessors: 27
# Number of cores: 216
MDIO ERROR: read error for file "input.coor", byte number 0: expected to read number of atoms
ERROR: file mdioload.cpp line 80: Unable to read bincoordfile

called boinc_finish

</stderr_txt>
]]>

____________
Gruesse vom Saenger

For questions about Boinc look in the BOINC-Wiki

ftpd
Message 18873 - Posted: 8 Oct 2010 | 13:57:09 UTC - in response to Message 18870.

Dear Toni,

They failed within 1 minute (after 10-15 seconds of processing).


____________
Ton (ftpd) Netherlands

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 18874 - Posted: 8 Oct 2010 | 13:59:49 UTC - in response to Message 18871.

I have finished 4 and my systems are running at least one. Performance is reasonable compared to the other tasks. However, I also got one immediate failure:

Task 3105670
Workunit 1965540
Sent 8 Oct 2010 4:12:47 UTC
Received 8 Oct 2010 8:55:18 UTC
Status Error while computing
Run time 2.48
CPU time 0.95
Claimed credit 0.00
Granted credit ---
Application ACEMD2: GPU molecular dynamics v6.11 (cuda31)

Name f178r2-TONI_KKi4-0-200-RND1238_2
Workunit 1965540
Created 8 Oct 2010 3:32:44 UTC
Sent 8 Oct 2010 4:12:47 UTC
Received 8 Oct 2010 8:55:18 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status 98 (0x62)
Computer ID 71363
Report deadline 13 Oct 2010 4:12:47 UTC
Run time 2.484375
CPU time 0.953125
stderr out

<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
- exit code 98 (0x62)
</message>
<stderr_txt>
# Using device 0
# There are 2 devices supporting CUDA
# Device 0: "GeForce GTX 470"
# Clock rate: 1.43 GHz
# Total amount of global memory: 1341849600 bytes
# Number of multiprocessors: 14
# Number of cores: 112
# Device 1: "GeForce GTX 470"
# Clock rate: 1.43 GHz
# Total amount of global memory: 1341718528 bytes
# Number of multiprocessors: 14
# Number of cores: 112
SWAN: Using synchronization method 0
MDIO ERROR: read error for file "input.coor", byte number 0: expected to read number of atoms
ERROR: file mdioload.cpp line 80: Unable to read bincoordfile

called boinc_finish

</stderr_txt>
]]>

Validate state Invalid
Claimed credit 0.00480620718015305
Granted credit 0
application version ACEMD2: GPU molecular dynamics v6.11 (cuda31)

Toni
Message 18877 - Posted: 8 Oct 2010 | 18:53:23 UTC - in response to Message 18874.

Hi, for those getting "byte number 0: expected to read number of atoms": it must have been a glitch in mass WU creation; let them die. Richard, I think your other failure was on a mobile card.
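
For anyone who wants to screen their own failed tasks for this particular signature, here is a minimal sketch in Python. It assumes you have the task's stderr text as a string; the helper name `is_input_read_failure` and the sample string are purely illustrative and not part of ACEMD or BOINC. Only the error text itself is taken from the logs posted in this thread.

```python
import re

# Signature of the failure reported in this thread: an input file could not
# be read at byte 0 (error text copied from the posted stderr output).
BAD_INPUT_RE = re.compile(
    r'MDIO ERROR: read error for file "(?P<file>[^"]+)", '
    r"byte number (?P<byte>\d+)"
)


def is_input_read_failure(stderr_text):
    """Return True if stderr shows a read error at byte 0 of an input file.

    Hypothetical helper for illustration only; not part of any real tool.
    """
    match = BAD_INPUT_RE.search(stderr_text)
    return match is not None and int(match.group("byte")) == 0


# Sample assembled from the stderr lines quoted earlier in the thread.
sample = (
    '# Device 0: "GeForce GT 240"\n'
    'MDIO ERROR: read error for file "input.coor", byte number 0: '
    "expected to read number of atoms\n"
    "ERROR: file mdioload.cpp line 80: Unable to read bincoordfile\n"
)

print(is_input_read_failure(sample))
```

A task that matches is most likely one of the glitched WUs described above, rather than a problem with the card or driver.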

skgiven
Message 18878 - Posted: 8 Oct 2010 | 19:08:27 UTC - in response to Message 18877.

Thank you to everyone who reported this problem, and thank you, Toni, for letting us know it is just a WU creation glitch.

As these errors are immediate, they will have almost no impact on people's RAC. To date I have only had one such error; most KKi4 WUs run well.

Richard Haselgrove
Message 18890 - Posted: 9 Oct 2010 | 17:49:25 UTC

One of my 9800GTs had a go at h230r2-TONI_KKi4-0-200-RND9586, but unfortunately crashed with an assertion failure at the bitter end, after more than 24 hours of work. C'est la vie.

skgiven
Message 18893 - Posted: 9 Oct 2010 | 20:08:52 UTC - in response to Message 18890.

Richard, you took that blow well.

Toni, perhaps Fermi-only long tasks would go down better; a failure after a few hours is no big deal but after a day it really bites, and not everyone is so understanding.

I've now had 5 failures, but all under 10sec. 16 other KKi4 tasks ran well.

Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Message 18909 - Posted: 10 Oct 2010 | 12:58:34 UTC - in response to Message 18893.

Quoting skgiven (Message 18893): "Toni, perhaps Fermi-only long tasks would go down better; a failure after a few hours is no big deal but after a day it really bites, and not everyone is so understanding. I've now had 5 failures, but all under 10sec. 16 other KKi4 tasks ran well."

My GTX 260/216 runs the TONI_KKi4 WUs well; in fact, it runs everything well. The problem is with my three GT 240 cards. They won't run the TONI_KKi4 WUs, and they don't like the TONI_HERGMETAXDOFE WUs either. They do run KASHIF_HIVPR, TONI_CAPBIND and IBUCH very well, though.

skgiven
Message 18910 - Posted: 10 Oct 2010 | 13:50:32 UTC - in response to Message 18909.

I have had 4 finish on a GT240, and just one that failed after 2.46 sec. Vista x64, all 512MB GDDR5 cards.

Fred J. Verster
Joined: 1 Apr 09
Posts: 58
Credit: 35,833,978
RAC: 0
Message 18914 - Posted: 10 Oct 2010 | 19:42:42 UTC - in response to Message 18910.

Probably one faulty WU, as all hosts have failed on this one?!

That's the only faulty WU (or result?) I've seen so far.
It's meant to stay that way :)



____________

Knight Who Says Ni N!

Fred J. Verster
Message 18918 - Posted: 12 Oct 2010 | 15:34:16 UTC - in response to Message 18914.

I found 2 WUs, computed by 4 hosts, which all failed; 2 still have to report.

h176r1-TONI_KKi4-0-200-RND5770 is giving problems as well!?

The faults I've seen so far all come from the x999y1-TONI_KKi4-0-200-RND5770 batch.

Many others must have noticed this, judging by the number of invalid results.
dynamics v6.05 (cuda), dynamics v6.11 (cuda31) and dynamics v6.06 (cuda30) are all involved, each with "process exited with code 1 (0x1, -255)".
All NVIDIA cards are involved: 240, 250, 470, 480.



____________

Knight Who Says Ni N!

Richard Haselgrove
Message 18925 - Posted: 13 Oct 2010 | 14:01:36 UTC - in response to Message 18890.

Quoting my earlier post (Message 18890): "One of my 9800GTs had a go at h230r2-TONI_KKi4-0-200-RND9586, but unfortunately crashed with an assertion failure at the bitter end, after more than 24 hours of work. C'est la vie."

This 9800GT host really doesn't like KKi4: it has now failed g105r2-TONI_KKi4-6-200-RND6062 with the same

SWAN : FATAL : Failure executing kernel sync [frc_sum_kernel] [700]
Assertion failed: 0, file swanlib_nv.cpp, line 121

error message. At least it only wasted 22 Ksec this time.
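
To compare this kind of failure across tasks, a small sketch along these lines could pull the kernel name and the numeric code out of the SWAN line. The regular expression is an assumption based solely on the single line quoted above, and `parse_swan_failure` is a hypothetical helper name; other SWAN messages may well be formatted differently.

```python
import re

# Pattern inferred from the one SWAN line quoted in this thread; the real
# message format may vary between application versions (assumption).
SWAN_FATAL_RE = re.compile(
    r"SWAN\s*:\s*FATAL\s*:\s*Failure executing kernel sync "
    r"\[(?P<kernel>\w+)\] \[(?P<code>\d+)\]"
)


def parse_swan_failure(stderr_text):
    """Return (kernel_name, error_code) or None if no SWAN failure is found.

    Hypothetical helper for illustration only.
    """
    match = SWAN_FATAL_RE.search(stderr_text)
    if match is None:
        return None
    return match.group("kernel"), int(match.group("code"))


line = "SWAN : FATAL : Failure executing kernel sync [frc_sum_kernel] [700]"
print(parse_swan_failure(line))
```

This only extracts what the log already says; it makes no claim about what the code means or why the kernel failed.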

skgiven
Message 19161 - Posted: 1 Nov 2010 | 1:13:34 UTC - in response to Message 18925.

This WU might be bad:
http://www.gpugrid.net/workunit.php?wuid=2016815
