Message boards : Graphics cards (GPUs) : Computational Error

mscharmack
Joined: 20 Aug 07
Posts: 18
Credit: 1,319,274
RAC: 0
Message 9261 - Posted: 3 May 2009 | 19:02:09 UTC
Last modified: 3 May 2009 | 19:08:55 UTC

51-KASHIF_HIVPR_dim_ba1-2-100-RND4878_1

After 53+ hours of continuous computing, the computer finished the workunit, only to report a "COMPUTATIONAL ERROR" and award a big fat "0" points. Looks like I'm going to have to abort these longer workunits in the future. It's not worth the frustration of a computational error.

Name 51-KASHIF_HIVPR_dim_ba1-2-100-RND4878_1
Workunit 424829
Created 1 May 2009 13:02:02 UTC
Sent 1 May 2009 13:15:56 UTC
Received 3 May 2009 18:51:26 UTC


Server state Over
Outcome Client error
Client state Compute error
Exit status -185 (0xffffffffffffff47)
Computer ID 27410
Report deadline 6 May 2009 13:15:56 UTC
CPU time 1208.969
stderr out <core_client_version>6.4.7</core_client_version>
<![CDATA[
<message>
Can't write init file: -108
</message>
]]>


Validate state Invalid
Claimed credit 8076.97800925926
Granted credit 0
application version 6.64

I know you got some use out of this because it sent in a 51.94 MB completion file.
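
For reference, the two forms of the exit status in the result above are the same number: 0xffffffffffffff47 is -185 written as a 64-bit two's-complement value. A minimal sketch of that conversion in Python follows; the helper names are illustrative and not part of BOINC.

```python
MASK_64 = 0xFFFFFFFFFFFFFFFF  # all 64 bits set


def to_unsigned64(value: int) -> int:
    """Return the 64-bit two's-complement bit pattern of a signed integer."""
    return value & MASK_64


def to_signed64(pattern: int) -> int:
    """Interpret a 64-bit pattern as a signed two's-complement integer."""
    # If the top bit is set, the pattern represents a negative number.
    return pattern - (1 << 64) if pattern & (1 << 63) else pattern


print(hex(to_unsigned64(-185)))         # 0xffffffffffffff47
print(to_signed64(0xFFFFFFFFFFFFFF47))  # -185
```

This only shows that the decimal and hexadecimal values agree; it says nothing about what the code means to the client.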

uBronan
Joined: 1 Feb 09
Posts: 139
Credit: 575,023
RAC: 0
Message 9287 - Posted: 4 May 2009 | 5:03:57 UTC

Looks to me like there is a programming error:

<core_client_version>6.6.26</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# Using CUDA device 0
# Device 0: "GeForce 9600 GT"
# Clock rate: 1674000 kilohertz
# Total amount of global memory: 536870912 bytes
# Number of multiprocessors: 8
# Number of cores: 64
# Amber: readparm : Reading parm file parameters
# PARM file in AMBER 7 format
# Encounter 10-12 H-bond term
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
MDIO ERROR: cannot open file "restart.coor"
Cuda error: Kernel [fft_data_swizzle_out] failed in file 'c:\cygwin\home\speechserver\gpumd2\src\pme\CPME_cufft.cu' in line 61 : unknown error.

</stderr_txt>
]]>

Third in a row that failed.
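
The device header in the stderr output above is easier to compare between hosts once the units are converted: 536870912 bytes is 512 MiB and 1674000 kilohertz is 1.674 GHz. The following is a small sketch that pulls those fields out of the text; the regular expressions are assumptions based only on the lines quoted in this thread, and `parse_device_info` is a hypothetical helper, not part of the GPUGRID application.

```python
import re

# Header lines copied from the stderr output quoted in this thread.
STDERR_SAMPLE = """\
# Using CUDA device 0
# Device 0: "GeForce 9600 GT"
# Clock rate: 1674000 kilohertz
# Total amount of global memory: 536870912 bytes
# Number of multiprocessors: 8
# Number of cores: 64
"""

# Assumed line formats, one capture group per field.
PATTERNS = {
    "name": r'# Device \d+: "([^"]+)"',
    "clock_khz": r"# Clock rate: (\d+) kilohertz",
    "memory_bytes": r"# Total amount of global memory: (\d+) bytes",
    "multiprocessors": r"# Number of multiprocessors: (\d+)",
    "cores": r"# Number of cores: (\d+)",
}


def parse_device_info(stderr_text: str) -> dict:
    """Extract the device fields and add GHz / MiB conversions."""
    info = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, stderr_text)
        if match:
            value = match.group(1)
            info[key] = value if key == "name" else int(value)
    if "clock_khz" in info:
        info["clock_ghz"] = info["clock_khz"] / 1_000_000
    if "memory_bytes" in info:
        info["memory_mib"] = info["memory_bytes"] // (1024 * 1024)
    return info


info = parse_device_info(STDERR_SAMPLE)
print(info["name"], info["memory_mib"], "MiB,", info["clock_ghz"], "GHz")
```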

Zydor
Joined: 8 Feb 09
Posts: 252
Credit: 1,309,451
RAC: 0
Message 9291 - Posted: 4 May 2009 | 10:27:56 UTC - in response to Message 9287.

I just had one dump out on me a couple of minutes ago at the start of processing

http://www.gpugrid.net/workunit.php?wuid=431994

Looking at other threads, others have had this type go bang in the last 24 hrs, so maybe there is a bad one out there? Rare, I know, but it's an inescapable thought: some traditionally "reliable" high-volume crunchers have had one go bang (e.g. Paul). It would be worth digging a little; it seems a bit strange ...

Regards
Zy

Zydor
Joined: 8 Feb 09
Posts: 252
Credit: 1,309,451
RAC: 0
Message 9292 - Posted: 4 May 2009 | 10:46:17 UTC - in response to Message 9291.

Paul

The one I posted above is coming your way - you just downloaded it ..... :)

Regards
Zy

mscharmack
Joined: 20 Aug 07
Posts: 18
Credit: 1,319,274
RAC: 0
Message 9297 - Posted: 4 May 2009 | 13:29:14 UTC

I'd rather have a workunit dump at the beginning than after it has completed its processing and been reported back to the Grid servers. It was a waste of computing power and time.

Zydor
Joined: 8 Feb 09
Posts: 252
Credit: 1,309,451
RAC: 0
Message 9302 - Posted: 4 May 2009 | 17:52:58 UTC - in response to Message 9297.

That's for sure. I don't know about this one; it could have been my error. It will be interesting to see if Paul gets through it.

Regards
Zy

Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Message 9308 - Posted: 4 May 2009 | 19:07:20 UTC - in response to Message 9291.
Last modified: 4 May 2009 | 19:08:02 UTC

I just had one dump out on me a couple of minutes ago at the start of processing

http://www.gpugrid.net/workunit.php?wuid=431994

Um, you are not going to like this ... I am two hours in (2:18) and 18.3% done.

Running just fine on my GTX295 card ... 9:22 hours to go ...

For a small batch run I sure am getting a lot of them ...

Hmmm, I wonder if there is a memory issue?

However I do have this crash on 13-KASHIF_HIVPR_mon_ba3-6-100-RND2474_0

Though there is no real specific error, I got the Incorrect function. (0x1) - exit code 1 (0x1) error. It has already crashed for another person too ...

I don't have as many as I first thought, only about 5 completed, plus the one error and the two in work.
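
As a rough cross-check of the figures in this post: 2:18 elapsed at 18.3% done, extrapolated linearly, suggests about 10:16 still to go, somewhat more than the 9:22 shown by the client. The sketch below assumes progress is linear in time, which need not hold, and the function names are illustrative only; it is not claimed to be how BOINC computes its own estimate.

```python
def remaining_minutes(elapsed_minutes: float, fraction_done: float) -> float:
    """Estimate time left, assuming progress is linear in elapsed time."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in the range (0, 1]")
    total_minutes = elapsed_minutes / fraction_done
    return total_minutes - elapsed_minutes


def format_hm(minutes: float) -> str:
    """Format a duration in minutes as H:MM, rounded to the nearest minute."""
    whole = round(minutes)
    return f"{whole // 60}:{whole % 60:02d}"


elapsed = 2 * 60 + 18  # 2:18 elapsed, as reported above
print(format_hm(remaining_minutes(elapsed, 0.183)))  # 10:16
```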

Zydor
Joined: 8 Feb 09
Posts: 252
Credit: 1,309,451
RAC: 0
Message 9310 - Posted: 4 May 2009 | 20:07:24 UTC - in response to Message 9308.

Interesting, nicely done :)

Hmmmm, I wonder why it went bang for me? It's the first one for a while and all seems OK otherwise, so it's just one of those things at present. Thanks for the heads up, I'll keep my eye open more than usual in case something lurketh.

Regards
Zy

Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Message 9325 - Posted: 5 May 2009 | 10:28:30 UTC

I still wonder if it is not something to do with GPU memory size, mine is nearly twice yours ...

MarkJ
Volunteer moderator
Volunteer tester
Joined: 24 Dec 08
Posts: 738
Credit: 200,909,904
RAC: 0
Message 9329 - Posted: 5 May 2009 | 10:52:01 UTC - in response to Message 9287.

Looks to me like there is a programming error:

<core_client_version>6.6.26</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# Using CUDA device 0
# Device 0: "GeForce 9600 GT"
# Clock rate: 1674000 kilohertz
# Total amount of global memory: 536870912 bytes
# Number of multiprocessors: 8
# Number of cores: 64
# Amber: readparm : Reading parm file parameters
# PARM file in AMBER 7 format
# Encounter 10-12 H-bond term
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
MDIO ERROR: cannot open file "restart.coor"
Cuda error: Kernel [fft_data_swizzle_out] failed in file 'c:\cygwin\home\speechserver\gpumd2\src\pme\CPME_cufft.cu' in line 61 : unknown error.

</stderr_txt>
]]>

Third in a row that failed.


You are showing as running the (beta) 185.81 driver. I had problems with it too (and the swizzle_out error on one wu). I'm now running 182.50 which seems to work.
____________
BOINC blog

Zydor
Joined: 8 Feb 09
Posts: 252
Credit: 1,309,451
RAC: 0
Message 9333 - Posted: 5 May 2009 | 12:55:21 UTC - in response to Message 9325.

I still wonder if it is not something to do with GPU memory size, mine is nearly twice yours ...


Had a quick look at past ones, I have done two other KASHIF_HIVPR WUs.

http://www.gpugrid.net/workunit.php?wuid=414191
http://www.gpugrid.net/workunit.php?wuid=421636

They went through OK. I have no idea whether they were the "same", as such, as the one that went bang. The latter may well have been caused by something I did at the time: the CUDA card runs on my home office main beastie. Normally my activities on it have not been an issue, but they may have been this time.

Just posted the above for completeness in case it throws up anything of interest.

Regards
Zy

[AF] Profanateur
Joined: 25 Oct 08
Posts: 42
Credit: 42,812,268
RAC: 0
Message 9334 - Posted: 5 May 2009 | 13:42:05 UTC
Last modified: 5 May 2009 | 13:43:01 UTC

I usually get these messages and then WU errors:

05/05/2009 06:41:14 GPUGRID Computation for task 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0 finished
05/05/2009 06:41:14 GPUGRID Output file 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0_1 for task 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0 absent
05/05/2009 06:41:14 GPUGRID Output file 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0_2 for task 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0 absent
05/05/2009 06:41:14 GPUGRID Output file 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0_3 for task 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0 absent

This is on driver 185.81, BOINC 6.6.20 and Vista 64. :/
My host: http://www.gpugrid.net/results.php?hostid=31684
It has one GTX 260 and one 8800 GT.

Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Message 9336 - Posted: 5 May 2009 | 16:20:41 UTC

I had my only error ever a couple days ago on a KASHIF WU. They also take too long on slower cards.
Is there a way to set the client not to DL these or do we just have to watch for them?

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 9348 - Posted: 5 May 2009 | 21:22:33 UTC - in response to Message 9334.

I usually get these messages and then WU errors:
05/05/2009 06:41:14 GPUGRID Computation for task 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0 finished
05/05/2009 06:41:14 GPUGRID Output file 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0_1 for task 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0 absent
05/05/2009 06:41:14 GPUGRID Output file 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0_2 for task 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0 absent
05/05/2009 06:41:14 GPUGRID Output file 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0_3 for task 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0 absent

This is on driver 185.81, BOINC 6.6.20 and Vista 64. :/
My host: http://www.gpugrid.net/results.php?hostid=31684
It has one GTX 260 and one 8800 GT.


You have it backwards: actually you get the error first, then the WU is marked as finished, and then BOINC complains about the missing files. I suppose those files are not there because the WU terminated abnormally instead of writing its result files and shutting down gracefully.

Also, my Vista 64 machine is running driver 185.66 and BOINC 6.5.0 without problems. You might want to try this driver.
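
To spot this pattern in a longer message log, the "absent" lines can be grouped by task. The following is a minimal sketch, assuming the log lines look exactly like the ones quoted in this thread; the regular expression and the function name `tasks_with_missing_output` are illustrative assumptions, not a BOINC API.

```python
import re
from collections import defaultdict

# Message log lines copied from the post quoted above.
LOG_SAMPLE = """\
05/05/2009 06:41:14 GPUGRID Computation for task 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0 finished
05/05/2009 06:41:14 GPUGRID Output file 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0_1 for task 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0 absent
05/05/2009 06:41:14 GPUGRID Output file 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0_2 for task 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0 absent
05/05/2009 06:41:14 GPUGRID Output file 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0_3 for task 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0 absent
"""

# Assumed format: "Output file <file> for task <task> absent".
ABSENT_RE = re.compile(r"Output file (\S+) for task (\S+) absent")


def tasks_with_missing_output(log_text: str) -> dict:
    """Map each task name to the list of output files reported absent."""
    missing = defaultdict(list)
    for line in log_text.splitlines():
        match = ABSENT_RE.search(line)
        if match:
            output_file, task = match.groups()
            missing[task].append(output_file)
    return dict(missing)


for task, files in tasks_with_missing_output(LOG_SAMPLE).items():
    print(f"{task}: {len(files)} output file(s) absent")
```

A task listed here has failed before its results were written, so the actual cause has to be looked up in the task's stderr output on the project site rather than in these messages.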

MrS
____________
Scanning for our furry friends since Jan 2002

[AF] Profanateur
Joined: 25 Oct 08
Posts: 42
Credit: 42,812,268
RAC: 0
Message 9355 - Posted: 5 May 2009 | 23:55:37 UTC

So is it a problem with BOINC, or with my drivers?
Thanks.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 9535 - Posted: 9 May 2009 | 13:10:40 UTC - in response to Message 9355.

Neither: this BOINC message tells us nothing except that there was an error.

Apart from this: since 8 May all your WUs have errored out. What did you change? You clocked your 8800GT down, which shouldn't cause these errors. I suspect an upgrade to a new beta driver, which somehow messes things up. You might want to try a proven version like 182.50 or 182.08 and remove the newer one with some driver cleaner. You could also upgrade to BOINC 6.6.23, since it fixed at least one major bug in 6.6.20. You could also try with only one card installed to reduce the number of variables in your config.

MrS
____________
Scanning for our furry friends since Jan 2002
