Computational Error


Author	Message
mscharmack Send message Joined: 20 Aug 07 Posts: 18 Credit: 1,319,274 RAC: 0 Level Scientific publications	Message 9261 - Posted: 3 May 2009 \| 19:02:09 UTC Last modified: 3 May 2009 \| 19:08:55 UTC
	51-KASHIF_HIVPR_dim_ba1-2-100-RND4878_1
	After 53+ hours of continuous computing, the computer finished the workunit, only to report a "COMPUTATIONAL ERROR" with a big fat "0" points awarded. Looks like I'm going to have to abort these longer workunits in the future. It's not worth the frustration of a computational error.
	Name: 51-KASHIF_HIVPR_dim_ba1-2-100-RND4878_1
	Workunit: 424829
	Created: 1 May 2009 13:02:02 UTC
	Sent: 1 May 2009 13:15:56 UTC
	Received: 3 May 2009 18:51:26 UTC
	Server state: Over
	Outcome: Client error
	Client state: Compute error
	Exit status: -185 (0xffffffffffffff47)
	Computer ID: 27410
	Report deadline: 6 May 2009 13:15:56 UTC
	CPU time: 1208.969
	stderr out:
	<core_client_version>6.4.7</core_client_version>
	<![CDATA[
	<message>
	Can't write init file: -108
	</message>
	]]>
	Validate state: Invalid
	Claimed credit: 8076.97800925926
	Granted credit: 0
	Application version: 6.64
	I know you got some use out of this because it sent in a 51.94 MB completion file.
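The two forms of the exit status quoted above, -185 and 0xffffffffffffff47, are the same number: the hex value is the two's-complement encoding of the signed one. The sketch below shows the conversion in both directions. The 64-bit width is an assumption inferred from the 16 hex digits, and the function names are hypothetical, not part of BOINC.

```python
def exit_status_as_hex(status: int, bits: int = 64) -> str:
    """Encode a signed exit status as unsigned two's-complement hex."""
    mask = (1 << bits) - 1  # assumed width: 64 bits (16 hex digits)
    return hex(status & mask)


def hex_as_exit_status(text: str, bits: int = 64) -> int:
    """Decode an unsigned hex string back into a signed exit status."""
    value = int(text, 16)
    if value >= 1 << (bits - 1):  # top bit set means the value is negative
        value -= 1 << bits
    return value


print(exit_status_as_hex(-185))
print(hex_as_exit_status("0xffffffffffffff47"))
```

Decoding the hex form first makes it easier to search for the error number, since error lists usually give the signed decimal value.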

uBronan Send message Joined: 1 Feb 09 Posts: 139 Credit: 575,023 RAC: 0 Level Scientific publications	Message 9287 - Posted: 4 May 2009 \| 5:03:57 UTC
	Looks to me like a programming error:
	<core_client_version>6.6.26</core_client_version>
	<![CDATA[
	<message>
	Incorrect function. (0x1) - exit code 1 (0x1)
	</message>
	<stderr_txt>
	# Using CUDA device 0
	# Device 0: "GeForce 9600 GT"
	# Clock rate: 1674000 kilohertz
	# Total amount of global memory: 536870912 bytes
	# Number of multiprocessors: 8
	# Number of cores: 64
	# Amber: readparm : Reading parm file parameters
	# PARM file in AMBER 7 format
	# Encounter 10-12 H-bond term
	WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
	WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
	MDIO ERROR: cannot open file "restart.coor"
	Cuda error: Kernel [fft_data_swizzle_out] failed in file 'c:\cygwin\home\speechserver\gpumd2\src\pme\CPME_cufft.cu' in line 61 : unknown error.
	</stderr_txt>
	]]>
	That is the third in a row that failed.
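When comparing failures across hosts, it can help to reduce a stderr dump like the one above to a few fields: card, memory, clock and the kernel that failed. The following is a minimal sketch, assuming the line formats stay exactly as they appear in this post. The `summarize_stderr` helper and its field names are hypothetical, and the sample text is copied from the post rather than from any documented format.

```python
import re

# Sample lines copied from the stderr output quoted in this thread.
STDERR_SAMPLE = """\
# Using CUDA device 0
# Device 0: "GeForce 9600 GT"
# Clock rate: 1674000 kilohertz
# Total amount of global memory: 536870912 bytes
# Number of multiprocessors: 8
# Number of cores: 64
MDIO ERROR: cannot open file "restart.coor"
Cuda error: Kernel [fft_data_swizzle_out] failed in file 'c:\\cygwin\\home\\speechserver\\gpumd2\\src\\pme\\CPME_cufft.cu' in line 61 : unknown error.
"""


def summarize_stderr(text: str) -> dict:
    """Pull device details and the failing kernel out of a stderr dump."""
    summary = {}
    device = re.search(r'# Device \d+: "([^"]+)"', text)
    if device:
        summary["device"] = device.group(1)
    clock = re.search(r"# Clock rate: (\d+) kilohertz", text)
    if clock:
        summary["clock_mhz"] = int(clock.group(1)) / 1000
    memory = re.search(r"# Total amount of global memory: (\d+) bytes", text)
    if memory:
        summary["memory_mib"] = int(memory.group(1)) // 2**20
    failure = re.search(
        r"Cuda error: Kernel \[([^\]]+)\] failed in file '([^']+)' in line (\d+)",
        text,
    )
    if failure:
        summary["failed_kernel"] = failure.group(1)
        summary["source_line"] = int(failure.group(3))
    return summary


print(summarize_stderr(STDERR_SAMPLE))
```

For this post the summary reports a 512 MiB card, which is the figure relevant to the memory-size question raised later in the thread.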

Zydor Send message Joined: 8 Feb 09 Posts: 252 Credit: 1,309,451 RAC: 0 Level Scientific publications	Message 9291 - Posted: 4 May 2009 \| 10:27:56 UTC - in response to Message 9287.
	I just had one dump out on me a couple of minutes ago at the start of processing: http://www.gpugrid.net/workunit.php?wuid=431994
	Looking at other threads, others have had this type go bang in the last 24 hrs, so maybe there is a bad one out there?? Rare I know, but it's an inescapable thought - some traditionally "reliable" high-volume crunchers have had one go bang (e.g. Paul) - would be worth digging a little, it seems a bit strange ....
	Regards
	Zy

Zydor Send message Joined: 8 Feb 09 Posts: 252 Credit: 1,309,451 RAC: 0 Level Scientific publications	Message 9292 - Posted: 4 May 2009 \| 10:46:17 UTC - in response to Message 9291.
	Paul, the one I posted above is coming your way - you just downloaded it ..... :)
	Regards
	Zy

mscharmack Send message Joined: 20 Aug 07 Posts: 18 Credit: 1,319,274 RAC: 0 Level Scientific publications	Message 9297 - Posted: 4 May 2009 \| 13:29:14 UTC
	I'd rather have a workunit dump at the beginning than after it has completed its processing and been reported back to the Grid servers. It was a waste of computing power and time.

Zydor Send message Joined: 8 Feb 09 Posts: 252 Credit: 1,309,451 RAC: 0 Level Scientific publications	Message 9302 - Posted: 4 May 2009 \| 17:52:58 UTC - in response to Message 9297.
	That's for sure - don't know about this one, could have been my error, will be interesting to see if Paul gets through it.
	Regards
	Zy

Paul D. Buck Send message Joined: 9 Jun 08 Posts: 1050 Credit: 37,321,185 RAC: 0 Level Scientific publications	Message 9308 - Posted: 4 May 2009 \| 19:07:20 UTC - in response to Message 9291. Last modified: 4 May 2009 \| 19:08:02 UTC
	> I just had one dump out on me a couple of minutes ago at the start of processing http://www.gpugrid.net/workunit.php?wuid=431994
	Um, you are not going to like this ... I am two hours in (2:18) and 18.3% done. Running just fine on my GTX295 card ... 9:22 hours to go ...
	For a small batch run I sure am getting a lot of them ...
	Hmmm, I wonder if there is a memory issue?
	However, I do have this crash on 13-KASHIF_HIVPR_mon_ba3-6-100-RND2474_0. Though there is no real specific error, I got the "Incorrect function. (0x1) - exit code 1 (0x1)" error. It has already crashed for another person too ...
	I don't have as many as I first thought, only about 5 completed, plus the one error and the two in work.

Zydor Send message Joined: 8 Feb 09 Posts: 252 Credit: 1,309,451 RAC: 0 Level Scientific publications	Message 9310 - Posted: 4 May 2009 \| 20:07:24 UTC - in response to Message 9308.
	Interesting, nicely done :) Hmmmm wonder why it went bang for me ? First one for a while, all seems ok, one of those things at present. Thanks for the heads up, I'll keep my eye open more than usual in case something lurketh. Regards Zy

Paul D. Buck Send message Joined: 9 Jun 08 Posts: 1050 Credit: 37,321,185 RAC: 0 Level Scientific publications	Message 9325 - Posted: 5 May 2009 \| 10:28:30 UTC
	I still wonder if it is not something to do with GPU memory size, mine is nearly twice yours ...

MarkJ Volunteer moderator Volunteer tester Send message Joined: 24 Dec 08 Posts: 738 Credit: 200,909,904 RAC: 0 Level Scientific publications	Message 9329 - Posted: 5 May 2009 \| 10:52:01 UTC - in response to Message 9287.
	> Looks to me like a programming error:
	> <core_client_version>6.6.26</core_client_version>
	> <![CDATA[
	> <message>
	> Incorrect function. (0x1) - exit code 1 (0x1)
	> </message>
	> <stderr_txt>
	> # Using CUDA device 0
	> # Device 0: "GeForce 9600 GT"
	> # Clock rate: 1674000 kilohertz
	> # Total amount of global memory: 536870912 bytes
	> # Number of multiprocessors: 8
	> # Number of cores: 64
	> # Amber: readparm : Reading parm file parameters
	> # PARM file in AMBER 7 format
	> # Encounter 10-12 H-bond term
	> WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
	> WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
	> MDIO ERROR: cannot open file "restart.coor"
	> Cuda error: Kernel [fft_data_swizzle_out] failed in file 'c:\cygwin\home\speechserver\gpumd2\src\pme\CPME_cufft.cu' in line 61 : unknown error.
	> </stderr_txt>
	> ]]>
	> That is the third in a row that failed.
	You are showing as running the (beta) 185.81 driver. I had problems with it too (and the swizzle_out error on one WU). I'm now running 182.50, which seems to work.
	____________
	BOINC blog

Zydor Send message Joined: 8 Feb 09 Posts: 252 Credit: 1,309,451 RAC: 0 Level Scientific publications	Message 9333 - Posted: 5 May 2009 \| 12:55:21 UTC - in response to Message 9325.
	> I still wonder if it is not something to do with GPU memory size, mine is nearly twice yours ...
	Had a quick look at past ones; I have done two other KASHIF_HIVPR WUs:
	http://www.gpugrid.net/workunit.php?wuid=414191
	http://www.gpugrid.net/workunit.php?wuid=421636
	They went through OK. No idea if they were the "same" as the one that went bang. The latter may well have been caused by something I did at the time: the CUDA card runs on my Home Office main beastie. Normally my activities on it have not been an issue, but they may have been this time.
	Just posted the above for completeness in case it throws up anything of interest.
	Regards
	Zy

[AF] Profanateur Send message Joined: 25 Oct 08 Posts: 42 Credit: 42,812,268 RAC: 0 Level Scientific publications	Message 9334 - Posted: 5 May 2009 \| 13:42:05 UTC Last modified: 5 May 2009 \| 13:43:01 UTC
	I usually get these messages, followed by WU errors:
	05/05/2009 06:41:14 GPUGRID Computation for task 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0 finished
	05/05/2009 06:41:14 GPUGRID Output file 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0_1 for task 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0 absent
	05/05/2009 06:41:14 GPUGRID Output file 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0_2 for task 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0 absent
	05/05/2009 06:41:14 GPUGRID Output file 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0_3 for task 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0 absent
	This is on driver 185.81, BOINC 6.6.20 and Vista 64. :/
	My host: http://www.gpugrid.net/results.php?hostid=31684 with 1 GTX 260 and 1 8800 GT.

Beyond Send message Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0 Level Scientific publications	Message 9336 - Posted: 5 May 2009 \| 16:20:41 UTC
	I had my only error ever a couple days ago on a KASHIF WU. They also take too long on slower cards. Is there a way to set the client not to DL these or do we just have to watch for them?

ExtraTerrestrial Apes Volunteer moderator Volunteer tester Send message Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 Level Scientific publications	Message 9348 - Posted: 5 May 2009 \| 21:22:33 UTC - in response to Message 9334.
	> I usually get these messages, followed by WU errors:
	> 05/05/2009 06:41:14 GPUGRID Computation for task 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0 finished
	> 05/05/2009 06:41:14 GPUGRID Output file 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0_1 for task 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0 absent
	> 05/05/2009 06:41:14 GPUGRID Output file 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0_2 for task 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0 absent
	> 05/05/2009 06:41:14 GPUGRID Output file 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0_3 for task 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0 absent
	> This is on driver 185.81, BOINC 6.6.20 and Vista 64. :/
	> My host: http://www.gpugrid.net/results.php?hostid=31684 with 1 GTX 260 and 1 8800 GT.
	You have it backwards: you actually get the error first, then the WU is marked as finished, and then BOINC complains about the missing files. Those, I suppose, are not there because the WU was terminated abnormally instead of writing its result files and shutting down gracefully.
	And my Vista 64 machine is running 185.66 and 6.5.0 without problems. You might want to try this driver.
	MrS
	____________
	Scanning for our furry friends since Jan 2002
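Since the "absent" lines are a symptom rather than the cause, one way to use them is simply to list which tasks lost their output files, then look up the real error in each task's stderr. The sketch below groups the missing files by task. It assumes the message wording is exactly as quoted in this thread, and `missing_outputs` is a hypothetical helper, not a BOINC tool.

```python
import re
from collections import defaultdict

# Sample lines copied from the BOINC messages quoted in this thread.
LOG_SAMPLE = """\
05/05/2009 06:41:14 GPUGRID Computation for task 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0 finished
05/05/2009 06:41:14 GPUGRID Output file 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0_1 for task 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0 absent
05/05/2009 06:41:14 GPUGRID Output file 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0_2 for task 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0 absent
05/05/2009 06:41:14 GPUGRID Output file 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0_3 for task 42-TONI_HIVPR_mon_ba25-4-100-RND6738_0 absent
"""

# Assumed wording: "Output file <file> for task <task> absent".
ABSENT = re.compile(r"Output file (\S+) for task (\S+) absent")


def missing_outputs(log_text: str) -> dict:
    """Map each task name to the output files reported as absent."""
    missing = defaultdict(list)
    for line in log_text.splitlines():
        match = ABSENT.search(line)
        if match:
            missing[match.group(2)].append(match.group(1))
    return dict(missing)


for task, files in missing_outputs(LOG_SAMPLE).items():
    print(task, "is missing", len(files), "output files")
```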

[AF] Profanateur Send message Joined: 25 Oct 08 Posts: 42 Credit: 42,812,268 RAC: 0 Level Scientific publications	Message 9355 - Posted: 5 May 2009 \| 23:55:37 UTC
	So is it a problem with BOINC, or with my drivers? Thanks.

ExtraTerrestrial Apes Volunteer moderator Volunteer tester Send message Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 Level Scientific publications	Message 9535 - Posted: 9 May 2009 \| 13:10:40 UTC - in response to Message 9355.
	Neither: this BOINC message tells us nothing except that there was an error.
	Apart from this: since the 8th of May all your WUs have errored out. What did you change? You clocked your 8800GT down, which shouldn't cause these errors. I suspect an upgrade to a new beta driver, which somehow messes things up. You might want to try a proven version like 182.50 or 182.08 and remove the newer one with a driver cleaner. You could also upgrade to BOINC 6.6.23, since it fixed at least one major bug in 6.6.20. You could also try with only one card installed to reduce the number of variables in your config.
	MrS
	____________
	Scanning for our furry friends since Jan 2002
