Message boards : Graphics cards (GPUs) : 2 Jobs fail consecutively with an error I have never encountered before

Author Message
schizo1988
Joined: 16 Dec 08
Posts: 16
Credit: 10,644,256
RAC: 0
Message 9381 - Posted: 6 May 2009 | 16:33:17 UTC

My last 2 jobs failed and I don't want this to continue, but I don't understand the error message or what, if anything, I can do to prevent it. Fortunately one failed fairly early on, but the other was almost complete. Any advice would be welcome. Below is one of the messages; both were the same.

Thanks

<core_client_version>6.6.20</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# Using CUDA device 0
# Device 0: "GeForce GTX 260"
# Clock rate: 1548610 kilohertz
# Total amount of global memory: 939524096 bytes
# Number of multiprocessors: 27
# Number of cores: 216
# Amber: readparm : Reading parm file parameters
# PARM file in AMBER 7 format
# Encounter 10-12 H-bond term
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
MDIO ERROR: cannot open file "restart.coor"
# Using CUDA device 0
# Device 0: "GeForce GTX 260"
# Clock rate: 1548610 kilohertz
# Total amount of global memory: 939524096 bytes
# Number of multiprocessors: 27
# Number of cores: 216
# Amber: readparm : Reading parm file parameters
# PARM file in AMBER 7 format
# Encounter 10-12 H-bond term
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
MDIO ERROR: cannot open file "restart.coor"
Cuda error: Kernel [shake_step_1] failed in file 'shake.cu' in line 79 : unknown error.

</stderr_txt>
]]>

X1900AIW
Joined: 12 Sep 08
Posts: 74
Credit: 23,566,124
RAC: 0
Message 9386 - Posted: 6 May 2009 | 19:14:31 UTC - in response to Message 9381.

# Clock rate: 1548610 kilohertz

Your OC settings changed; your last successful workunits ran at a lower shader clock rate: # Clock rate: 1537627 kilohertz. Perhaps you changed just the VRAM or other settings, which caused the instability on its own.
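For anyone who wants to make the same comparison on their own results, here is a minimal sketch that pulls the reported clock rates out of a task's stderr text. The function name `clock_rates_khz` is made up for illustration; the only thing taken from the logs in this thread is the `# Clock rate: N kilohertz` line format.

```python
import re

# Matches the "# Clock rate: N kilohertz" lines as they appear in the logs above.
CLOCK_RE = re.compile(r"^# Clock rate: (\d+) kilohertz\s*$", re.MULTILINE)


def clock_rates_khz(stderr_text):
    """Return the distinct clock rates (in kHz) reported in a stderr log, sorted."""
    return sorted({int(khz) for khz in CLOCK_RE.findall(stderr_text)})


# Excerpt from the failed task, plus the value quoted from an earlier good task.
failed_log = """# Using CUDA device 0
# Device 0: "GeForce GTX 260"
# Clock rate: 1548610 kilohertz
# Total amount of global memory: 939524096 bytes
"""
earlier_log = "# Clock rate: 1537627 kilohertz\n"

failed = clock_rates_khz(failed_log)
earlier = clock_rates_khz(earlier_log)
print("failed task:", failed, "earlier task:", earlier)
print("clock changed:", failed != earlier)
```

If the two lists differ between a good and a bad task, a changed overclock is worth ruling out first.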

schizo1988
Joined: 16 Dec 08
Posts: 16
Credit: 10,644,256
RAC: 0
Message 9387 - Posted: 6 May 2009 | 19:14:38 UTC - in response to Message 9381.
Last modified: 6 May 2009 | 19:22:23 UTC

It has now become 5 jobs in a row that have failed. I will try removing the overclock, but this card has been running for months. Of course, I did recently upgrade from a beta of Windows 7 to the Release Candidate.

Now I am more confused, as the last job on my i7 machine with dual 295s failed as well.

Dieter Matuschek
Joined: 28 Dec 08
Posts: 58
Credit: 231,884,297
RAC: 0
Message 9389 - Posted: 6 May 2009 | 19:40:21 UTC - in response to Message 9381.
Last modified: 6 May 2009 | 20:18:00 UTC

# Encounter 10-12 H-bond term

Doesn't this mean that the WU runs into a time limit?


I too have a problem with current WUs on one of my computers with a 9800 GTX+.
They are running way too slow.

Today I've aborted WU 631873 after 16 hours @ progress of 17%.
Now on the same card there is WU 636672 @ 0.479% after 1 hour 21 minutes!


I wonder whether this video card is damaged. But perhaps it is a problem with these WUs ...
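To put numbers on "way too slow", here is a rough extrapolation from the two progress readings above. It assumes progress is linear in time, which may not hold if the WU is hanging; the helper name `projected_total_hours` is just for illustration.

```python
def projected_total_hours(elapsed_hours, progress_percent):
    """Extrapolate total runtime, assuming progress grows linearly with time."""
    return elapsed_hours * 100.0 / progress_percent


# WU 631873: 17% after 16 hours.
wu_631873 = projected_total_hours(16, 17)
# WU 636672: 0.479% after 1 hour 21 minutes.
wu_636672 = projected_total_hours(1 + 21 / 60, 0.479)

print(f"WU 631873: about {wu_631873:.0f} hours in total")
print(f"WU 636672: about {wu_636672:.0f} hours in total")
```

Under that assumption the first WU would need roughly 94 hours and the second roughly 282 hours.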


EDIT

Problem identified:
These WUs sometimes 'hang'. Restarting gets past these 'hangs'.
____________

K1atOdessa
Joined: 25 Feb 08
Posts: 249
Credit: 370,320,941
RAC: 0
Message 9393 - Posted: 6 May 2009 | 19:57:39 UTC

I've had an "IBUCH_KID" WU error out recently, as have several others. Anyone able to process these without issue?

schizo1988
Joined: 16 Dec 08
Posts: 16
Credit: 10,644,256
RAC: 0
Message 9395 - Posted: 6 May 2009 | 20:20:30 UTC - in response to Message 9393.



This one was valid, but it lists warnings about the same things that caused the others to fail before. It is unfortunate that you only see the warnings after the job finishes, so you can never use them to avoid an invalid result.

<core_client_version>6.6.20</core_client_version>
<![CDATA[
<stderr_txt>
# Using CUDA device 2
# Device 0: "GeForce GTX 295"
# Clock rate: 1512000 kilohertz
# Total amount of global memory: 939261952 bytes
# Number of multiprocessors: 30
# Number of cores: 240
# Device 1: "GeForce GTX 295"
# Clock rate: 1512000 kilohertz
# Total amount of global memory: 939196416 bytes
# Number of multiprocessors: 30
# Number of cores: 240
# Device 2: "GeForce GTX 295"
# Clock rate: 1512000 kilohertz
# Total amount of global memory: 939261952 bytes
# Number of multiprocessors: 30
# Number of cores: 240
# Device 3: "GeForce GTX 295"
# Clock rate: 1512000 kilohertz
# Total amount of global memory: 939261952 bytes
# Number of multiprocessors: 30
# Number of cores: 240
# Amber: readparm : Reading parm file parameters
# PARM file in AMBER 7 format
# Encounter 10-12 H-bond term
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
MDIO ERROR: cannot open file "restart.coor"
# Using CUDA device 2
# Device 0: "GeForce GTX 295"
# Clock rate: 1512000 kilohertz
# Total amount of global memory: 939261952 bytes
# Number of multiprocessors: 30
# Number of cores: 240
# Device 1: "GeForce GTX 295"
# Clock rate: 1512000 kilohertz
# Total amount of global memory: 939196416 bytes
# Number of multiprocessors: 30
# Number of cores: 240
# Device 2: "GeForce GTX 295"
# Clock rate: 1512000 kilohertz
# Total amount of global memory: 939261952 bytes
# Number of multiprocessors: 30
# Number of cores: 240
# Device 3: "GeForce GTX 295"
# Clock rate: 1512000 kilohertz
# Total amount of global memory: 939261952 bytes
# Number of multiprocessors: 30
# Number of cores: 240
# Amber: readparm : Reading parm file parameters
# PARM file in AMBER 7 format
# Encounter 10-12 H-bond term
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
# Time per step: 35.888 ms
# Approximate elapsed time for entire WU: 17943.750 s
called boinc_finish

</stderr_txt>
]]>
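Comparing this log with the failed one above, the `WARNING:` and `MDIO ERROR` lines appear in both, while only the failed one has a `Cuda error:` line and only this one ends with `called boinc_finish`. A small sketch of that comparison follows; the function name `summarize_stderr` and the idea of treating these three markers as a heuristic are my own assumptions, not anything the application documents.

```python
def summarize_stderr(stderr_text):
    """Count warnings, collect CUDA errors, and check for a clean finish marker."""
    lines = [line.strip() for line in stderr_text.splitlines()]
    return {
        "warnings": sum(line.startswith("WARNING:") for line in lines),
        "cuda_errors": [line for line in lines if line.startswith("Cuda error:")],
        "finished": "called boinc_finish" in lines,
    }


# Shortened excerpts of the two logs posted in this thread.
failed_log = """WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
MDIO ERROR: cannot open file "restart.coor"
Cuda error: Kernel [shake_step_1] failed in file 'shake.cu' in line 79 : unknown error.
"""
valid_log = """WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
MDIO ERROR: cannot open file "restart.coor"
# Time per step: 35.888 ms
called boinc_finish
"""

failed_summary = summarize_stderr(failed_log)
valid_summary = summarize_stderr(valid_log)
print(failed_summary)
print(valid_summary)
```

On these excerpts the warning count is the same for both tasks, so the warnings alone do not separate a good run from a bad one.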

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 9539 - Posted: 9 May 2009 | 13:44:13 UTC

- these warnings about "H-bond terms" are nothing you need to worry about, it's the science (hydrogen-something bonds, possibly to carbon atoms, which are now calculated by a new method)

- schizo, your OC is quite high. Also remember that the shader clock is actually changed in discrete steps of 54 MHz, you either get 1512 or 1566 MHz. Setting 1548 instead of 1537 likely pushed you over the threshold for a real clock of 1566 MHz, hence the failures.

- since then you have completed 2 at a setting of 1470 MHz just fine.

- Klat, this is not related to the "IBUCH_KID"-issue (as you surely already noticed, since you also posted in the corresponding thread)

- Dieter, you're another victim of 6.6.20. Upgrade to 6.6.23 or downgrade to 6.5.0 or 6.4.7 to fix this problem.
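The point about discrete clock steps can be sketched as follows. This assumes the real shader clock is simply the multiple of 54 MHz nearest to the requested value; the exact rounding rule the driver uses is not stated here, and `nearest_real_clock` is a made-up helper.

```python
STEP_MHZ = 54  # step size of the shader clock, as stated above


def nearest_real_clock(requested_mhz, step_mhz=STEP_MHZ):
    """Snap a requested clock to the nearest multiple of the step (assumed rule)."""
    return step_mhz * round(requested_mhz / step_mhz)


for requested in (1470, 1537, 1548):
    print(requested, "MHz requested ->", nearest_real_clock(requested), "MHz real")
```

Under that assumption 1537 MHz lands on 1512 MHz while 1548 MHz lands on 1566 MHz, so an 11 MHz change in the setting becomes a 54 MHz change on the card.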

MrS
____________
Scanning for our furry friends since Jan 2002
