Message boards : Graphics cards (GPUs) : Encounter 10-12 H-bond term == Client error 0x1 ?

Hydropower
Joined: 3 Apr 09
Posts: 70
Credit: 6,003,024
RAC: 0
Level
Ser
Message 10439 - Posted: 6 Jun 2009 | 22:12:01 UTC

Virtually all my currently failing jobs have this "Found zero 10-12 H-bond term" warning. I have examined other people's results and, more than once, the 'buddies' errored out as well. Is it known what side effects this "Found zero 10-12" warning has? Is it being investigated? One job (519558) had an out-of-memory error and was terminated by XP. I had disabled my 'faulty' GPU3, so this is NOT the 'faulty' one. My errors have occurred over several GPUs today. Do we have a GPU testing program?
____________
Join team Bletchley Park, the innovators.

Ulf Ohlsson
Joined: 1 Jan 09
Posts: 20
Credit: 616,384
RAC: 0
Level
Gly
Message 10440 - Posted: 7 Jun 2009 | 4:28:47 UTC

I'm running on CUDA device: GeForce 9800 GTX/9800 GTX+ (driver version 18608, compute capability 1.1, 1024MB, est. 85GFLOPS)
and have exactly the same problem.
The OS is Vista 64.

Only 5% of the WUs complete normally.

Neil A
Joined: 9 Oct 08
Posts: 50
Credit: 12,676,739
RAC: 0
Level
Pro
Message 10445 - Posted: 7 Jun 2009 | 19:49:25 UTC - in response to Message 10440.

I have been experiencing this type of symptom for quite a while on one of my computers, which has 2x GTX 260 Core 216 SC. Backing off the clocks doesn't seem to have helped, nor has reloading, downgrading or upgrading the driver.

I welcome a solution....
____________
Crunching for the benefit of humanity and in memory of my dad and other family members.

MarkJ
Volunteer moderator
Volunteer tester
Joined: 24 Dec 08
Posts: 738
Credit: 200,909,904
RAC: 0
Level
Leu
Message 10447 - Posted: 8 Jun 2009 | 0:52:07 UTC

I have a number of machines: 4 quaddies with GTS250s and an i7 with dual GTX260+. I run WinXP on them. Nothing is overclocked.

I found that WUs on the GTS250s fail unless I run the 182.50 drivers. Even after the so-called "workaround" from the project team they still failed. The GTX260s also seemed to fail, though less often, when I was running the 185.85 drivers. I downgraded to 182.50 and that seemed to resolve the issue.

Also it seems that you have to uninstall the old drivers before installing new ones. I use Control Panel -> Add/Remove programs to uninstall.

I am running BOINC 6.6.28 on 3 of the quaddies and 6.6.33 on the other quaddie and the i7. There is a known bug with 6.6 (up to and including 28) to do with preempting tasks. 6.6.33 won't shut down the science apps on exit, but you can use Advanced -> Shutdown connected client and then exit.
____________
BOINC blog

GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 14 Mar 07
Posts: 1947
Credit: 629,356
RAC: 0
Level
Gly
Message 10448 - Posted: 8 Jun 2009 | 7:27:42 UTC - in response to Message 10447.

Please check your driver version. It is possible that very new drivers have problems. You should use the one suggested by Nvidia for CUDA 2.1 unless you have a reason to use another one (for instance, a game requiring a new driver).
See the join section. The 181.xx drivers are stable.

gdf

Ulf Ohlsson
Joined: 1 Jan 09
Posts: 20
Credit: 616,384
RAC: 0
Level
Gly
Message 10457 - Posted: 8 Jun 2009 | 23:52:10 UTC - in response to Message 10448.
Last modified: 9 Jun 2009 | 0:00:36 UTC

I tried every version of the driver and still get the same errors; only a few WUs finish correctly.

I switched to running some WUs for SETI beta and have finished 40 WUs in the last 20 hours. None of them had any errors at all.

Perhaps it would be possible to get some statistics on how many of the returned WUs have errors over, say, two months. If this shows a rising graph of corrupted WUs, there might be some errors on the server side.
Perhaps info about OS and CPUs could be added as well.
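
To make the statistics idea concrete, here is a minimal sketch in Python of how a failure rate per week could be tallied. It assumes a hypothetical list of (date, succeeded) records; how the project actually stores results is not shown in this thread, and the sample data below is made up purely for illustration.

```python
from collections import defaultdict
from datetime import date


def weekly_error_rate(results):
    """Return {(iso_year, iso_week): fraction of failed WUs} for the given records.

    `results` is an iterable of (date, succeeded) pairs. This record layout is
    an assumption for illustration, not the project's real database schema.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for day, succeeded in results:
        iso = day.isocalendar()
        key = (iso[0], iso[1])  # ISO year and ISO week number
        totals[key] += 1
        if not succeeded:
            failures[key] += 1
    return {key: failures[key] / totals[key] for key in sorted(totals)}


# Made-up sample records, not real project results.
sample = [
    (date(2009, 6, 1), True),
    (date(2009, 6, 2), False),
    (date(2009, 6, 8), False),
    (date(2009, 6, 9), False),
]
rates = weekly_error_rate(sample)
for week, rate in rates.items():
    print(week, f"{rate:.0%}")
```

A rate that climbs from week to week across many hosts would point towards the server side or the application; a rate that is high on only a few hosts would point towards those machines. Adding OS or GPU model to the key would allow the same breakdown per platform.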
____________

SkyeHunter
Joined: 7 Mar 09
Posts: 12
Credit: 1,254,285
RAC: 0
Level
Ala
Message 10459 - Posted: 9 Jun 2009 | 9:13:04 UTC

A bit less than 2 weeks ago I had this kind of error almost constantly. Backing down the GPU clock (including the GPU memory clock) resolved the issue. With one exception, everything GPU-Grid has thrown at that system since has run without a glitch, although a bit slower.

jrobbio
Joined: 13 Mar 09
Posts: 59
Credit: 324,366
RAC: 0
Level

Message 10460 - Posted: 9 Jun 2009 | 10:21:26 UTC - in response to Message 10448.

GDF wrote:
Please check your driver version. It is possible that very new drivers have problems. You should use the one suggested by Nvidia for CUDA 2.1 unless you have a reason to use another one (for instance a game requiring a new driver).
See the join section. Driver 181.xx are stable.

gdf


I thought that we were going to be moving up to CUDA 2.2 or did the error on Nvidia's part put a stop to that?

I always receive this message on my results, but they don't error out.

Rob

Hydropower
Joined: 3 Apr 09
Posts: 70
Credit: 6,003,024
RAC: 0
Level
Ser
Message 10468 - Posted: 9 Jun 2009 | 22:22:40 UTC

I returned all cards to stock speeds but have not crunched since. I did RMA my GPU3 (later GPU5 after swapping slots) card. It showed a hardware failure on one test with OCCT.

I still think there should be a testing / validation program for the shader processors. I found there is a bug in the memtestg80 program. It cannot run the same test on the same GPU twice in a row: the memory allocation always fails, and after a little while it fails with an 'unknown error'. This sounds familiar. Installing newer drivers should rule out driver errors on that one.
____________
Join team Bletchley Park, the innovators.

Ulf Ohlsson
Joined: 1 Jan 09
Posts: 20
Credit: 616,384
RAC: 0
Level
Gly
Message 10470 - Posted: 10 Jun 2009 | 13:00:17 UTC - in response to Message 10459.

I tried slowing down my GPU, but there is still the same problem with WUs from GPU-Grid.
SETI@home beta runs with 100% success.
It feels like a waste of time to continue crunching for GPU-Grid as long as this problem isn't solved.

<core_client_version>6.6.33</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# Using CUDA device 0
# Device 0: "GeForce 9800 GTX/9800 GTX+"
# Clock rate: 1850000 kilohertz
# Total amount of global memory: 1073741824 bytes
# Number of multiprocessors: 16
# Number of cores: 128
# Amber: readparm : Reading parm file parameters
# PARM file in AMBER 7 format
# Encounter 10-12 H-bond term
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
MDIO ERROR: cannot open file "restart.coor"
------------
<core_client_version>6.6.33</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# Using CUDA device 0
# Device 0: "GeForce 9800 GTX/9800 GTX+"
# Clock rate: 1850000 kilohertz
# Total amount of global memory: 1073741824 bytes
# Number of multiprocessors: 16
# Number of cores: 128
MDIO ERROR: cannot open file "restart.coor"
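
For anyone comparing many of these outputs, here is a minimal sketch in Python that separates the `WARNING:` lines from lines that mention an error, so it is easier to see whether a failed task reported anything besides the 10-12 H-bond warning. The function name `classify_stderr` and the matching rules are assumptions for illustration only; they are not part of BOINC or the GPU-Grid application, and a line landing in the "errors" list does not prove it caused the failure.

```python
import re


def classify_stderr(text):
    """Split task stderr text into (warnings, errors) lists of lines.

    Assumed heuristic: lines starting with "WARNING:" are warnings; any other
    line containing the word "error" (case-insensitive) is counted as an error.
    """
    warnings, errors = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("WARNING:"):
            warnings.append(line)
        elif re.search(r"\berror\b", line, re.IGNORECASE):
            errors.append(line)
    return warnings, errors


# Lines copied from the output posted above.
sample = """# Using CUDA device 0
# Encounter 10-12 H-bond term
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
MDIO ERROR: cannot open file "restart.coor"
"""
warnings, errors = classify_stderr(sample)
print(len(warnings), "warnings;", len(errors), "error lines")
for line in errors:
    print("  ", line)
```

Running this over both successful and failed results would show whether the warning appears in both groups, which is what one would expect if it has no side effects.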

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Message 10513 - Posted: 12 Jun 2009 | 21:46:52 UTC

@Hydro: seems like you're back up'n running. Was it "just" the downclocking?

@Ulf: you reverted to the older 182.50 driver, which is known to be good and you still have the error. So it does not look like software. I'd suppose hardware, although you also already tried downclocking. Further evidence: your WUs take a long time until they fail, which is typical for temperature / hardware failures just at the edge of stability. Seti doesn't use the GPU as hard as GPU-Grid does, so it could still run nevertheless. Try running 3D Mark and / or FurMark for an hour.

jrobbio wrote:
I thought that we were going to be moving up to CUDA 2.2


The next client is going to be 2.2, but no reason to hurry.

Hydro wrote:
Is it known what side effects this "Found zero 10-12" warning has ? Is it being investigated ?


I think Ignasi said this warning is nothing to worry about (for us). Sounds like "no side effects are known".

MrS
____________
Scanning for our furry friends since Jan 2002

Hydropower
Joined: 3 Apr 09
Posts: 70
Credit: 6,003,024
RAC: 0
Level
Ser
Message 10526 - Posted: 13 Jun 2009 | 9:29:40 UTC - in response to Message 10513.

Hi, not sure; the absence of GPU3 may have something to do with it too. I currently have the remaining 6 mildly overclocked to 633 (as that is an EVGA-advertised speed for G200-based cards). So far so good. GPU3(5) has been RMA'd and its slot is currently empty. Fans are at 89% with temperatures not over 65°C. It was not a power issue, as there is plenty.

Also not a driver issue (at least not with 3 cards) because it still is the same driver.

I may try linux 64 today.

regards H.

ExtraTerrestrial Apes wrote:
@Hydro: seems like you're back up'n running. Was it "just" the downclocking?

@Ulf: you reverted to the older 182.50 driver, which is known to be good and you still have the error. So it does not look like software. I'd suppose hardware, although you also already tried downclocking. Further evidence: your WUs take a long time until they fail, which is typical for temperature / hardware failures just at the edge of stability.

Hydro wrote:
Is it known what side effects this "Found zero 10-12" warning has ? Is it being investigated ?


I think Ignasi said this warning is nothing to worry about (for us). Sounds like "no side effects are known".

MrS


____________
Join team Bletchley Park, the innovators.

Hydropower
Joined: 3 Apr 09
Posts: 70
Credit: 6,003,024
RAC: 0
Level
Ser
Message 10527 - Posted: 13 Jun 2009 | 10:18:54 UTC - in response to Message 10526.

Ubuntu 8 failed installation because of:
MP-BIOS bug, 8254 timer not connected to IO-APIC

Ubuntu 9 failed because it cannot detect my CD ROM, after booting from CD ROM...

____________
Join team Bletchley Park, the innovators.

mikaok
Joined: 16 Jan 09
Posts: 12
Credit: 639,094
RAC: 0
Level
Gly
Message 10529 - Posted: 13 Jun 2009 | 11:11:22 UTC - in response to Message 10513.

Same error. The GPU isn't OC'ed and the driver version is 182.08.

cheers
Mika

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Message 10531 - Posted: 13 Jun 2009 | 14:11:23 UTC - in response to Message 10529.

Your error "Incorrect function. (0x1) - exit code 1 (0x1)" is a very general one which, roughly speaking, can happen due to anything going wrong during the calculation.

MrS
____________
Scanning for our furry friends since Jan 2002

Hydropower
Joined: 3 Apr 09
Posts: 70
Credit: 6,003,024
RAC: 0
Level
Ser
Message 10532 - Posted: 13 Jun 2009 | 14:45:47 UTC - in response to Message 10531.

The error is caused by "Cuda error: Kernel [frc_sum_nb_forces] failed in file 'f". Not much overclocked at 1700 MHz compared to stock 1650 for a GTS 8800.

Again I think, even for overclocking tests, a good shader testing program would be useful.
____________
Join team Bletchley Park, the innovators.

mikaok
Joined: 16 Jan 09
Posts: 12
Credit: 639,094
RAC: 0
Level
Gly
Message 10535 - Posted: 13 Jun 2009 | 15:05:23 UTC - in response to Message 10531.

ExtraTerrestrial Apes wrote:
Your error "Incorrect function. (0x1) - exit code 1 (0x1)" is a very general one which, roughly speaking, can happen due to anything going wrong during the calculation.

MrS


OK, I thought this was the same error we were talking about. My bad.

Hydropower, this is an XFX version of the card, so it is guaranteed to work with these clocks.

Hydropower
Joined: 3 Apr 09
Posts: 70
Credit: 6,003,024
RAC: 0
Level
Ser
Message 10537 - Posted: 13 Jun 2009 | 16:16:59 UTC - in response to Message 10535.

That's what I mean, only 3%.
____________
Join team Bletchley Park, the innovators.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Message 10539 - Posted: 13 Jun 2009 | 20:31:44 UTC - in response to Message 10532.

Hydro wrote:
The error is caused by "Cuda error: Kernel [frc_sum_nb_forces] failed in file 'f"


I'm quite convinced that it's a transient error, which means at some point some calculation threw out a bad result. That means it wouldn't matter in which file and on which line of code it happened, unless we discovered some regularity.

I'm not disagreeing with you, but IMO saying "The error is caused by.." probably misses the point.

Hydro wrote:
Again I think, even for overclocking tests, a good shader testing program would be useful.


We don't have the perfect tool yet, but I think if a card survives FurMark for an hour without artefacts it should be fine for GPU-Grid. Yes, it doesn't run exactly the same code (but nothing except GPU-Grid itself could do that), so there might be problems where only certain combinations of instructions trigger errors. But FurMark stresses the cards so hard that it could almost be called a thermal virus, and it should easily generate 20 - 30°C more than GPU-Grid (at constant fan speed). This reduces the maximum stable frequency by quite a bit, and thus errors are much more likely to show up.
Good old 3D Mark also has error detection built in. It's far from perfect, but if you can't finish it you know you're in trouble (it doesn't work the other way around, though).

MrS
____________
Scanning for our furry friends since Jan 2002
