
Message boards : Graphics cards (GPUs) : Recent task results/errors

Andrew
Joined: 9 Dec 08
Posts: 29
Credit: 18,754,468
RAC: 0
Message 10683 - Posted: 19 Jun 2009 | 15:16:39 UTC

I refer to my results at 'http://www.gpugrid.net/results.php?userid=10711'

My graphics driver just restarted (screen going blank then when it comes back a message bubble pops up) and indeed this just happened again while writing this message. Looking at the task type, it was a GIANNI this time, so rather than blaming the IBUCH as I was going to, I think I'm going to suspect the CD I'm ripping to MP3.

The CD ripping is hogging one of my two CPU cores - and with BOINC running CPU tasks too I could hear the fan on the gfx card spinning down and up again, as if it was not getting sent enough work. However BOINC CPU tasks have process priority 4 and acemd has priority 6, so you would have thought that the CUDA app would have taken priority on the non-CD-ripping core.

Maybe there was memory congestion? Who knows.

Anyway, probably not best to run CUDA apps while ripping CDs... (p.s. I have burnt at 4x before with BOINC running, but CDs read at 48x so it may be more strain)

I also have one of the ridiculously long tasks that was cancelled. Seems to have claimed 45k credit but been granted none?

Bambi_72
Joined: 27 May 09
Posts: 2
Credit: 1,136,521
RAC: 0
Message 10690 - Posted: 19 Jun 2009 | 17:23:46 UTC - in response to Message 10683.

My fubar WUs outnumber my successful ones by 4 to 1, so GPUGrid has seen the last of me after this batch of work until it becomes more stable. Seti CUDA work almost never fails, maybe 1 in 100, whereas GPUGrid seems to give 4 or 5 failures for every success, and it usually fails after 12-16 hrs of crunching.
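
For scale, the ratios quoted in this post can be turned into failure rates. This is only a sketch of the arithmetic: `failure_rate` is a hypothetical helper, and the counts are the rough figures from the post, not measured data.

```python
def failure_rate(failed, succeeded):
    """Fraction of finished tasks that failed."""
    total = failed + succeeded
    if total == 0:
        raise ValueError("no finished tasks")
    return failed / total


# Rough figures quoted in the post (estimates, not measurements).
gpugrid_low = failure_rate(4, 1)    # "4 fails for every success"
gpugrid_high = failure_rate(5, 1)   # "5 fails for every success"
seti = failure_rate(1, 99)          # "maybe 1 in 100"

print(f"GPUGrid: {gpugrid_low:.0%} to {gpugrid_high:.0%} failed")
print(f"Seti:    {seti:.0%} failed")
```

On these numbers the difference is roughly 80% of tasks failing versus about 1%, which is far larger than a difference in application behaviour alone would normally explain.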

Andrew
Joined: 9 Dec 08
Posts: 29
Credit: 18,754,468
RAC: 0
Message 10693 - Posted: 19 Jun 2009 | 18:02:16 UTC

Hmm how fascinating. I wonder which card you have.

My 8800GT's been crunching for about 3 months, 24-7; there were some problems relating to a bug in Nvidia's CUDA drivers (CUDA being a relatively new technology), but apart from that, nearly all of my tasks have succeeded. It's just today, and on different tasks, so I think it's probably my computer being weird.

For the 200 series, you can see in other posts that people have been having trouble with different driver versions - I've played it safe and haven't 'upgraded' my drivers while it's working fine.

People will tell you to check any overclocking etc, but if you know what you're doing I assume you've checked that.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 10703 - Posted: 19 Jun 2009 | 20:01:02 UTC

Andrew,

was the fan slowing down way before the black screen or shortly before / during it?

Bambi,

mainly you get the error "Incorrect function. (0x1) - exit code 1 (0x1)", which is a very general one and only tells you that something went wrong during computation. I don't want to be mean, but I suspect it's your system.. could be GPU OC, GPU temperature, CPU OC, CPU temperature, system or GPU memory, power supply, driver. Seti stresses the GPU less than GPU-Grid, and I suspect it's more fault tolerant in the sense that calculation errors often don't influence the task output. It's about finding interesting signals in noisy data.. so if there are no signals then it's not important if some result is wrong.

You may want to compare your error rate with mine, for an OC'ed 9800GTX+ with Vista 64 and driver 185.66.. i.e. quite similar to your setup.

MrS
____________
Scanning for our furry friends since Jan 2002

Andrew
Joined: 9 Dec 08
Posts: 29
Credit: 18,754,468
RAC: 0
Message 10715 - Posted: 20 Jun 2009 | 8:39:56 UTC - in response to Message 10703.

Hi,

To be honest, I actually had the fan on constant, but I was trying to communicate best as I can...I actually saw the temperature graph going up and down in hills. (that wasn't so hard, should have just said that lol). I have Rivatuner set up so that I leave it at the highest fan speed that doesn't bother me, and then it downclocks from 120% clocks to 100% if temp goes over 78, 100% to 75% if temp goes over 80 etc.
So at a constant fan rate, it looked like the GPU was only getting work periodically, i.e. the temp would go up to 78, then drop off for 15 seconds or so, then would get more work and the temp would rise again.

CDex always seems to hog the computer a bit even at normal priority. Maybe it's something to do with data transfer but I thought CD drives are on the southbridge while the GPU is on the northbridge right?

It happened shortly after a restart; I usually leave my PC crunching 24-7, but gave it a restart, and after some web browsing, started ripping some CDs, with BOINC going. Halfway through the first CD the screen blacked out (with just the cursor left)

I did actually try ripping the 3rd CD with BOINC going, to see whether it was consistent, but it was fine! (If that had shown problems I might have fingered the power supply.)

Note that I haven't really had any problems before, apart from that FFT bug in the drivers you had trouble with and once when I tried to run another CUDA app at the same time.
It's working now, I'm not too fussed.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 10720 - Posted: 20 Jun 2009 | 10:24:58 UTC - in response to Message 10715.

Thanks for your detailed report. I'm not sure what it means, but: if your ambient temp is borderline to one of your clock speed limits, would that mean the card is frequently changing clock speeds, maybe once every 15 to 30s? In the early days of Folding@home on ATI X1900 cards you couldn't change the clock speed on the fly at all - the WUs would crash immediately. So I wonder if this may cause occasional errors.

Regarding CDex: that's the tool I'm using as well, and in recent years I've found CPU usage to be somewhere in the 10 - 30% range.. so it's the optical drive which limits the encoding and I can't feel any negative effects / hogging. I suspect the "error during ripping" was just a coincidence.

MrS
____________
Scanning for our furry friends since Jan 2002

Andrew
Joined: 9 Dec 08
Posts: 29
Credit: 18,754,468
RAC: 0
Message 10762 - Posted: 21 Jun 2009 | 18:24:12 UTC - in response to Message 10720.

I actually go by core temp - RivaTuner only checks in the up direction to downclock, and the upclock threshold is much lower, to avoid oscillating temperatures/clocks. Interesting theory about the changing of clock speeds causing the problem, though it probably happens about twice a WU (if it's a hot one) and hasn't caused problems before.
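
The asymmetric thresholds described here amount to hysteresis, and a minimal sketch may make the behaviour easier to picture. It is not RivaTuner's actual logic or API. The 120/100/75% levels and the 78 and 80 downclock limits come from the earlier message (presumably degrees C); `UP_BELOW = 70` is an assumed value, since the post only says the upclock threshold is "much lower"; `next_clock` and `simulate` are hypothetical names.

```python
# Clock levels as a percentage of stock, highest first (values from the post).
LEVELS = [120, 100, 75]
# Step down from a level when the temperature exceeds its limit (from the post).
DOWN_ABOVE = {120: 78, 100: 80}
# ASSUMPTION: the post only says the upclock threshold is "much lower".
UP_BELOW = 70


def next_clock(clock, temp):
    """Return the clock level to use after one temperature reading."""
    i = LEVELS.index(clock)
    if clock in DOWN_ABOVE and temp > DOWN_ABOVE[clock]:
        return LEVELS[i + 1]  # too hot: step down one level
    if i > 0 and temp < UP_BELOW:
        return LEVELS[i - 1]  # well below the limits: step up one level
    return clock              # inside the dead band: hold the current level


def simulate(temps, clock=120):
    """Apply next_clock to a sequence of readings and return the clock history."""
    history = []
    for temp in temps:
        clock = next_clock(clock, temp)
        history.append(clock)
    return history


print(simulate([77, 79, 77, 79, 81, 76, 69, 69]))
```

Because the step-up threshold sits well below both step-down limits, a temperature hovering around 78 triggers one downclock and then holds, rather than flipping the clock on every reading.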

More likely that something in the system got messed up I guess. If something happens again I will take your advice and check whether the clock speed could have changed shortly before. Think it's unlikely, but then again that's how we discover such correlations, by checking.

btw CDex on my system does hog a whole core, but your system is probs beefier than mine. Mine was handed down (great - I don't have to buy one myself!) so it's about 2 years old now, with a 2.4 GHz E6600 Core 2 processor. The processor speed is enough for me - games are mostly GPU-bound and I don't tend to rip or transcode often.

Thanks for your posts.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 10764 - Posted: 21 Jun 2009 | 20:12:49 UTC - in response to Message 10762.
Last modified: 21 Jun 2009 | 20:13:35 UTC

Alright, changing clock speed twice a WU should be fine. From your post it could have been much more frequently. So I agree, we can just attribute it to "something got messed up" for now.
Well, my system is not that much beefier: an early Q6600 running at 3 GHz. Luckily I don't *have* to upgrade it anytime soon, as the GPUs are doing all the major crunching now :D
Could be that you're using a different encoder / more stressful settings.

MrS
____________
Scanning for our furry friends since Jan 2002

(_KoDAk_)
Joined: 18 Oct 08
Posts: 43
Credit: 6,924,807
RAC: 0
Message 10944 - Posted: 1 Jul 2009 | 17:22:11 UTC
Last modified: 1 Jul 2009 | 17:23:15 UTC

http://www.gpugrid.net/result.php?resultid=900065

# Encounter 10-12 H-bond term
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
Cuda error: Kernel [mshake_position] failed in file 'mshake.cu' in line 146 : unknown error.

http://www.gpugrid.net/result.php?resultid=897315

# Encounter 10-12 H-bond term
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
MDIO ERROR: cannot open file "restart.coor"
Cuda error in file '..\cuda/cutil.h' in line 982 : unknown error.

http://www.gpugrid.net/result.php?resultid=883059

# Device 0: "GeForce GTS 250"
# Clock rate: 1836000 kilohertz
# Total amount of global memory: 536870912 bytes
# Number of multiprocessors: 16
# Number of cores: 128
# Device 1: "GeForce 9800 GTX/9800 GTX+"
# Clock rate: 1836000 kilohertz
# Total amount of global memory: 536870912 bytes
# Number of multiprocessors: 16
# Number of cores: 128
Cuda error in file '..\cuda/cutil.h' in line 982 : unknown error.
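
When comparing many results like the ones above, it can help to sort the stderr lines by type. The following is a rough sketch, not a GPUGrid or BOINC tool: the regular expressions are written only against the line formats quoted in this post, and `classify` and `summarize` are hypothetical helper names.

```python
import re
from collections import Counter

# Patterns derived only from the log lines quoted in this thread.
PATTERNS = [
    ("cuda_kernel_error", re.compile(
        r"^Cuda error: Kernel \[(?P<kernel>[^\]]+)\] failed in file "
        r"'(?P<file>[^']+)' in line (?P<line>\d+)")),
    ("cuda_error", re.compile(
        r"^Cuda error in file '(?P<file>[^']+)' in line (?P<line>\d+)")),
    ("mdio_error", re.compile(r"^MDIO ERROR: (?P<msg>.+)")),
    ("warning", re.compile(
        r"^WARNING: (?P<file>[^,]+), line (?P<line>\d+): (?P<msg>.+)")),
]


def classify(line):
    """Return (kind, fields) for one stderr line; kind is 'other' if unmatched."""
    line = line.strip()
    for kind, pattern in PATTERNS:
        match = pattern.match(line)
        if match:
            return kind, match.groupdict()
    return "other", {}


def summarize(text):
    """Count line kinds in a block of stderr output, skipping blank lines."""
    return Counter(classify(l)[0] for l in text.splitlines() if l.strip())


# Second result quoted above (raw string keeps the backslash in the path).
SAMPLE = r"""
# Encounter 10-12 H-bond term
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
MDIO ERROR: cannot open file "restart.coor"
Cuda error in file '..\cuda/cutil.h' in line 982 : unknown error.
"""

for kind, count in summarize(SAMPLE).items():
    print(kind, count)
```

Both error forms end in "unknown error", so the most useful fields for grouping are the kernel name (when present) and the file and line number.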
