Message boards : Graphics cards (GPUs) : Stability of the WUs

RobertN
Joined: 18 Nov 09
Posts: 7
Credit: 52,996,450
RAC: 0
Message 23214 - Posted: 2 Feb 2012 | 12:31:11 UTC

Hey hey,

I love the new badge system and so on, but I am quite worried about something else. Lately an increasing number of jobs have ended with computing errors. At the moment it happens so often that I think I would be better off putting my GPU on another project.
I dunno what causes it. I have the latest beta drivers from Nvidia installed here (290.53). They fixed the display driver crashes (caused by Firefox's hardware acceleration, I think). When the display driver crashes, the GPUGRID task crashes too (I have not seen an exception to that yet). Besides that, the GPUGRID workunits still crash too often for other reasons, like: http://www.gpugrid.net/result.php?resultid=4879675

An incorrect function, how is that possible?

I would very much appreciate some effort going into making the workunits more stable.

Regards, iconized.

Evil Penguin
Joined: 15 Jan 10
Posts: 42
Credit: 18,255,462
RAC: 0
Message 23216 - Posted: 2 Feb 2012 | 14:24:33 UTC

Are you overclocking the GPU?

nenym
Joined: 31 Mar 09
Posts: 137
Credit: 1,185,946,693
RAC: 36,161
Message 23220 - Posted: 2 Feb 2012 | 16:09:54 UTC - in response to Message 23214.

Try changing the core clock to 880 - 890 MHz. I had the same problem with a factory-OCed GTX 560 Ti.

RobertN
Joined: 18 Nov 09
Posts: 7
Credit: 52,996,450
RAC: 0
Message 23224 - Posted: 3 Feb 2012 | 6:22:29 UTC

Yes, a factory OC of 900 MHz, quite a bit higher than 822 MHz (the norm). I will lower it. Thanks for the replies!

Grutte Pier [Wa Oars]~MAB...
Joined: 8 Jan 12
Posts: 20
Credit: 5,132,859
RAC: 0
Message 23253 - Posted: 5 Feb 2012 | 12:01:01 UTC
Last modified: 5 Feb 2012 | 12:01:59 UTC

At the moment neither of my cards can even finish a WU at stock speed, while the first few WUs were done with a reasonable OC.

For instance:

Stderr output

<core_client_version>6.10.60</core_client_version>
<![CDATA[
<message>
The system cannot find the path specified. (0x3) - exit code 3 (0x3)
</message>
<stderr_txt>
# Using device 0
# There is 1 device supporting CUDA
# Device 0: "GeForce GTX 460"
# Clock rate: 1.53 GHz
# Total amount of global memory: 804847616 bytes
# Number of multiprocessors: 7
# Number of cores: 56
MDIO: cannot open file "restart.coor"
SWAN: FATAL : swanMemcpyDtoH failed

Assertion failed: 0, file swanlib_nv.c, line 390

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.

</stderr_txt>
]]>


and


Stderr output

<core_client_version>6.10.60</core_client_version>
<![CDATA[
<message>
- exit code 98 (0x62)
</message>
<stderr_txt>
# Using device 0
# There is 1 device supporting CUDA
# Device 0: "GeForce GTX 460"
# Clock rate: 1.84 GHz
# Total amount of global memory: 1073283072 bytes
# Number of multiprocessors: 7
# Number of cores: 56
SWAN: Using synchronization method 0
MDIO: cannot open file "restart.coor"
# Using device 0
# There is 1 device supporting CUDA
# Device 0: "GeForce GTX 460"
# Clock rate: 1.80 GHz
# Total amount of global memory: 1073283072 bytes
# Number of multiprocessors: 7
# Number of cores: 56
SWAN: Using synchronization method 0
MDIO: cannot open file "restart.coor"
ERROR: # Energies have become nan

called boinc_finish

</stderr_txt>
]]>
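
To compare runs like the two above, the device header lines can be pulled out of the stderr text. This is a minimal sketch, assuming the header format shown in these logs; `parse_device_headers` is a hypothetical helper, not part of BOINC or the GPUGRID application.

```python
import re

# Header lines copied from the second stderr output quoted above.
STDERR = """\
# Device 0: "GeForce GTX 460"
# Clock rate: 1.84 GHz
# Total amount of global memory: 1073283072 bytes
# Device 0: "GeForce GTX 460"
# Clock rate: 1.80 GHz
# Total amount of global memory: 1073283072 bytes
"""

def parse_device_headers(text):
    """Return one dict per '# Device N' block found in the stderr text."""
    devices = []
    for line in text.splitlines():
        m = re.match(r'# Device (\d+): "(.+)"', line)
        if m:
            # A new device block starts here (one per task start).
            devices.append({"index": int(m.group(1)), "name": m.group(2)})
            continue
        if not devices:
            continue  # ignore anything before the first device line
        m = re.match(r"# Clock rate: ([\d.]+) GHz", line)
        if m:
            devices[-1]["clock_ghz"] = float(m.group(1))
            continue
        m = re.match(r"# Total amount of global memory: (\d+) bytes", line)
        if m:
            devices[-1]["memory_mib"] = int(m.group(1)) / 2**20
    return devices

runs = parse_device_headers(STDERR)
for run in runs:
    print(run)
```

On the excerpt above this yields two blocks for the same card, with reported clock rates of 1.84 GHz and 1.80 GHz, which makes differences between task starts easy to spot.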

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 23256 - Posted: 5 Feb 2012 | 15:12:40 UTC - in response to Message 23253.
Last modified: 5 Feb 2012 | 17:26:50 UTC

When you start getting errors you should make some observations: temperatures of the GPU, CPU and board, fan speeds, task failure types, and system usage at the time of failure.
There are several generic things you can do:

Restart the system (stops system-related runaway errors),
Increase fan speed / improve ventilation (reduces temps),
Free up a CPU core/thread (stops some heartbeat issues),
Reduce CPU clocks if the CPU is overclocked (reduces system temperature and motherboard/component overheating issues, especially the chipset),
Reduce GPU clocks (start by trying to reduce the memory speed, then move on to the GPU if need be, but you shouldn't have to reduce them by more than 10%),
Roll back, reinstall or upgrade drivers,
Increase GPU voltage very slightly.

If none of these work, there's more:
Clean the GPU and system,
Reset the BIOS,
Re-seat the GPU (take it out, reboot, power down, re-seat the GPU),
Restore or reinstall the operating system.
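
For the "task failure types" observation, a small script can tally which known error strings show up in a task's stderr output. This is only a sketch: the signature strings are copied from the logs quoted in this thread, while the labels and the `classify_failure` helper are illustrative assumptions, not official GPUGRID categories.

```python
# Signature strings copied from stderr outputs quoted in this thread.
# The labels are illustrative descriptions, not official categories.
SIGNATURES = {
    "swanMemcpyDtoH failed": "device-to-host copy failed",
    "Energies have become nan": "simulation produced NaN energies",
    "swanlib_nv.c": "assertion in swanlib_nv.c",
    "Incorrect function": "exit code 1 (incorrect function)",
}

def classify_failure(stderr_text):
    """Return the labels of all known signatures present in stderr_text."""
    return [label for needle, label in SIGNATURES.items()
            if needle in stderr_text]

sample = (
    "SWAN: FATAL : swanMemcpyDtoH failed\n"
    "Assertion failed: 0, file swanlib_nv.c, line 390\n"
)
print(classify_failure(sample))
```

Keeping a note of these labels alongside temperatures and clocks at the time of failure makes it easier to see whether one failure type dominates.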

Grutte Pier [Wa Oars]~MAB...
Joined: 8 Jan 12
Posts: 20
Credit: 5,132,859
RAC: 0
Message 23258 - Posted: 5 Feb 2012 | 18:31:09 UTC
Last modified: 5 Feb 2012 | 18:38:06 UTC

Well, I came from MW@H, with a GTX460FTW/920 and a GTX460SC/835 which also worked fine on S@H, because they ran out of WUs.
After a few WUs trouble arose, but I had not changed any settings or anything else.
No problems with temps; if temps go up I take my compressor and clean the whole lot.
So I went back to stock speed: no results.
I tried SWAN_SYNC=0 and freed a core, but then I saw "Suspend work when non-BOINC CPU usage is above 25%", which is a bit strange if you set the GPU to work always.
This "Suspend" comes back irregularly without me having changed a thing.
It is getting extremely annoying.
I am just getting a bit tired of trying everything again and again.
Perhaps later I'll try a clean install of everything, replacing an AM2+ mobo with an Asrock 870 Xtreme, 4 GB DDR2 with 8 GB DDR3 memory, and a HD.

RobertN
Joined: 18 Nov 09
Posts: 7
Credit: 52,996,450
RAC: 0
Message 23265 - Posted: 6 Feb 2012 | 10:27:32 UTC - in response to Message 23258.
Last modified: 6 Feb 2012 | 11:12:17 UTC

My factory OC-ed 560 Ti normally:
core clock: 900 MHz
shader clock: 1800 MHz
memory clock: 2004 MHz

I have been running a bit lower for a couple of days now:
core clock: 800 MHz
shader clock: 1600 MHz
memory clock: 1800 MHz
These are all MSI Afterburner numbers, so there might be some rounding errors.
I also ran a video memory stress test (vmt) with these settings and had no problems.
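
For reference, the size of that reduction can be worked out from the numbers listed above. This is a quick sketch using only those Afterburner values; the dictionary names and the `reduction_percent` helper are made up for illustration.

```python
# Clock values in MHz as listed above (MSI Afterburner readings).
FACTORY = {"core": 900, "shader": 1800, "memory": 2004}
REDUCED = {"core": 800, "shader": 1600, "memory": 1800}

def reduction_percent(before, after):
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100

reductions = {
    name: round(reduction_percent(FACTORY[name], REDUCED[name]), 1)
    for name in FACTORY
}
print(reductions)  # {'core': 11.1, 'shader': 11.1, 'memory': 10.2}
```

So the current settings are roughly 10 to 11% below the factory overclock on all three clocks.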

I keep getting errors and all the latest errors are with the NATHAN units:
http://www.gpugrid.net/results.php?hostid=111996
All latest units producing errors gave this error:
Incorrect function. (0x1) - exit code 1 (0x1)

I don't have problems with other GPU projects (PrimeGrid, Mfaktc for GIMPS) with the factory OC.
So perhaps it is a problem with those work units?

Grutte Pier [Wa Oars]~MAB...
Joined: 8 Jan 12
Posts: 20
Credit: 5,132,859
RAC: 0
Message 23267 - Posted: 6 Feb 2012 | 15:00:10 UTC
Last modified: 6 Feb 2012 | 15:44:25 UTC

A clean upgrade to XP 64, installing a fresh BOINC and 258.96, and running the card on stock settings didn't get rid of the errors.

Tried it before and got errors then too. It looks like GPUGRID is not for me.

FYI: 23 NATHANs and 1 GIANNI.

Grutte Pier [Wa Oars]~MAB...
Joined: 8 Jan 12
Posts: 20
Credit: 5,132,859
RAC: 0
Message 23268 - Posted: 6 Feb 2012 | 15:02:57 UTC
Last modified: 6 Feb 2012 | 15:37:47 UTC

Jeez, my SGS is stuttering.

JSTL
Joined: 21 Dec 11
Posts: 2
Credit: 417,677,802
RAC: 0
Message 23303 - Posted: 8 Feb 2012 | 5:03:32 UTC

I have the exact same issue (Assertion failed: 0, file swanlib_nv.c, line 390) since I updated to Nvidia's beta drivers (295.51). Everything was working nicely prior to that.

I don't believe that's a coincidence.

RobertN
Joined: 18 Nov 09
Posts: 7
Credit: 52,996,450
RAC: 0
Message 23373 - Posted: 9 Feb 2012 | 19:41:07 UTC

This is also funny:
http://www.gpugrid.net/workunit.php?wuid=3127645
http://www.gpugrid.net/workunit.php?wuid=3127506

But not correlated.

Damaraland
Joined: 7 Nov 09
Posts: 152
Credit: 16,181,924
RAC: 0
Message 23560 - Posted: 19 Feb 2012 | 22:58:22 UTC - in response to Message 23303.

I think I have the same problem, with a brand new GPU, hardware and distro.

This was my first WU here, but the card is crunching well at Einstein@Home.

acemd2_6.14_x86_64-pc-linux-gnu__cuda31: swanlib_nv.c:388: error: Assertion `0' failed.

One weird thing I noticed is that the screen had a "scrambled" image.
I restarted... let's see how it looks tomorrow.
