Message boards : Graphics cards (GPUs) : GTX 295 and nothing but errors

Spear
Joined: 28 Jan 09
Posts: 19
Credit: 15,297,622
RAC: 0
Message 8191 - Posted: 4 Apr 2009 | 23:17:34 UTC

The machine's here

http://www.gpugrid.net/results.php?hostid=24557

Originally I could do workunits. Some would still fail. After upgrading to any driver later than 182.02, nothing will run. Every workunit runs for several seconds before erroring out. I don't believe it's hardware related, as I've been able to run units before. I've also tried dropping the clock rates by about 20%, and still the same. No games show any errors or instability either. Any suggestions would be very welcome as a lot of potential work is going undone.

Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Message 8199 - Posted: 5 Apr 2009 | 6:59:22 UTC - in response to Message 8191.

First suggestion, down-level the drivers to the ones that used to work ... see if they still work ...

Spear
Message 8205 - Posted: 5 Apr 2009 | 12:14:03 UTC - in response to Message 8199.

I've tried going back to 181.20 which worked before somewhat, but still only errors.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 8207 - Posted: 5 Apr 2009 | 12:48:02 UTC

This is really tough, as you already tried quite a few things.

- it looks like it's not hardware-related, as you already downclocked and the tasks fail immediately at the beginning, so it can't be heat either. Just to make sure: did you downclock everything, core, shader and memory?

- if it was only the new drivers, downgrading them should help. Maybe try to use "Driver cleaner" before reinstalling the one which is known to work? Or do you have a system restore point from before the new drivers?

- and just to be sure: did you try powering off and removing the power cord for >10 mins?

- does 3D Mark still run?

MrS
____________
Scanning for our furry friends since Jan 2002

Spear
Message 8210 - Posted: 5 Apr 2009 | 13:45:37 UTC - in response to Message 8207.

This is really tough, as you already tried quite a few things.

- it looks like it's not hardware-related, as you already downclocked and the tasks fail immediately at the beginning, so it can't be heat either. Just to make sure: did you downclock everything, core, shader and memory?


The lot was backed off. Rather than failing in 4 seconds, the units ran for 7 seconds before failing, which suggests either slower processing before the failure or at least some response to the change. I can't believe it's heat either: the case is well ventilated, and the card temperature starts at 55C and only rises to about 64C under full load.


- if it was only the new drivers, downgrading them should help. Maybe try to use "Driver cleaner" before reinstalling the one which is known to work? Or do you have a system restore point from before the new drivers?


Downgrading didn't help.


- and just to be sure: did you try powering off and removing the power cord for >10 mins?


The machine's been reset, but admittedly not fully powered off. I'll try it later, can't hurt.


- does 3D Mark still run?


All games are completely fine so far: no artifacts, no instability.

ExtraTerrestrial Apes
Message 8213 - Posted: 5 Apr 2009 | 14:31:29 UTC - in response to Message 8210.

It really sounds like the new driver changed something permanently, which is not undone by the driver downgrade. Now I remember, I think we had a similar report before.. not sure though, if the guy used a 295 or a 9800GX2. And I can't remember if the problem was solved..

You could try this or something similar to remove the NV driver completely (more "completely" than by just overwriting things with the older version).

MrS

Spear
Message 8215 - Posted: 5 Apr 2009 | 15:31:39 UTC - in response to Message 8213.

I'm inclined to think it's something in the software environment as well. I'm going to drain my work cache, and clean out the BOINC installation completely and install afresh. As it stands now, the current folder has been inherited from an older XP 64 machine, then a Vista 64 machine, then the current machine, not to mention various development versions.

Paul D. Buck
Message 8217 - Posted: 5 Apr 2009 | 15:58:52 UTC - in response to Message 8215.

If you are going to that extent, you might want to consider a fresh install of the OS.

In a case like this I would copy off the BOINC folder, install the OS, and before doing updates copy back the BOINC folder and install the version of BOINC I wanted... then as I did updates and driver installs I would be running BOINC in the background... since XP Pro takes 3-6 hours to do all the SP updates and drivers, that is a way to not lose productive time ...

Spear
Message 8220 - Posted: 5 Apr 2009 | 16:13:53 UTC - in response to Message 8217.

If you are going to that extent, you might want to consider a fresh install of the OS.


Only a minor hassle, with Seti, Cosmology and Einstein all suffering outages recently it won't take long, just need to wait a day or two.

ExtraTerrestrial Apes
Message 8221 - Posted: 5 Apr 2009 | 17:31:35 UTC - in response to Message 8215.

This plan is OK, but I don't think it's the BOINC install. First argument: BOINC only launches GPU-Grid.. there's not much it could do to make the tasks error out, even if it wanted to. And the 2nd: it happened when you upgraded the vid driver, not when you upgraded BOINC, right?

MrS

Spear
Message 8222 - Posted: 5 Apr 2009 | 18:03:47 UTC - in response to Message 8221.

Agreed, it won't be the only possibility I'll be following. I'll still be trying the driver cleaner before that as well.

Spear
Message 8230 - Posted: 5 Apr 2009 | 20:58:56 UTC

I've uninstalled anything nVidia related and ran the driver cleaner. Reinstalled afterwards, and still no improvement.


I did find this small app to perform CUDA based memory checks on the card.

http://www.softpedia.com/progDownload/CUDA-MemTest-Download-121066.html

Though my card passes the tests.

ExtraTerrestrial Apes
Message 8249 - Posted: 6 Apr 2009 | 16:42:35 UTC

That's too bad. Now there are 2 options left: clean and reinstall BOINC, or reinstall Windows. Or could you test the card in some other computer, maybe even one which is known to work with GPU-Grid? Or test some single-GPU card in your PC?

MrS

Spear
Message 8259 - Posted: 6 Apr 2009 | 18:49:44 UTC - in response to Message 8249.
Last modified: 6 Apr 2009 | 18:50:00 UTC

I'm not too sure it's something inherent to the operating system. I can run the Distributed.net CUDA client without issue.

Spear
Message 8261 - Posted: 6 Apr 2009 | 19:31:02 UTC

Removed BOINC completely, including wiping the folders and all the cruft they'd accumulated. I also turned off SLI.

And now I've two units running simultaneously and correctly.

When I tried without the SLI before, they would still break. So I'm not sure if it's the combo of reinstall and non-SLI or not that fixed it.

Spear
Message 9183 - Posted: 2 May 2009 | 0:39:17 UTC

The last post that suggested it was working was premature. It actually errored on one core, though this time after more than a minute, rather than the prior 8 seconds. I do now have it fixed. A combination of BOINC 6.6.25 and the newest 185.81 drivers works.

Paul D. Buck
Message 9186 - Posted: 2 May 2009 | 1:42:40 UTC - in response to Message 9183.
Last modified: 2 May 2009 | 1:43:33 UTC

Ok, be advised that 6.6.25 still has the "debt" bug and in a day or two (more if you are lucky) you will see your normal cache of 4 GPU Grid tasks shrink.

You have to set the reset debts flag in cc_config and stop and restart BOINC. You will get a fresh load of tasks; rinse and repeat.

On my i7 I have to do this as often as every 24 hours, though more typically every 36-48 hours ... on my Q9300 it took a couple of weeks to get totally snarled up.
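
For anyone who has not edited the client config before, here is a minimal sketch of what that file could look like, generated with Python's standard library. It assumes the debt-reset flag is the `zero_debts` option inside `<options>` of `cc_config.xml`; that option name is an assumption on my part and is not stated in this thread, so check the client configuration documentation for your BOINC version. The helper name `build_cc_config` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_cc_config(zero_debts: bool = True) -> str:
    """Return the text of a minimal cc_config.xml with a debt-reset flag.

    The option name 'zero_debts' is an assumption; verify it against the
    configuration documentation for the BOINC client version you run.
    """
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    # "1" asks the client to reset debts when it starts; "0" leaves them alone.
    ET.SubElement(options, "zero_debts").text = "1" if zero_debts else "0"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Print the XML so it can be pasted into cc_config.xml by hand.
    print(build_cc_config())
```

The printed text would go into `cc_config.xml` in the BOINC data directory, followed by stopping and restarting the client as described above. If the flag really does act on every startup, it is probably worth setting it back to 0 once the cache has refilled.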

{edit}
Other people have had the opposite problem: they can't keep the right amount of CPU work on hand. Not sure why one or the other ...

X1900AIW
Joined: 12 Sep 08
Posts: 74
Credit: 23,566,124
RAC: 0
Message 9190 - Posted: 2 May 2009 | 8:19:14 UTC - in response to Message 8230.

I did find this small app to perform CUDA based memory checks on the card.

http://www.softpedia.com/progDownload/CUDA-MemTest-Download-121066.html

Though my card passes the tests.


There is another CUDA test program (MemtestG80) from the folding forum, but I fear testing won't solve your problem.
Announcing release: standalone memory tester for NVIDIA GPUs

More details in the readme.txt. (Up to now you must register to download the software.)

ExtraTerrestrial Apes
Message 9191 - Posted: 2 May 2009 | 10:51:42 UTC - in response to Message 9186.

Not sure why one or the other ...


Could be related to resource share and the number of attached projects [which get either their CPU or GPU debts whacked].

MrS

Paul D. Buck
Message 9195 - Posted: 2 May 2009 | 11:50:09 UTC - in response to Message 9191.

Not sure why one or the other ...


Could be related to resource share and the number of attached projects [which get either their CPU or GPU debts whacked].

Riotous night. I submitted what seemed like 20 posts to the lists this night. Hopefully with enough detail that they will finally start to correct some of the issues.

I did a long tacking session starting about 6 last night and finally proved that the code I told them was wrong is wrong. There are two ways to fix it; let's see what happens with that. If they opt for the change, it should make machines with 4 or more CPUs more stable in scheduling work, so they stop starting a task, working on it for only a few seconds, and then switching to another task that was just downloaded.

Also found at least two issues with IBERCIVIS tasks: one is going to have to be corrected by the project; the other is a flaw in BOINC when the DCF is 100 or more ...

Oh, and found more evidence of the debt imbalance problems. ... which is related to what we are talking about here in this note.
