
Message boards : Number crunching : MJHarvey problems or 8800 GT and 295 driver problems?

wiyosaya
Joined: 22 Nov 09
Posts: 114
Credit: 589,114,683
RAC: 0
Message 25022 - Posted: 13 May 2012 | 3:59:17 UTC
Last modified: 13 May 2012 | 4:00:19 UTC

I've been running tasks from the short queue this weekend on my 8800 GT. Most of them have been MJHarvey, and only one has completed successfully. Some of these WUs appear to be failing on multiple machines.

Here are the links to the status pages of the ones that failed:

http://www.gpugrid.net/workunit.php?wuid=3412132
http://www.gpugrid.net/workunit.php?wuid=3412171
http://www.gpugrid.net/workunit.php?wuid=3412642
http://www.gpugrid.net/workunit.php?wuid=3415521
____________

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 25028 - Posted: 13 May 2012 | 10:22:46 UTC - in response to Message 25022.
Last modified: 13 May 2012 | 10:54:17 UTC

I see you swapped out a GTX580 and replaced it with an 8800 GT!
Something probably worth mentioning: looking at your logs, it appears that your 8800 GT was running long runs faster than short runs, but that's only because the long runs were done on the GTX580, and you have since switched to a card that is not recommended (the 8800 GT).
I would expect the 8800 GT to fail tasks.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

wiyosaya
Joined: 22 Nov 09
Posts: 114
Credit: 589,114,683
RAC: 0
Message 25052 - Posted: 13 May 2012 | 22:37:56 UTC - in response to Message 25028.

I'd buy your argument if it were not for these tasks:
http://www.gpugrid.net/workunit.php?wuid=3412642
http://www.gpugrid.net/workunit.php?wuid=3412171
3412642 has failed completely on _all_ computers, running various GPUs including the GTX 295, 460, 480, 560 and 560 Ti. The other has failed on several machines, including 460s and 480s. It is clear that the problems with 3412642 are not GPU dependent.

____________

wiyosaya
Joined: 22 Nov 09
Posts: 114
Credit: 589,114,683
RAC: 0
Message 25056 - Posted: 14 May 2012 | 3:11:53 UTC

BTW - Before I started this thread, and before you were so kind as to rename it, I searched for others having difficulty with MJHarvey work units, as I first wondered whether there might be something going on at GPUGrid that would make 8800 GTs incompatible with the work being sent out. How about you? Did you search before renaming the thread, or even bother to look at the status of the WUs I posted and see that some of those tasks have failed on other GPUs?

My search revealed other users having problems with MJHarvey work units. In fact, there is one user, new to GPUGrid, who has processed only MJHarvey work units, and virtually all of that user's work units have failed. Note that this user is not running an 8800 GT, and if I were that user, I would be questioning whether GPUGrid is something I want to be running.

If the project wants to stop accepting work from 8800 GTs or other older cards, then the project is free to do so. I imagine it is not difficult to do.

In my opinion, the older cards are the ones most suited to the short queue tasks, and the newer, more capable cards are much better suited to the long queue tasks.

Now that the focus of this thread has been redirected to question the card rather than the work unit, perhaps it will take longer for the MJHarvey work units to get completed. To me, it looks like there are problems with these work units since so many of them are failing regardless of the GPU that is processing them.
____________

5pot
Joined: 8 Mar 12
Posts: 411
Credit: 2,083,882,218
RAC: 0
Message 25057 - Posted: 14 May 2012 | 4:30:59 UTC

Hi, the majority of the 3412642 failures were caused by the wrong driver and by not setting the monitor to never sleep. The other 3 I'm not sure about. I may attach my 570 tomorrow to try to grab some and see what's up.

We'll see if the researchers mention anything tomorrow.

dskagcommunity
Joined: 28 Apr 11
Posts: 456
Credit: 817,790,789
RAC: 0
Message 25061 - Posted: 14 May 2012 | 7:58:47 UTC
Last modified: 14 May 2012 | 8:01:35 UTC

It's not new that many of the newer MJHarvey tasks (which had an error in the computing time) are failing, because an admin couldn't abort them correctly. Believe me, you should be happy that they now error out instantly; at the beginning they computed at 1% per hour (on fast cards!) for nothing. By the way, I'm with you on excluding all CC1.1 cards: you get more errors on them than good results, because they are not supported anymore. I moved my 9800 GTXs and 8800 GT over to SETI and Einstein because running them here was a waste of energy, card lifetime and time ;)
____________
DSKAG Austria Research Team: http://www.research.dskag.at



Crunching for my deceased Dog who had "good" Braincancer..

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 25066 - Posted: 14 May 2012 | 12:03:53 UTC - in response to Message 25056.

I have read all the threads in the forum, and I am very much aware of the issues with CC1.1 cards; there are years of discussion about this. CC1.1 cards mostly fail work units here. The cards just don't handle the new code well.

When you look at each failed MJHarvey task, you will find that most of the systems are using known bad drivers (295 and 296).
You are always going to get a few errors here and there, and one of the tasks even had a download error. At best, this makes it difficult to say whether there are issues with the present MJHarvey tasks. Only the researchers can look at the stats and know for sure whether the failure rate of these MJHarvey tasks has increased. I have seen a few failures on what look like reasonably good systems, but on those systems one or two other tasks also failed, and most MJHarvey tasks completed successfully. Rest assured that the researchers do keep an eye on the success rates.

I think it's likely that when the new app is released (following ongoing testing) we will move to CC1.3 and above only, so CC1.1 cards will no longer get tasks. When we move to the CUDA 4.2 app, we will need recent drivers. Unfortunately, the 295 drivers are capable of running CUDA 4.2 apps. It might be the case that crunchers with the 295 and 296 drivers will not be granted work, but it's not as if these drivers don't work; they just have a bug, so they need to be used with the workaround. It might be worth considering including code in the app to configure this workaround.

Broadly speaking, older cards are less powerful than newer cards, so yes, older cards are more suited to the shorter tasks, but one of the reasons they are more suited is that they are more prone to failing tasks. The risk of failure compounds with run time: the chance of a task finishing falls off exponentially the longer it runs. That said, the normal-length tasks are really for any low- to mid-range card, new or old, and they can of course be run on high-end cards (a bit of cruncher choice). Anyone with a high-end card that only crunches part time would be better off running shorter tasks.
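To put rough numbers on why longer tasks are riskier for failure-prone cards, here is a minimal sketch. It assumes a card fails at a constant, independent rate per hour of crunching (a simplifying assumption, not how GPUGrid measures anything), so the chance of finishing a task is exp(-rate × hours). The rate of 0.02 per hour and the 8/24/48-hour run times are hypothetical values chosen for illustration, not project statistics.

```python
import math


def completion_probability(failure_rate_per_hour: float, run_hours: float) -> float:
    """Chance that a task finishes without error.

    Assumes a constant, independent failure rate per hour (a simple
    exponential model used only for illustration).
    """
    if failure_rate_per_hour < 0 or run_hours < 0:
        raise ValueError("rate and run time must be non-negative")
    return math.exp(-failure_rate_per_hour * run_hours)


# Hypothetical rate: 0.02 failures per hour (roughly a 2% chance of an
# error in any given hour of crunching).
RATE = 0.02

if __name__ == "__main__":
    for hours in (8, 24, 48):
        p = completion_probability(RATE, hours)
        print(f"{hours:>2} h task: {p:.0%} chance of finishing")
```

Real causes of failure (drivers, screen savers, card age) are neither constant nor independent, so treat this only as an illustration of why a shaky card loses more work on long runs than on short ones.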
____________

wiyosaya
Joined: 22 Nov 09
Posts: 114
Credit: 589,114,683
RAC: 0
Message 25071 - Posted: 14 May 2012 | 14:16:18 UTC - in response to Message 25066.

My apologies. I did not realize that these work units were the ones that were incorrectly aborted.

FYI - I was testing the 580 in this machine because I had not yet gotten all the components for the new build I am working on. The 580 went into the new build this weekend, and the new build should be coming online in the next week or so.

Prior to testing the 580 in that machine for the past two weeks, I had virtually no problems running the 8800 GT on GPUGrid's short queue; however, I expect that machine's task history has aged past the point where the successful WUs were shown. Personally, I think 8800 GTs can still do useful work on the short-queue tasks; like each of you, I have no control over the fact that a number of bad WUs were basically left in the queue by an improper abort.

Interestingly enough, the 8800 GT completed one of those MJHarvey WUs this weekend. If you are interested in reviewing its status, it is here. It completed in a reasonable amount of time. IIRC, when the bad tasks were first aborted, I wondered why, since at least one had completed on my 8800 GT in a very reasonable amount of time; I don't remember the exact time it took, but it was much less than 24 hours, and it may have been as low as 8 hours. Maybe there's some useful information there, maybe not; maybe it's too late to do anything about it or learn something useful for future tasks.

I am sure you noted, too, that on this machine, I have installed the 301.24 drivers; at least in my case, that should rule out driver incompatibilities.

On computation errors in GPUGrid WUs generally, one thing I have noticed is that if the BOINC screen saver is running, it will cause errors in GPUGrid WUs no matter the graphics card. I don't run screen savers at all on either of the machines running GPUGrid WUs, and I rarely have WUs fail with computation errors. I won't be running a screen saver on the new build, either.
____________

dskagcommunity
Joined: 28 Apr 11
Posts: 456
Credit: 817,790,789
RAC: 0
Message 25072 - Posted: 14 May 2012 | 15:36:02 UTC - in response to Message 25071.
Last modified: 14 May 2012 | 15:36:21 UTC



Quoting wiyosaya: "...I don't remember the exact time it took, but it was much less than 24 hours, and it may have been as low as 8 hours..."



Not all WUs fail, but from my own earlier experience I would say over 50% do. There were times when they completed 100% of the time, but the 20-23 h short TONI WUs were where the first 8xxx and 9xxx cards began to fail occasionally. Then there was a short period where it went back to 100% working, but then the failures began again and have continued until today. And yes, unfortunately, the MJHarveys took only around 8 hours, you're right, and worked 100%. But OK, when there is no support left for a card series, it may work or it may not, and nobody will change things to make it work again ;) Perhaps try again with the next MJHarvey series, or when KASHIF WUs are available again; they worked well too.
____________

wiyosaya
Joined: 22 Nov 09
Posts: 114
Credit: 589,114,683
RAC: 0
Message 25073 - Posted: 14 May 2012 | 16:16:24 UTC - in response to Message 25072.

For quite a while, I stopped running GPUGrid on the 8800 GT. However, a few weeks before the bad MJHarveys, I started running the 8800 GT on the short queue again. I had few, if any, failures. I did not note what series the WUs were.

Maybe when the MJHarveys run out, I'll give it another try. It seems there are still bad MJHarveys in the queue, given that they are failing on any GPU. Does anyone know for sure?

Thanks.
____________
