Message boards : Graphics cards (GPUs) : Recent problems for WUs on older GPUs

Profile GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 14 Mar 07
Posts: 1947
Credit: 629,356
RAC: 0
Message 9642 - Posted: 11 May 2009 | 17:05:36 UTC

We are having problems with several workunits and GPUs which are NOT 260/275/285/295. As we test on newer cards, we have not spotted the problem before.

The problem appears only for workunits using Amber format (all the KASHIF ones).

We have now removed all the workunits that we could, but left some KASHIF ones in the queue, as they run just fine on newer cards.

We are testing KASHIF_HIV_* on two 8800 cards under Windows and Linux; they are running fine so far.

We will keep you updated.

gdf

Blackbird74
Joined: 20 Nov 08
Posts: 3
Credit: 362,118
RAC: 0
Message 9670 - Posted: 12 May 2009 | 12:38:36 UTC - in response to Message 9642.
Last modified: 12 May 2009 | 12:39:03 UTC

I had a bunch of compute errors on my 8800GT, but then the latest KASHIF_HIVPR completed OK over a couple of days.
Full task list: http://www.gpugrid.net/results.php?userid=9833
Latest KASHIF_HIVPR WU completed fine: http://www.gpugrid.net/workunit.php?wuid=449234
Comp specs: http://www.gpugrid.net/show_host_detail.php?hostid=17613

There doesn't seem to be much rhyme or reason to the failures, other than the recent problems with WUs in general (blackout).

Profile GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 14 Mar 07
Posts: 1947
Credit: 629,356
RAC: 0
Message 9673 - Posted: 12 May 2009 | 13:17:43 UTC - in response to Message 9670.
Last modified: 12 May 2009 | 13:36:30 UTC

Today the error rate for Kashif wus is lower, so it could have been a problem with drivers.

In the next few days we will perform a server update and application updates to use CUDA2.2.

gdf

TomaszPawel
Joined: 18 Aug 08
Posts: 121
Credit: 59,836,411
RAC: 0
Message 9678 - Posted: 12 May 2009 | 19:46:57 UTC - in response to Message 9673.

2009-05-12 15:08:32 GPUGRID Starting task 4-KASHIF_HIVPRFE_dim_ba1-2-4-RND6858_0 using acemd version 664

Hmm, it is now at 39.8% after 6:40 h, and it says 10 h remain...

Is that normal on a GTX 260 with driver 182.08, BOINC 6.6.20 and XP 32-bit?
____________
POLISH NATIONAL TEAM - Join! Crunch! Win!
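
The progress figures quoted in the post above can be sanity-checked with a simple linear extrapolation. This is only a sketch: it assumes the WU advances at a constant rate, and the function name `estimate_remaining_hours` is made up for the example and is not part of BOINC or acemd.

```python
def estimate_remaining_hours(elapsed_hours: float, fraction_done: float) -> float:
    """Linear extrapolation of the time still needed.

    Assumes the work done so far was completed at a constant rate,
    which is an assumption and not how the BOINC client is known to estimate.
    """
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in the interval (0, 1]")
    estimated_total = elapsed_hours / fraction_done
    return estimated_total - elapsed_hours


# Figures from the post: 39.8% done after 6 h 40 min.
elapsed = 6 + 40 / 60
remaining = estimate_remaining_hours(elapsed, 0.398)
print(f"estimated total: {elapsed + remaining:.1f} h, remaining: {remaining:.1f} h")
```

With those inputs the extrapolation gives roughly 10 h remaining, which is in line with the estimate the client displayed.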

Profile Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Message 9682 - Posted: 12 May 2009 | 20:36:42 UTC - in response to Message 9673.
Last modified: 12 May 2009 | 20:39:12 UTC

Today the error rate for Kashif wus is lower, so it could have been a problem with drivers.

Or people avoiding them like the plague. People on our team have been reporting stuck and failed WUs like never before.

In the next few days we will perform a server update and application updates to use CUDA2.2.

Will we still be able to use our older non-CUDA2.2 cards?

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 9687 - Posted: 12 May 2009 | 21:34:54 UTC - in response to Message 9682.

Will we still be able to use our older non-CUDA2.2 cards?


That's just the software version and depends on the driver. There's also the CUDA hardware capability, which is the critical one. This one *should* stay as it was before (minimum of 1.1 required).
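
To illustrate the two separate checks described here, the sketch below treats both the hardware compute capability and the CUDA software version as (major, minor) pairs. The minimum of 1.1 comes from the post; the function names are hypothetical, and the capability values for the example cards are the ones I recall from NVIDIA's published tables, so verify them before relying on them.

```python
from typing import Tuple

Version = Tuple[int, int]

# Minimum hardware compute capability, as stated in the post.
MIN_COMPUTE_CAPABILITY: Version = (1, 1)


def meets_hardware_requirement(compute_capability: Version,
                               minimum: Version = MIN_COMPUTE_CAPABILITY) -> bool:
    """Hardware check: fixed by the GPU chip, independent of the driver."""
    return compute_capability >= minimum


def driver_supports_runtime(driver_max_cuda: Version, app_cuda: Version) -> bool:
    """Software check: the installed driver must support the CUDA version
    the application was built against (hypothetical helper)."""
    return driver_max_cuda >= app_cuda


# Example cards (capability values are an assumption to be verified).
cards = {
    "GeForce 8800 GTX": (1, 0),
    "GeForce 8800 GT": (1, 1),
    "GeForce GTX 260": (1, 3),
}
for name, capability in cards.items():
    print(name, "OK" if meets_hardware_requirement(capability) else "too old")
```

The point of keeping the two functions apart is that a driver upgrade can change the result of the second check but never the first.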

Tomasz, your GTX 260 is not exactly an older card (as stated in the first post of this thread).

MrS
____________
Scanning for our furry friends since Jan 2002

TomaszPawel
Joined: 18 Aug 08
Posts: 121
Credit: 59,836,411
RAC: 0
Message 9689 - Posted: 12 May 2009 | 22:09:10 UTC - in response to Message 9687.

It is as clear as crystal...

But it usually crunches a WU in 7-8 h, not 17!

And this type of WU is mentioned in this thread, so maybe it is relevant?
____________
POLISH NATIONAL TEAM - Join! Crunch! Win!

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 9691 - Posted: 12 May 2009 | 22:23:59 UTC - in response to Message 9689.

Alright.. could be the *usual* 6.6.20 bug.

MrS
____________
Scanning for our furry friends since Jan 2002

Profile Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Message 9693 - Posted: 12 May 2009 | 23:21:21 UTC - in response to Message 9691.

Alright.. could be the *usual* 6.6.20 bug.

MrS

Sadly, I may have seen it on a task processed with 6.6.23. That would mean the real problem has not been addressed: the changes in 6.6.23 and later make it better, but have not cured it.

The Brain QC
Joined: 27 Oct 08
Posts: 27
Credit: 3,211,916
RAC: 0
Message 9699 - Posted: 13 May 2009 | 8:58:22 UTC - in response to Message 9693.

I have 5-KASHIF_HIVPR_dim_ba1-4-100-RND6112_0 (acemd version 664), which has been running for 21 hours on a 9800 GX2 and is 68% done. I have never had such a long WU on GPUGRID; usually I complete 3 or 4 WUs in 21 hours. I hope the credit will be as great as the time it takes to compute ;).

MarkJ
Volunteer moderator
Volunteer tester
Joined: 24 Dec 08
Posts: 738
Credit: 200,909,904
RAC: 0
Message 9701 - Posted: 13 May 2009 | 9:47:29 UTC - in response to Message 9673.
Last modified: 13 May 2009 | 9:58:36 UTC

In the next few days we will perform a server update and application updates to use CUDA2.2.

gdf


So do we need to upgrade to the 185.85 drivers and the CUDA 2.2 DLLs? Or will the app work out which CUDA version is present and only use the instruction set that is supported?

Will GPUGRID download the CUDA 2.2 DLLs, or will we need to put them somewhere (like the projects\gpugrid folder) when the new app is released?

Oh, and seeing as you are changing the app, is there a chance you could report the driver version and the CUDA version in the WU info? It might help with the debugging.

<core_client_version>6.6.28</core_client_version>
<![CDATA[
<stderr_txt>
# Using CUDA device 0
# Device 0: "GeForce GTS 250"
# Clock rate: 1836000 kilohertz
# Total amount of global memory: 536543232 bytes
# Number of multiprocessors: 16
# Number of cores: 128
# Amber: readparm : Reading parm file parameters
# PARM file in AMBER 7 format
# Encounter 10-12 H-bond term
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
WARNING: parameters.cu, line 568: Found zero 10-12 H-bond term.
MDIO ERROR: cannot open file "restart.coor"
# Time per step: 46.163 ms
# Approximate elapsed time for entire WU: 46163.094 s
called boinc_finish

</stderr_txt>
]]>
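
The two timing lines in the stderr output above are related: dividing the approximate total time by the time per step gives the number of steps implied for the WU. A minimal sketch, assuming only the two line formats shown in the log (the regular expressions and the function name `parse_timing` are my own, not something acemd provides):

```python
import re

# The two timing lines copied from the stderr output in the post.
STDERR_EXCERPT = """\
# Time per step: 46.163 ms
# Approximate elapsed time for entire WU: 46163.094 s
"""


def parse_timing(stderr_text: str):
    """Return (milliseconds per step, estimated total seconds) from the log text."""
    step = re.search(r"# Time per step:\s*([\d.]+)\s*ms", stderr_text)
    total = re.search(r"# Approximate elapsed time for entire WU:\s*([\d.]+)\s*s",
                      stderr_text)
    if step is None or total is None:
        raise ValueError("timing lines not found")
    return float(step.group(1)), float(total.group(1))


ms_per_step, total_seconds = parse_timing(STDERR_EXCERPT)
implied_steps = total_seconds / (ms_per_step / 1000.0)
total_hours = total_seconds / 3600.0
print(f"implied steps: {implied_steps:,.0f}")
print(f"estimated run time: {total_hours:.1f} h")
```

For this GTS 250 the figures imply about one million steps and a run time of a little under 13 hours.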

____________
BOINC blog

uBronan
Joined: 1 Feb 09
Posts: 139
Credit: 575,023
RAC: 0
Message 9707 - Posted: 13 May 2009 | 12:12:15 UTC
Last modified: 13 May 2009 | 12:18:19 UTC

Well, as always I downloaded both the driver and the CUDA toolkit from the NVIDIA site.
After the initial pause on GPUGRID, I can report that I have not had a failing unit for a few days now.
I'm not sure whether anyone else downloads the CUDA toolkit or just the driver.
I am almost done with the test unit, which finishes in about half an hour or so.
I hope the newly received IBUCH ones will also finish without issues.
If they all finish without error, I will start to get the feeling the problems are solved... I hope :D

Profile Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Message 9711 - Posted: 13 May 2009 | 13:42:36 UTC - in response to Message 9673.
Last modified: 13 May 2009 | 14:07:26 UTC

Today the error rate for Kashif wus is lower, so it could have been a problem with drivers.

In the next few days we will perform a server update and application updates to use CUDA2.2.

gdf

Just finished looking at a LOT of KASHIF_HIVPR WUs. The situation is not improving at all and is not a driver issue. What happens is these WUs are downloaded and either fail or are aborted repeatedly until they happen to be assigned to a GTX 260 or above, then they complete. The problem is not fixed and is not improving. IMO it needs to be dealt with ASAP.

Here's just a few examples:

http://www.gpugrid.net/workunit.php?wuid=440561
http://www.gpugrid.net/workunit.php?wuid=442250
http://www.gpugrid.net/workunit.php?wuid=454479
http://www.gpugrid.net/workunit.php?wuid=449101
http://www.gpugrid.net/workunit.php?wuid=457871
http://www.gpugrid.net/workunit.php?wuid=458509
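
The pattern described above (a WU bouncing between hosts until it lands on a fast card) can be quantified by counting the unsuccessful attempts before the first success. The sketch below uses entirely hypothetical attempt records; the real data would have to be read from the workunit pages linked above, and the outcome labels are my own.

```python
from typing import List, Optional, Tuple

# One attempt at a workunit: (GPU model, outcome). Outcome labels are illustrative.
Attempt = Tuple[str, str]


def attempts_before_success(history: List[Attempt]) -> Optional[int]:
    """Number of failed or aborted attempts before the first success,
    or None if the workunit never completed."""
    for index, (_gpu, outcome) in enumerate(history):
        if outcome == "success":
            return index
    return None


# Hypothetical history, NOT taken from any real workunit.
example_history = [
    ("8800 GT", "error"),
    ("9600 GSO", "aborted"),
    ("GTX 260", "success"),
]
print(attempts_before_success(example_history))
```

Collected over many workunits, a count like this would show whether the wasted attempts are really concentrated on the older cards.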

TomaszPawel
Joined: 18 Aug 08
Posts: 121
Credit: 59,836,411
RAC: 0
Message 9713 - Posted: 13 May 2009 | 14:40:23 UTC - in response to Message 9711.
Last modified: 13 May 2009 | 14:41:25 UTC

2009-05-12 15:08:32 GPUGRID Starting task 4-KASHIF_HIVPRFE_dim_ba1-2-4-RND6858_0 using acemd version 664

Hmm, it is now at 39.8% after 6:40 h, and it says 10 h remain...

Is that normal on a GTX 260 with driver 182.08, BOINC 6.6.20 and XP 32-bit?


Well, it has now been crunching that WU for 18 h and it is at 83%; it says 3 h 30 min remaining...
____________
POLISH NATIONAL TEAM - Join! Crunch! Win!

Profile GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 14 Mar 07
Posts: 1947
Credit: 629,356
RAC: 0
Message 9714 - Posted: 13 May 2009 | 14:52:02 UTC - in response to Message 9713.

CUDA 2.2 libs will be distributed with the application, but you will need to upgrade the driver to the latest 185 version.

gdf

Profile mike047
Joined: 21 Dec 08
Posts: 47
Credit: 7,330,049
RAC: 0
Message 9715 - Posted: 13 May 2009 | 15:01:22 UTC - in response to Message 9714.

CUDA 2.2 libs will be distributed with the application, but you will need to upgrade the driver to the latest 185 version.

gdf



Are you saying that without version 185 drivers we will not be able to do GPUGRID work successfully?

I have card/box combinations that will not accept that version and run properly.

If 185 version driver and above is "required" to crunch here, I will be taking my farm to FAH.
____________
mike

uBronan
Joined: 1 Feb 09
Posts: 139
Credit: 575,023
RAC: 0
Message 9716 - Posted: 13 May 2009 | 15:06:05 UTC

The test unit also ended without problems, and new IBUCHs are on the way.
I haven't had any cancelled, other than one that sat in the queue for almost 2 days, so nothing special there.
I am still running 185.85 and BOINC 6.6.28.
Apart from the usual BOINC issues like fetch and such, it runs stably for me; my slow 9600 GT seems to do well.
But I had to lower my CPU clock since I had to disable my watercooling; the 9850 BE is a hothead, because even with the huge cooler on it reaches 55 C.
But it is extremely warm here today; I measured an ambient temperature of 31 C in the room.

Profile Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Message 9717 - Posted: 13 May 2009 | 15:31:00 UTC - in response to Message 9716.

The test unit also ended without problems, and new IBUCHs are on the way.
I haven't had any cancelled, other than one that sat in the queue for almost 2 days, so nothing special there.
I am still running 185.85 and BOINC 6.6.28.
Apart from the usual BOINC issues like fetch and such, it runs stably for me; my slow 9600 GT seems to do well.
But I had to lower my CPU clock since I had to disable my watercooling; the 9850 BE is a hothead, because even with the huge cooler on it reaches 55 C.
But it is extremely warm here today; I measured an ambient temperature of 31 C in the room.

Your computers are hidden so how can we verify?

Profile Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Message 9719 - Posted: 13 May 2009 | 17:06:28 UTC - in response to Message 9711.
Last modified: 13 May 2009 | 17:09:56 UTC

Today the error rate for Kashif wus is lower, so it could have been a problem with drivers.

In the next few days we will perform a server update and application updates to use CUDA2.2.

gdf

Just finished looking at a LOT of KASHIF_HIVPR WUs. The situation is not improving at all and is not a driver issue. What happens is these WUs are downloaded and either fail or are aborted repeatedly until they happen to be assigned to a GTX 260 or above, then they complete. The problem is not fixed and is not improving. IMO it needs to be dealt with ASAP.

Here's just a few examples:

http://www.gpugrid.net/workunit.php?wuid=440561
http://www.gpugrid.net/workunit.php?wuid=442250
http://www.gpugrid.net/workunit.php?wuid=454479
http://www.gpugrid.net/workunit.php?wuid=449101
http://www.gpugrid.net/workunit.php?wuid=457871
http://www.gpugrid.net/workunit.php?wuid=458509

Here's a new KASHIF_HIVPR that was just downloaded to me (and which I aborted). Notice that it just caused an error on a GTX 260 (after running a long time, I might add).

http://www.gpugrid.net/workunit.php?wuid=459189

That same GTX 260 has only 3 recent failed WUs, all of them KASHIF_HIVPR. Take a look for yourself:

http://www.gpugrid.net/results.php?hostid=32169

It sure looks like the KASHIF_HIVPR problem also bites the faster cards, just not as often. Our team members have also been reporting the same problem on the GTX 260 and above. So it's documented. Any chance of getting this fixed?

TomaszPawel
Joined: 18 Aug 08
Posts: 121
Credit: 59,836,411
RAC: 0
Message 9720 - Posted: 13 May 2009 | 18:14:44 UTC - in response to Message 9713.
Last modified: 13 May 2009 | 18:16:05 UTC

2009-05-12 15:08:32 GPUGRID Starting task 4-KASHIF_HIVPRFE_dim_ba1-2-4-RND6858_0 using acemd version 664

Hmm, it is now at 39.8% after 6:40 h, and it says 10 h remain...

Is that normal on a GTX 260 with driver 182.08, BOINC 6.6.20 and XP 32-bit?


Well, it has now been crunching that WU for 18 h and it is at 83%; it says 3 h 30 min remaining...

LOL, after 24 h of crunching: 3600 points...
____________
POLISH NATIONAL TEAM - Join! Crunch! Win!

Profile Bymark
Joined: 23 Feb 09
Posts: 30
Credit: 5,897,921
RAC: 0
Message 9721 - Posted: 13 May 2009 | 18:20:00 UTC - in response to Message 9719.

Yep, the best setup for a 260 is BOINC 6.4.7, driver 178.28 and CUDA 2.
Working fine...



Today the error rate for Kashif wus is lower, so it could have been a problem with drivers.

In the next few days we will perform a server update and application updates to use CUDA2.2.

gdf

Just finished looking at a LOT of KASHIF_HIVPR WUs. The situation is not improving at all and is not a driver issue. What happens is these WUs are downloaded and either fail or are aborted repeatedly until they happen to be assigned to a GTX 260 or above, then they complete. The problem is not fixed and is not improving. IMO it needs to be dealt with ASAP.

Here's just a few examples:

http://www.gpugrid.net/workunit.php?wuid=440561
http://www.gpugrid.net/workunit.php?wuid=442250
http://www.gpugrid.net/workunit.php?wuid=454479
http://www.gpugrid.net/workunit.php?wuid=449101
http://www.gpugrid.net/workunit.php?wuid=457871
http://www.gpugrid.net/workunit.php?wuid=458509

Here's a new KASHIF_HIVPR that was just downloaded to me (and which I aborted). Notice that it just caused an error on a GTX 260 (after running a long time, I might add).

http://www.gpugrid.net/workunit.php?wuid=459189

That same GTX 260 has only 3 recent failed WUs, all of them KASHIF_HIVPR. Take a look for yourself:

http://www.gpugrid.net/results.php?hostid=32169

It sure looks like the KASHIF_HIVPR problem also bites the faster cards, just not as often. Our team members have also been reporting the same problem on the GTX 260 and above. So it's documented. Any chance of getting this fixed?


____________
"Silakka"
Hello from Turku > Åbo.

Alain Maes
Joined: 8 Sep 08
Posts: 63
Credit: 1,138,864,959
RAC: 127,062
Message 9722 - Posted: 13 May 2009 | 18:31:01 UTC - in response to Message 9714.
Last modified: 13 May 2009 | 18:31:51 UTC

Ubuntu 9.04 comes standard with driver version 180.44, which so far has avoided any need to fiddle with manual interventions.

Will Ubuntu follow before or after GPUGRID decides to require the version 185 drivers? If a manual update is required from the Linux community, please advise in advance.

Many thanks

Kind regards

Alain

Profile K1atOdessa
Joined: 25 Feb 08
Posts: 249
Credit: 370,186,977
RAC: 0
Message 9723 - Posted: 13 May 2009 | 19:06:03 UTC - in response to Message 9719.

That same GTX 260 has only 3 recent failed WUs, all of them KASHIF_HIVPR.


It looks like you have a GTX 260 and an 8800GT. All three tasks failed while running on the 8800GT (device 1), not on the GTX 260.

Profile K1atOdessa
Joined: 25 Feb 08
Posts: 249
Credit: 370,186,977
RAC: 0
Message 9724 - Posted: 13 May 2009 | 19:12:45 UTC

In light of the issues with the older GPUs and the KASHIF_HIVPR WUs, what is the best version of the NVIDIA driver to use?

I have been aborting them when I see them, to get them over to a 200-series card as quickly as possible. I don't think it is beneficial for the project for me to let one sit in my queue for 12 hours, then run for another several before failing anyway.

I'd prefer not to babysit, so should I roll back my current 185.66 to the last WHQL approved non-185.xx driver, which is 182.50?

I guess I could just try this and report the results, but I wanted to know if anyone has already tried this 182.50 driver w/ an older (non-200-series) card.

Profile Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Message 9725 - Posted: 13 May 2009 | 20:21:33 UTC - in response to Message 9723.

That same GTX 260 has only 3 recent failed WUs, all of them KASHIF_HIVPR.


It looks like you have a GTX 260 and an 8800GT. All three tasks failed while running on the 8800GT (device 1), not on the GTX 260.

You're right. Not my machine and I didn't see the 2 cards. But OK here's an example from a machine with only a GTX 260:

http://www.gpugrid.net/result.php?resultid=663665

Profile K1atOdessa
Joined: 25 Feb 08
Posts: 249
Credit: 370,186,977
RAC: 0
Message 9727 - Posted: 13 May 2009 | 21:16:45 UTC - in response to Message 9725.

That same GTX 260 has only 3 recent failed WUs, all of them KASHIF_HIVPR.


It looks like you have a GTX 260 and an 8800GT. All three tasks failed while running on the 8800GT (device 1), not on the GTX 260.

You're right. Not my machine and I didn't see the 2 cards. But OK here's an example from a machine with only a GTX 260:

http://www.gpugrid.net/result.php?resultid=663665



:-) That one reports as "Aborted by user". So I don't think it errored out under normal circumstances -- it was manually aborted.

Profile mike047
Joined: 21 Dec 08
Posts: 47
Credit: 7,330,049
RAC: 0
Message 9728 - Posted: 13 May 2009 | 21:36:49 UTC - in response to Message 9715.

CUDA 2.2 libs will be distributed with the application, but you will need to upgrade the driver to the latest 185 version.

gdf



Are you saying that without version 185 drivers we will not be able to do GPUGRID work successfully?

I have card/box combinations that will not accept that version and run properly.

If 185 version driver and above is "required" to crunch here, I will be taking my farm to FAH.



Is this query unworthy of an answer?
____________
mike

MarkJ
Volunteer moderator
Volunteer tester
Joined: 24 Dec 08
Posts: 738
Credit: 200,909,904
RAC: 0
Message 9730 - Posted: 13 May 2009 | 21:56:37 UTC - in response to Message 9714.

CUDA 2.2 libs will be distributed with the application, but you will need to upgrade the driver to the latest 185 version.

gdf


Thanks.

I'd suggest a note in the news section on the home page. That way people can start organising things. I have already set GPUgrid to "no new work" so I can finish off what I have before doing the driver upgrades. I've got a few machines to do :)
____________
BOINC blog

Profile Aardvark
Joined: 27 Nov 08
Posts: 28
Credit: 82,362,324
RAC: 0
Message 9731 - Posted: 13 May 2009 | 22:07:07 UTC

Task ID 665546 had been running well along with another task. As I was about to run a program that would "use" the GPU, I decided to suspend all tasks and exit BOINC. Once I had finished, I launched BOINC again; all tasks still appeared suspended. So far so good. I then resumed all tasks, and task 665546 immediately went to "compute error". I also had another task, 652947, that had been running for 29 out of about 30 hours and failed (different machine). When I get the time, I will compile a list of the failures and successes over the past few days.

uBronan
Joined: 1 Feb 09
Posts: 139
Credit: 575,023
RAC: 0
Message 9734 - Posted: 13 May 2009 | 22:38:33 UTC

Which card/machine combinations cannot use the 185.85 version, may I ask, mike047?

Profile Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Message 9736 - Posted: 13 May 2009 | 23:19:18 UTC - in response to Message 9727.
Last modified: 13 May 2009 | 23:20:16 UTC

That same GTX 260 has only 3 recent failed WUs, all of them KASHIF_HIVPR.

It looks like you have a GTX 260 and an 8800GT. All three tasks failed while running on the 8800GT (device 1), not on the GTX 260.

You're right. Not my machine and I didn't see the 2 cards. But OK here's an example from a machine with only a GTX 260:

http://www.gpugrid.net/result.php?resultid=663665


:-) That one reports as "Aborted by user". So I don't think it errored out under normal circumstances -- it was manually aborted.

The user is one of my team members and he reported it as being stuck. It had processed for over twice as long as his other WUs and showed no progress. He was using BOINC client v6.6.28, not v6.6.20 so that wasn't the problem. :-)

Profile Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Message 9737 - Posted: 13 May 2009 | 23:49:00 UTC - in response to Message 9727.

That same GTX 260 has only 3 recent failed WUs, all of them KASHIF_HIVPR.


It looks like you have a GTX 260 and an 8800GT. All three tasks failed while running on the 8800GT (device 1), not on the GTX 260.

You're right. Not my machine and I didn't see the 2 cards. But OK here's an example from a machine with only a GTX 260:

http://www.gpugrid.net/result.php?resultid=663665


:-) That one reports as "Aborted by user". So I don't think it errored out under normal circumstances -- it was manually aborted.

Here's a bunch more for your viewing pleasure:

http://www.gpugrid.net/result.php?resultid=659111
http://www.gpugrid.net/result.php?resultid=664645
http://www.gpugrid.net/result.php?resultid=666952
http://www.gpugrid.net/result.php?resultid=647270
http://www.gpugrid.net/result.php?resultid=660927
http://www.gpugrid.net/result.php?resultid=666863


Certainly not as common as with the slower cards, but not at all hard to find.
The last 2 are test WUs...

Profile K1atOdessa
Joined: 25 Feb 08
Posts: 249
Credit: 370,186,977
RAC: 0
Message 9738 - Posted: 14 May 2009 | 0:00:01 UTC - in response to Message 9737.

@Beyond - I didn't doubt you. :-)


@GDF/Admin:

Given that these KASHIF_HIVPR WUs seem to error out a lot, especially with "older, slower" cards but occasionally with the new 200-series as well (as shown by Beyond), are no new ones going to be created?

I can understand cleaning out the queue, but I have gotten several today, and with my cards I almost certainly expect them to error out. If I catch them in my queue, I try to abort them so they can move to a 200-series card with a better chance of finishing in a timely manner.

Is there any analysis from the project on why these particular WUs are an issue? I've read comments about the drivers possibly being an issue, but given that the CUDA 2.2 software on the server will require these 185.xx drivers, I expect to continue having issues with these WUs if they are still in the queue. All others work fine.

Profile dataman
Joined: 18 Sep 08
Posts: 36
Credit: 100,352,867
RAC: 0
Message 9739 - Posted: 14 May 2009 | 2:10:31 UTC

As GPUGrid clearly does not want to put in much effort to support 8 and 9 series cards, I'm done here for now. I'd rather shut them down than to waste time and electricity in an endless circle jerk of BOINC versions and drivers. But hey, 3.7 million credits was a good run for me here. There will be a new GPU project out soon.
Sad really, as I think some of the science was worth doing here. :)
Ciao.
____________

Profile mike047
Joined: 21 Dec 08
Posts: 47
Credit: 7,330,049
RAC: 0
Message 9742 - Posted: 14 May 2009 | 6:51:56 UTC - in response to Message 9734.

Which card/machine combinations cannot use the 185.85 version, may I ask, mike047?


I don't have that information at hand presently. Basically I use Ubuntu 8.04 LTS. The 260 and 250 cards have no trouble using 180.22 and might be able to use a higher driver without issue. Some of my 8800/9600 GSO/9800 cards will not accept any driver above 177.82. All motherboards are Gigabyte P35/45.

I don't know what the issues are with this project, and I am willing to do a little work to be able to run it. BUT I am unwilling to babysit and periodically change drivers to suit a project that is becoming unwilling to respond to my queries and the queries of others.

Unfortunately I have invested in many Nvidia cards that at present cannot be used elsewhere in BOINC. FAH is the only other place that can use my cards. I have one box working there now and it has run absolutely trouble-free with NO intervention on my part. The plus for FAH is that my internet is not shut down when it has to upload; the 50+m uploads from here shut my internet down... I know that is not a project fault, but it is an issue for me.

This is a good project with good science, but it has gotten away from communicating with the participants in a timely manner. IMHO the project has slipped badly from where it was several months ago.
____________
mike

Profile JockMacMad TSBT
Joined: 26 Jan 09
Posts: 31
Credit: 3,877,912
RAC: 0
Message 9743 - Posted: 14 May 2009 | 8:15:20 UTC - in response to Message 9742.
Last modified: 14 May 2009 | 8:20:00 UTC

I can confirm my BFG GTX-260 192 Shader card is also getting a lot of these errors with 185.81.

One example
____________

Profile JockMacMad TSBT
Joined: 26 Jan 09
Posts: 31
Credit: 3,877,912
RAC: 0
Message 9751 - Posted: 14 May 2009 | 13:16:48 UTC - in response to Message 9743.

Oh and SETI has nVidia support so there is another BOINC project.
____________

Profile GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 14 Mar 07
Posts: 1947
Credit: 629,356
RAC: 0
Message 9752 - Posted: 14 May 2009 | 14:10:05 UTC - in response to Message 9751.
Last modified: 14 May 2009 | 14:18:20 UTC

We have tested with drivers 185.xx on an 8800GT. All the WUs fail.
With driver 180.xx, all WUs are fine.

So for now we can only suggest downgrading to older drivers (180.xx), which seem to work.

We have reported the issue to Nvidia.

gdf
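
As a rough illustration of the test result above, the sketch below flags a driver version string that falls in the 185.xx branch. It encodes nothing more than the single 8800GT observation reported in this post; the function names are hypothetical, and other posts in this thread report failures on other driver branches too, so treat it as a reading aid and not a compatibility rule.

```python
def driver_major(version_text: str) -> int:
    """Extract the major branch number from a string such as '185.85' or '180.xx'."""
    return int(version_text.strip().split(".")[0])


def in_reported_failing_branch(version_text: str) -> bool:
    """True if the driver is in the 185.xx branch, the one reported here as
    failing all WUs on an 8800GT. Based on this post only."""
    return driver_major(version_text) == 185


for version in ("180.44", "182.50", "185.85"):
    status = "reported failing on 8800GT" if in_reported_failing_branch(version) else "not in the 185.xx branch"
    print(version, "->", status)
```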

Profile Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Message 9761 - Posted: 14 May 2009 | 16:06:59 UTC - in response to Message 9736.
Last modified: 14 May 2009 | 16:13:19 UTC

The user is one of my team members and he reported it as being stuck. It had processed for over twice as long as his other WUs and showed no progress. He was using BOINC client v6.6.28, not v6.6.20 so that wasn't the problem. :-)

Yes, and maybe no ...

6.6.20 stunk in this regard... it really sucked swamp water ...

6.6.23 and later, *I* for one thought, fixed it ... now I am not so sure.

What I ***THINK*** happened is that most of the causes have been cleaned up ... but sometimes something bad happens. And THEN, you get a task that runs long.

There are still issues with the way that the resource scheduling is done. I am banging my head on the wall about things that *I* think I can clearly demonstrate, only to be patted on the head and told to "go 'way, you bother me"...

I mean, just last night I had five tasks all start and die in less than a second. At the moment the answer is that this is not possible. My 2,200+ log file of those two seconds notwithstanding...

Anyway, ... I am far less sanguine about how "fixed" we are ...

{edit}

An example: 12-TONI_HIVPR_mon_ba20-7-100-RND1398_0 and that was run on a 6.6.25 client ... 182.50 drivers I think at the time. 115 ms step size ...

Profile GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 14 Mar 07
Posts: 1947
Credit: 629,356
RAC: 0
Message 9830 - Posted: 16 May 2009 | 9:38:33 UTC - in response to Message 9761.

We have managed to replicate the problem on one of our machines.
This should lead to a solution soon.

Be patient.

gdf

Profile Paul D. Buck
Send message
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Level
Val
Message 9836 - Posted: 16 May 2009 | 10:42:06 UTC - in response to Message 9830.

We have managed to replicate the problem on one of our machines.
This should lead to a solution soon.

Be patient.

Oh, now we have to be patient too???? :)

It's good news, GDF ... thanks for the note.

Toby Broom
Send message
Joined: 11 Dec 08
Posts: 23
Credit: 197,557,443
RAC: 6
Level
Ile
Message 9843 - Posted: 16 May 2009 | 11:54:10 UTC

I worked out the numbers on my computers, they all run 182.50 drivers.

ID: 30829 (8800GT 256Mb) - 11% failure rate
ID: 33373 (9800GX2 512Mb) - 46% failure rate
ID: 26481 (9800GX2 & GTX260) - 29% failure rate
ID: 34636 (9800GX2 & 8800GT) - 18% failure rate

It seems strange that the 8800GT is the most reliable card given the issues. 26481 did have an issue that I know was my fault, so that's a little higher than expected.
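
For anyone who wants to work out the same numbers from their own results pages, here is a minimal Python sketch. The percentages are the ones quoted above; the helper names are made up for illustration, and the failed/total counts behind each percentage were not posted, so `failure_rate` is shown only as the assumed way such a figure would be computed.

```python
# Failure rates (percent) quoted in the post above, keyed by host ID.
quoted_rates = {
    "30829": 11,  # 8800GT 256Mb
    "33373": 46,  # 9800GX2 512Mb
    "26481": 29,  # 9800GX2 & GTX260
    "34636": 18,  # 9800GX2 & 8800GT
}


def failure_rate(failed: int, total: int) -> float:
    """Percentage of returned tasks that errored out (0.0 if none returned)."""
    if total == 0:
        return 0.0
    return 100.0 * failed / total


def rank_hosts(rates: dict) -> list:
    """Host IDs ordered from lowest to highest failure rate."""
    return sorted(rates, key=rates.get)


print(rank_hosts(quoted_rates))
```

Sorting this way puts host 30829 (the lone 8800GT) first, which matches the observation that it is the most reliable of the four.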

uBronan
Avatar
Send message
Joined: 1 Feb 09
Posts: 139
Credit: 575,023
RAC: 0
Level
Gly
Message 9844 - Posted: 16 May 2009 | 11:55:45 UTC
Last modified: 16 May 2009 | 11:57:23 UTC

Hmm, I am not convinced it's just the drivers. I started under Win XP with the 182.50 driver and BOINC 6.6.28, but again I see the IBUCH unit hang at 64.688% for more than an hour after 13 hours of calculation.
So I am starting to believe this one is going to crash as well.

Matteo
Send message
Joined: 30 Mar 09
Posts: 1
Credit: 176,953
RAC: 0
Level

Message 9849 - Posted: 16 May 2009 | 12:25:17 UTC

My card is a 9800GTX with the 185.82 driver and BOINC 6.6.20.
I don't want to downgrade drivers, so in the meantime I have suspended all WUs for GPUGRID.

I hope to see good news asap.

Sorry for my bad English...

Greetings, Matteo

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Message 9855 - Posted: 16 May 2009 | 13:18:55 UTC - in response to Message 9844.

Hmm, I am not convinced it's just the drivers


GDF said the problems appear with 185.xx and don't show up with some 180.xx, which apparently no one else is still using. This does not mean that 182.xx is fine, and I think the usual "KASHIF_HIVPR" and "IBUCH_KID" problems definitely affect 182.50.
It seems to be a problem with the driver, triggered by some new WUs.

MrS
____________
Scanning for our furry friends since Jan 2002

Profile Paul D. Buck
Send message
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Level
Val
Message 9858 - Posted: 16 May 2009 | 14:11:17 UTC - in response to Message 9855.

Hmm, I am not convinced it's just the drivers


GDF said the problems appear with 185.xx and don't show up with some 180.xx, which apparently no one else is still using. This does not mean that 182.xx is fine, and I think the usual "KASHIF_HIVPR" and "IBUCH_KID" problems definitely affect 182.50.
It seems to be a problem with the driver, triggered by some new WUs.

Well, I have some of these named tasks running on my 9800GT and the GTX295s ... but they don't seem to want to run on the new GTX260 or my GTX280 ... As far as I know, at the moment I am running 182.50 everywhere ...

I suppose I could roll back to the 180.xx to see if I can get a task and if it dies ... heck, nothing else seems to be bothering this problem.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Message 9860 - Posted: 16 May 2009 | 14:20:20 UTC - in response to Message 9858.

Sorry, not very specific post. Not all WUs with those names are affected, e.g. see here. "KASHIF_HIVPR_mon" and "KASHIF_HIVPR_dim" have been fine for me.

MrS
____________
Scanning for our furry friends since Jan 2002

Profile Paul D. Buck
Send message
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Level
Val
Message 9861 - Posted: 16 May 2009 | 14:32:34 UTC - in response to Message 9860.

Sorry, not very specific post. Not all WUs with those names are affected, e.g. see here. "KASHIF_HIVPR_mon" and "KASHIF_HIVPR_dim" have been fine for me.

MrS

Well, I just rolled the driver back to 180.4 and still got an invalid function. The tasks die immediately. Getting very depressed ... can't tell if it is my new systems or bad tasks ...

Profile Paul D. Buck
Send message
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Level
Val
Message 9863 - Posted: 16 May 2009 | 14:58:51 UTC
Last modified: 16 May 2009 | 15:01:12 UTC

This task: p1480000-RAUL_pYEpYI1605-0-10-RND5295_0 started up and I have 5:10 or so on the clock ... so, unlike all the rest, I finally got one running. It is running on the new MB, but the old GPU.

So, this batch of tasks is so bad that most of them won't run on anything ... though my GTX 295s seem to be rolling on ...

{edit}

I was wrong ... it is on one of the new GTX 260 cards ...

Profile Zydor
Send message
Joined: 8 Feb 09
Posts: 252
Credit: 1,309,451
RAC: 0
Level
Ala
Message 9867 - Posted: 16 May 2009 | 17:29:21 UTC - in response to Message 9863.
Last modified: 16 May 2009 | 17:30:11 UTC

GIANNI_FBs have come in for some flak lately, so I thought I would post a successful one as a comparator. The stop/starts in there were me, due to non-BOINC related stuff.

http://www.gpugrid.net/result.php?resultid=677172

Regards
Zy

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Message 9878 - Posted: 16 May 2009 | 21:07:46 UTC

Glad your new rig made it through one WU successfully! The others don't look too well, though. They error on most other hosts as well, but 3 have been finished by other GT200 cards. One of them uses 185.85, but I can't see the others.

MrS
____________
Scanning for our furry friends since Jan 2002

Profile Beyond
Avatar
Send message
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 9883 - Posted: 16 May 2009 | 21:37:49 UTC - in response to Message 9867.
Last modified: 16 May 2009 | 21:39:14 UTC

GIANNI_FBs have come in for some flak lately, so I thought I would post a successful one as a comparator. The stop/starts in there were me, due to non-BOINC related stuff.

http://www.gpugrid.net/result.php?resultid=677172

Regards
Zy

And here's a 205-GIANNI_FB that failed on the same machine after running a LONG time:

http://www.gpugrid.net/result.php?resultid=677771

Profile Paul D. Buck
Send message
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Level
Val
Message 9891 - Posted: 17 May 2009 | 0:52:05 UTC - in response to Message 9878.

Glad your new rig made it through one WU successfully! The others don't look too well, though. They error on most other hosts as well, but 3 have been finished by other GT200 cards. One of them uses 185.85, but I can't see the others.

MrS

I think I had TWO problems, one was OC got turned on by mistake and the automatic mode OC probably tried to do too much. What it broke is not entirely clear to me. It may also have been the BIOS ... I flashed that with the latest and turned off the OC mode at the same time so it is hard to know which it was.

The second problem was of course the bad tasks which would have failed with the other error messages if I had not had problem one on both rigs.

Now I am running into power limits (again) ...

I really gotta call that electrician to change my old 3 phase 230 V UPS socket into a 30A 115 supply...

Profile [AF>Amis des Lapins]Gillo...
Send message
Joined: 21 Mar 08
Posts: 7
Credit: 24,394,688
RAC: 0
Level
Pro
Message 9896 - Posted: 17 May 2009 | 2:34:55 UTC

Same for me:

http://www.gpugrid.net/result.php?resultid=678214
http://www.gpugrid.net/result.php?resultid=664263

Profile Paul D. Buck
Send message
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Level
Val
Message 9897 - Posted: 17 May 2009 | 2:44:10 UTC - in response to Message 9896.

Same for me:

http://www.gpugrid.net/result.php?resultid=678214
http://www.gpugrid.net/result.php?resultid=664263

I don't understand ... you don't like valid tasks?

Profile [AF>Amis des Lapins]Gillo...
Send message
Joined: 21 Mar 08
Posts: 7
Credit: 24,394,688
RAC: 0
Level
Pro
Message 9898 - Posted: 17 May 2009 | 3:06:09 UTC - in response to Message 9897.

Hello,

oops
http://www.gpugrid.net/workunit.php?wuid=466073
http://www.gpugrid.net/workunit.php?wuid=458046

These give me 5500 points for 17/24 hours of crunching (260GTX 216 SPU O/C stable)

Mark Henderson
Send message
Joined: 21 Dec 08
Posts: 51
Credit: 26,320,167
RAC: 0
Level
Val
Message 9899 - Posted: 17 May 2009 | 4:24:17 UTC
Last modified: 17 May 2009 | 5:03:20 UTC

I may be lucky, but I am having very few problems: 185.85 drivers, XP64, 2 EVGA 260s, BOINC 6.6.28. I had 1 compute error yesterday, but that was my fault for suspending right as it started and unsuspending a couple of seconds later, plus a couple of others that everyone else in the quorum errored out on.
I have heard of hanging WUs but have never had one of those either; 99 percent of the time it runs great.
I always take great care to run Driver Sweeper in safe mode after uninstalling Nvidia drivers before updating. I do not know if this matters that much, though.
I also never let the GPU temps get over 65C with a moderate OC.
Also, 4 CPU units of either SETI Astropulse, Einstein or ABC are always running alongside at the same time.
I had a 9800GT in this computer for about a month that ran well too. I replaced it with a 260 this week.

uBronan
Avatar
Send message
Joined: 1 Feb 09
Posts: 139
Credit: 575,023
RAC: 0
Level
Gly
Message 9900 - Posted: 17 May 2009 | 9:59:36 UTC
Last modified: 17 May 2009 | 10:03:20 UTC

I have now been able to save a few hanging units.
This seems to work for me: first, make sure the option to keep units in memory is disabled under options.

I pause all other available units, then I pause the unit whose progress is not moving and resume it. I know it costs a lot of time, because the unit jumps back to an earlier point.

Until now I have had 4 units which stayed at a certain % and did not move for more than half an hour, so I started messing with them.

When I woke up this morning I saw a 92-kashif_hivpr_dim unit reporting 0.700 % done in 7 hours, so I paused it. Of course it jumped back to 0.426 % when it started over, but it then did 1.5 % in half an hour.

So the reason seems to be that the units get stuck in the calculations and finally error out if this takes too long.

But I can tell you it's a pain in the ass: when they hang you hardly notice, and we don't have time to watch all day whether the units progress or not.
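
The check described above (a unit whose percentage has not moved for half an hour or more) can be sketched in Python. This is only an illustration: the `is_stalled` helper, the 30-minute window and the 0.01 % threshold are assumptions of this sketch, not something BOINC provides, and the progress samples would have to be noted down by hand or collected by whatever means you have. What to do once a stall is flagged (pause and resume, as described above) is left out.

```python
from typing import Optional, Sequence, Tuple

# One sample = (wall-clock seconds since some start point, progress in percent).
Sample = Tuple[float, float]


def is_stalled(samples: Sequence[Sample],
               window_s: float = 1800.0,
               min_gain: float = 0.01) -> bool:
    """Return True if progress gained less than min_gain percent
    over (at least) the last window_s seconds.

    samples must be ordered by time, oldest first.
    """
    if len(samples) < 2:
        return False
    latest_t, latest_p = samples[-1]

    # Pick the most recent sample that is at least window_s old.
    baseline: Optional[Sample] = None
    for t, p in samples:
        if latest_t - t >= window_s:
            baseline = (t, p)
        else:
            break
    if baseline is None:
        return False  # not enough history to judge yet

    return (latest_p - baseline[1]) < min_gain


# A unit sitting at the same percentage for an hour is flagged ...
stuck = [(0, 64.688), (900, 64.688), (1800, 64.688), (3600, 64.688)]
# ... while one that gained 1.5 % in half an hour is not.
moving = [(0, 0.426), (1800, 1.926)]

print(is_stalled(stuck), is_stalled(moving))
```

A window shorter than the slowest healthy unit's update interval would give false alarms, so the 30 minutes used here is a guess that may need adjusting per card.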

[boinc.at] Nowi
Send message
Joined: 4 Sep 08
Posts: 44
Credit: 3,685,033
RAC: 0
Level
Ala
Message 9901 - Posted: 17 May 2009 | 10:27:29 UTC

Now I have the fourth error WU in a row. :-(((

http://www.gpugrid.net/result.php?resultid=678849
http://www.gpugrid.net/result.php?resultid=679319
http://www.gpugrid.net/result.php?resultid=680211
http://www.gpugrid.net/result.php?resultid=680860

It wastes a lot of GPU-time for scientific knowledge!
It costs a lot of credits...
It costs a lot of fun...

Is GPUGRID going to be used only with newer cards?

Attention! Sarcasm!
Is there a hidden deal with NVIDIA to push cards with G200...?


My System
Q9550 @ 3.4
8800 GT @ stock
4 GB
Windows 7 RC 64 Bit
185.85

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Message 9903 - Posted: 17 May 2009 | 12:22:29 UTC - in response to Message 9901.

Nowi,

take a look further up in this thread.

[AF>EDLS>BIOMED],

take a look here. I edited the title to make it more clear that this problem also affects previous versions.

Mark,

I also get few errors, but if I look at my tasks I see that these are "friendly" WUs, almost none of the trouble makers. This makes it harder to blame it on config differences..

uBronan,

I think what you're doing is in the end similar to a BOINC restart. It's good to know that this helps, but still it's *irritating* that it seems to happen so often. Which BOINC version do you run? The thing is, I'm running 6.5.0, 185.66 and Vista 64 and from looking at my results I think I did not have a single hanging WU. Every day 2 successful returns, except when errors occurred or with the one "kashif_hivpr_dim" that I had. It registered a runtime of 89839s = 24:57h and gave 10096 credits. The interval between the previous result and this one is 24:55h, so I don't think it was hanging at all.

Of course, just because I ran one of them alright does not mean the problem doesn't exist. I just can't see the pattern.. is it the 6.6.x clients? It's not all of the WUs, it's not all of the 185 drivers, it's not all of the G9x GPUs. What's left?
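
The runtime conversion quoted above (89839 s = 24:57 h) is easy to check for your own results; a minimal Python sketch, with a helper name made up for illustration:

```python
def seconds_to_hms(total_seconds: int) -> str:
    """Format a runtime in seconds as H:MM:SS."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


print(seconds_to_hms(89839))  # 24:57:19
```

Comparing this figure with the gap between two consecutive result timestamps is one way to tell whether a unit spent extra wall-clock time hanging.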

Paul,

I really gotta call that electrician to change my old 3 phase 230 V UPS socket into a 30A 115 supply...


Do you think that's a good idea? I don't know your 230V, but at 115V the power supplies lose efficiency compared to 230V. 30A @ 115V is 3.5kW, quite massive :D
I know we can draw at least 2kW over the regular 230V, whereas I heard the US net may deliver something around 1.5kW at 110V. Our 3 phase plugs are 380V and I think you can get 5 - 6 kW from them.. but you're not talking about these, right?
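
The figures above follow from P = V × I for a single-phase circuit. A quick check in Python, as a sketch only: it ignores power factor and any derating for continuous loads, and the function names are made up for illustration.

```python
def circuit_power_kw(volts: float, amps: float) -> float:
    """Power available from a single-phase circuit in kW (power factor ignored)."""
    return volts * amps / 1000.0


def amps_for_power(kilowatts: float, volts: float) -> float:
    """Current drawn at a given voltage for a given load in kW."""
    return kilowatts * 1000.0 / volts


print(circuit_power_kw(115, 30))            # 3.45, i.e. the "3.5kW" above
print(round(amps_for_power(2.0, 230), 1))   # 8.7 A for 2 kW at 230 V
print(round(amps_for_power(1.5, 110), 1))   # 13.6 A for 1.5 kW at 110 V
```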

MrS
____________
Scanning for our furry friends since Jan 2002

uBronan
Avatar
Send message
Joined: 1 Feb 09
Posts: 139
Credit: 575,023
RAC: 0
Level
Gly
Message 9906 - Posted: 17 May 2009 | 12:48:21 UTC

Sadly I was not paying attention, so the last one did error out again. To be honest I was expecting it to fail anyway, since I had to restart it 3 times in a row to start seeing progress.

I am on Win XP Pro with the 182.50 driver and BOINC 6.6.28. For me there was indeed some gain with 185.85, but I just wanted to make sure the drivers aren't the issue.

The newer driver gave slightly faster finishing times: the old one took 20 - 27 hours and 185.85 between 19 - 23 hours.

I have been trying to test the older 180.xx driver, but it made my system unstable for some reason, so I cleared out all the Nvidia stuff and reinstalled the 182.50 WHQL version.

I am now going to change BOINC back to 6.5.0.

Profile [AF>Amis des Lapins]Gillo...
Send message
Joined: 21 Mar 08
Posts: 7
Credit: 24,394,688
RAC: 0
Level
Pro
Message 9908 - Posted: 17 May 2009 | 13:24:06 UTC

Thank you for the link. I have opened the web page for my PCs. As for the GTX 260 in this PC, I am on BOINC 6.6.20, which satisfied me, and Nvidia 182.08 on Win XP Pro 64.


http://www.gpugrid.net/hosts_user.php?userid=1695

On the contrary for points over 24h00:10.000 points on GPU 260 O/C:( all GPU 280/285GTX

@+

Profile [AF>Amis des Lapins]Gillo...
Send message
Joined: 21 Mar 08
Posts: 7
Credit: 24,394,688
RAC: 0
Level
Pro
Message 9909 - Posted: 17 May 2009 | 13:29:32 UTC

Drivers Nvidia 1XX.XX http://www.nvidia.fr/Download/Find.aspx?lang=fr

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Message 9910 - Posted: 17 May 2009 | 15:11:06 UTC - in response to Message 9908.

I am on BOINC 6.6.20, which satisfied me


Except for the fact that some of your tasks take longer than they should?

MrS
____________
Scanning for our furry friends since Jan 2002

Profile Paul D. Buck
Send message
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Level
Val
Message 9912 - Posted: 17 May 2009 | 15:27:21 UTC - in response to Message 9903.
Last modified: 17 May 2009 | 15:31:13 UTC

I really gotta call that electrician to change my old 3 phase 230 V UPS socket into a 30A 115 supply...


Do you think that's a good idea? I don't know your 230V, but at 115V the power supplies lose efficiency compared to 230V. 30A @ 115V is 3.5kW, quite massive :D
I know we can draw at least 2kW over the regular 230V, whereas I heard the US net may deliver something around 1.5kW at 110V. Our 3 phase plugs are 380V and I think you can get 5 - 6 kW from them.. but you're not talking about these, right?

Yes it does. The problem is that a 230V UPS is about twice as expensive as a normal one ... the last time I looked, one about the size I would need would be about 3K ...

The problem is that I can tell that I am pulling way high on the circuits in use ... if I change to another dedicated line, well, then I can leave some on the current room sockets and the rest on the dedicated line.

The only point of the exercise is to get more power to the room ... I think adding new GPUs is pushing me up to the line again ... at least I got rid of the power hungry systems that were slower than dirt.

In a month or so I will likely get an upgrade card to replace the 9800GT though I will likely keep it in the closet for that time when I upgrade to wider MB and might need a slot filler ...

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Message 9915 - Posted: 17 May 2009 | 17:05:16 UTC - in response to Message 9912.

OK, except cost there's nothing to argue against a dedicated line :)

MrS
____________
Scanning for our furry friends since Jan 2002

Profile [AF>Amis des Lapins]Gillo...
Send message
Joined: 21 Mar 08
Posts: 7
Credit: 24,394,688
RAC: 0
Level
Pro
Message 9918 - Posted: 17 May 2009 | 18:20:11 UTC - in response to Message 9910.
Last modified: 17 May 2009 | 18:57:45 UTC

Yes really 84000s instead of 42000s for 14-KASHIF_HIVPR_dim_ba3-8-100-RND7871_1

http://www.gpugrid.net/result.php?Resultid=680472

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Avatar
Send message
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Message 9921 - Posted: 17 May 2009 | 19:29:15 UTC - in response to Message 9918.

OK, to put it more clearly: you don't like the long runtime, but you say 6.6.18/20 satisfied you. The post I linked to says that the long runtime is caused by an error in 6.6.20 and some previous clients. So something doesn't add up and you may want to up- or downgrade ;)

MrS
____________
Scanning for our furry friends since Jan 2002

Profile [AF>Amis des Lapins]Gillo...
Send message
Joined: 21 Mar 08
Posts: 7
Credit: 24,394,688
RAC: 0
Level
Pro
Message 9928 - Posted: 17 May 2009 | 20:45:20 UTC
Last modified: 17 May 2009 | 20:50:33 UTC

I have switched to BOINC 6.6.28. It is possible that Seti beta is responsible for this problem. Thanks for your help; I will keep PS3GRID posted on what follows.

Profile Aardvark
Avatar
Send message
Joined: 27 Nov 08
Posts: 28
Credit: 82,362,324
RAC: 0
Level
Thr
Message 9932 - Posted: 17 May 2009 | 22:47:37 UTC - in response to Message 9921.

I rolled back my drivers from 185.85 to 182.50, with Windows Vista 64 bit and BOINC client 6.6.28. Since then I have returned three successful results, one of which had run for 30 hours on one core of my 9800 GX2 and gave me just over 10,000 credits :-)
So at present this rollback of the driver is working for me (touch wood).

I also rolled back the driver on my other machine from 185.85 to 180.48, with Windows Vista 32 bit and BOINC client 6.6.20 (yes, I know :-) ). This has so far returned one result, with another well on its way. I realise that neither of these is a large sample, but it looks promising given the quantity of failures I had seen just prior to changing drivers.

I will now leave alone for a few days and see how things turn out.

Profile Paul D. Buck
Send message
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Level
Val
Message 9933 - Posted: 18 May 2009 | 1:03:27 UTC

I am finding it hard to tell what is going on... I seem to be getting tasks out of order so that they don't sort well on the results pages. As I watch the computers they seem to be returning mostly good results ... with occasional errors.

Well, I guess I will have to wait till Monday when the staff comes back in and fixes the universe ... :)

Profile Paul D. Buck
Send message
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Level
Val
Message 9937 - Posted: 18 May 2009 | 8:46:07 UTC

I still cannot make heads nor tails of the pattern of errors. One of the problems of course is the difficulty of gathering data about the failures.

Some of the older tasks that failed on one of my systems passed on another system that is very much alike. I thought I was onto something about memory size, where some of my cards have that 895 instead of 1G and the tasks passed on the 1G cards. Alas, I quickly found another case where it failed on mine and passed on someone else's card, and they too had only 895 M VRAM.

Driver versions 182.50 on my systems failed, but the systems where the task passed also were running the same version.

The tasks are of all name classes...

Even my i7 with the pair of GTX295 cards finally had [url=http://www.gpugrid.net/result.php?resultid=685755]one fail[/url], the message is singularly unhelpful.

MarkJ
Volunteer moderator
Volunteer tester
Send message
Joined: 24 Dec 08
Posts: 738
Credit: 200,909,904
RAC: 0
Level
Leu
Message 9938 - Posted: 18 May 2009 | 8:55:14 UTC

I upgraded just the drivers on all my machines to 185.85. I had a couple of machines start getting errors.

Interestingly, Seti doesn't get errors with their app. However, as I'm using an app_info for them I dropped in the latest DLLs. It may just be that their app is more compatible, or maybe it is the combination of the current driver with the CUDA 2.2 DLLs that makes it work.

Has anyone tried updating the DLLs to see if that cures the problem? The only way I could see to do this is to set up an app_info so that you don't get issues with the file signatures.

I'll downgrade Maul (it has 2 x GTX260s) to 182.50 once it's knocked over its current CUDA work. At least it can get back to being productive while this issue gets worked out. My other machines can concentrate on Seti for a while.
____________
BOINC blog

uBronan
Avatar
Send message
Joined: 1 Feb 09
Posts: 139
Credit: 575,023
RAC: 0
Level
Gly
Message 9942 - Posted: 18 May 2009 | 13:19:05 UTC - in response to Message 9928.
Last modified: 18 May 2009 | 13:27:35 UTC

I have switched to BOINC 6.6.28. It is possible that Seti beta is responsible for this problem. Thanks for your help; I will keep PS3GRID posted on what follows.


Believe me, you don't want to run Seti beta together with other projects.

Seti itself has also been crashing my GPUGRID units, though sometimes it ran without problems. Seti seems to use only CUDA 1.0 instructions with no optimisations unless you use the optimized applications.
The optimized KWSN application has caused failures on GPUGRID for me as well.

But that's probably because Seti was running at the same time as GPUGRID while I have only 1 CUDA device.

I advise you not to use Seti and GPUGRID at the same time; I have seen it crash many units.
Although sometimes it looks like nothing is wrong, I found that some units keep the memory locked, so when they finish the RAM is not released properly, causing other projects (GPUGRID) to error out.

Another one which can give you problems together with GPUGRID is CPDN, which has units that eat at least 1.5 GB of memory. For me that meant 4 units at 1.5 GB minimum gave me a load of 7.2 GB of RAM being used :D
Now believe me, that makes trouble. If I had booted under Win 64 I probably could run them, since I have 8 GB of memory.
But since I run 32-bit Windows it only uses 3.2 GB.

Has anyone tried to use updated DLLs


Believe me, I tried all combinations of drivers, BOINC and CUDA versions.

Every time the result is the same: in the end some units simply crash. Even when I babysit them, they seem to know when I am busy doing other tasks and crash ;)
So it looks to me that if a unit gets locked it will die unless you are in time to pause and restart it.
I mean by that: the unit is locked at x.xxx % for at least an hour. If it does move the %, you can try the pause/restart trick, but some units will still crash no matter what I do.
Now make sure not to restart it too quickly after it has started again, because that will surely crash it as well!

Profile GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 14 Mar 07
Posts: 1947
Credit: 629,356
RAC: 0
Level
Gly
Message 9958 - Posted: 18 May 2009 | 22:37:42 UTC - in response to Message 9942.

we are running this set of workunits called

x-GIANNI_newFB-...

If they go on ok, then we have isolated the problem with G90 chips. It is not solved yet but still at least we would know where to look.

gdf

Profile Zydor
Send message
Joined: 8 Feb 09
Posts: 252
Credit: 1,309,451
RAC: 0
Level
Ala
Message 9961 - Posted: 19 May 2009 | 9:19:15 UTC - in response to Message 9942.

The CPDN memory limit of 1.5GB is set that way to allow for four running on a quad, with enough left over for the op sys etc within the quoted figure of 1.5GB. Each of the larger CPDN WUs takes up 210-220MB in memory, therefore four of them will eat around 850MB, with a comfortable margin for the op sys etc within the stated 1.5GB. It's not 1.5GB for each WU; that figure, which they state as advisory, is total memory on the PC, not per WU.
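
The arithmetic behind that can be sketched in Python. The per-task sizes are the ones quoted above; treating 1.5GB as 1536 MB and the helper name are assumptions of this sketch.

```python
ADVISORY_TOTAL_MB = 1536  # 1.5 GB, assumed here to mean 1.5 * 1024 MB


def memory_budget(per_task_mb: int, tasks: int, total_mb: int = ADVISORY_TOTAL_MB):
    """Return (memory used by all tasks, headroom left for the OS etc.) in MB."""
    used = per_task_mb * tasks
    return used, total_mb - used


# Four of the larger CPDN work units on a quad core, at 210 and 220 MB each.
for per_task in (210, 220):
    used, headroom = memory_budget(per_task, tasks=4)
    print(f"{per_task} MB x 4 = {used} MB used, {headroom} MB left")
```

With these inputs the four tasks use 840 to 880 MB, which is where the "around 850MB" figure comes from, and several hundred MB remain within the advisory total.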

Most CPDN models are much smaller, either side of 100MB, albeit on the larger side compared with most BOINC WUs. I have happily run four of the biggest ones on my quad without issues, with GPUGRID on the 9800GTX+. Usually I have two of the bigger CPDN ones running with two SETI Astropulse on the quad and GPUGRID on the 9800GTX+; they run fine with no issues.

Regards
Zy

Profile GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 14 Mar 07
Posts: 1947
Credit: 629,356
RAC: 0
Level
Gly
Message 9964 - Posted: 19 May 2009 | 11:24:57 UTC - in response to Message 9961.

We should be able to test a fix by tomorrow.
It's a test, as the problem is not completely understood.

gdf

uBronan
Avatar
Send message
Joined: 1 Feb 09
Posts: 139
Credit: 575,023
RAC: 0
Level
Gly
Message 9967 - Posted: 19 May 2009 | 13:20:26 UTC
Last modified: 19 May 2009 | 13:24:15 UTC

Zydor, did you actually select the big units on your account page? By default they are not loaded; you really should read what is stated there.
Those units need a minimum of 1.5 GB of memory, nothing else; the warning is clear >.<

Sadly I forgot to take a screenshot when I was running 4 of those biggest units at the same time.

Of course, I think the chance that you get 1, or even more than 2, of those big ones is very small.

I have not seen any of those big units recently, so it could have been a freak moment that I received 4 of them at once. It could also be that these huge units are only sent to x64 machines; I have not been following any news about them.

The only thing I can say about you running Seti and GPUGRID together is that in my case it ended up several times with crashing my PC or the unit, but again, I was having more problems with Seti beta than with normal Seti.

That it does not happen to you does not mean other people will be so lucky that all goes well.

Profile Paul D. Buck
Send message
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Level
Val
Message 9969 - Posted: 19 May 2009 | 15:10:24 UTC

I just had a case where a suspended CPDN task caused two GPU Grid tasks to go into waiting for memory state. I had to stop BOINC and restart it so that the CPDN task (only 300K) would be swapped out ...

As usual I reported it so that there is another bug for UCB to ignore ... :)

Profile Zydor
Send message
Joined: 8 Feb 09
Posts: 252
Credit: 1,309,451
RAC: 0
Level
Ala
Message 9981 - Posted: 19 May 2009 | 21:32:48 UTC - in response to Message 9967.
Last modified: 19 May 2009 | 21:36:08 UTC

Task Manager reported the larger ones taking up 220MB minimum, going to 400MB at times, and four did run fine. You can get four by setting preferences for only those units.

I thought the same as you re the 1.5GB, but also thought it strange they would produce one that size even these days; it would cripple many PCs, not a good thing for general release.

I therefore checked it out with CPDN. The response was that they take up at most 500MB, and four of them would fit on a PC with 3GB with no issues. When I ran the four, Task Manager reported between 220 and 400MB in use; it may well have gone to 500MB when I was not watching, I didn't log it. The post and response are here:

http://climateprediction.net/board/viewtopic.php?f=21&t=8675

I had no doubts you had issues running the two. It's also the case that reports of combinations that succeed can often provide as much info to help debug as failures do, as a comparator can help isolate an issue. I've often cursed when something hasn't worked, then scratched my head when I discovered others were having some success; it helped me.

Regards
Zy

Profile GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 14 Mar 07
Posts: 1947
Credit: 629,356
RAC: 0
Level
Gly
Message 9991 - Posted: 20 May 2009 | 8:16:08 UTC - in response to Message 9981.

So, it seems that there is a bug in the compiler/hardware which appears only on pre G200 cards.
We found a way to avoid it for now, but it limits what we can do, so it is not a solution.

gdf

jrobbio
Send message
Joined: 13 Mar 09
Posts: 59
Credit: 324,366
RAC: 0
Level

Message 9992 - Posted: 20 May 2009 | 8:46:27 UTC - in response to Message 9991.

So, it seems that there is a bug in the compiler/hardware which appears only on pre G200 cards.
We found a way to avoid it for now, but it limits what we can do, so it is not a solution.

gdf


Well, at least it isn't a mystery any more. When you are on the bleeding edge, you should expect some cuts.

Hope it gets resolved in the not too distant future.

Rob

MarkJ
Volunteer moderator
Volunteer tester
Send message
Joined: 24 Dec 08
Posts: 738
Credit: 200,909,904
RAC: 0
Level
Leu
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 9995 - Posted: 20 May 2009 | 11:27:05 UTC - in response to Message 9991.

So, it seems that there is a bug in the compiler/hardware which appears only on pre G200 cards.
We found a way to avoid it for now, but it limits what we can do, so it is not a solution.

gdf


How come the G200 based cards also get failures?

Will there be an updated app for the non-G200 machines, or perhaps all machines? Will this be a CUDA 2.2 app or stick with the old version for the time being?

Can we use the 185.85 drivers now or with the new app (assuming there will be one)?
____________
BOINC blog

Profile KWSN-Sir Papa Smurph
Send message
Joined: 17 Mar 09
Posts: 5
Credit: 7,136,253
RAC: 0
Level
Ser
Scientific publications
watwatwatwatwatwat
Message 9998 - Posted: 20 May 2009 | 12:00:52 UTC

I am running an 8800GT and three 9800GTX+ cards. I have had zero completed KASHIF WUs.
Could you make a way for me to "opt out" of that type of unit?

I still get occasional errors with other units, but the KASHIF ones have a 100% failure rate for me (3 different machines).

I am really unable to babysit my machines as I am away from home for days at a time.....

Scott Brown
Send message
Joined: 21 Oct 08
Posts: 144
Credit: 2,973,555
RAC: 0
Level
Ala
Scientific publications
watwatwatwatwatwat
Message 10000 - Posted: 20 May 2009 | 12:48:35 UTC - in response to Message 9958.

we are running this set of workunits called

x-GIANNI_newFB-...

If they go on ok, then we have isolated the problem with G90 chips. It is not solved yet but still at least we would know where to look.

gdf



These hang on my Pent D 830 with a 9600GSO. See here for a hung result that was aborted after more than 24 hours of no progress (hung at about 21%).

rbpeake
Send message
Joined: 30 Jul 08
Posts: 17
Credit: 72,993,188
RAC: 0
Level
Thr
Scientific publications
watwatwatwatwat
Message 10002 - Posted: 20 May 2009 | 14:40:55 UTC - in response to Message 9998.

Just as a data point of reference, I have had 100% success on all work units using a GTX 260 Core 216 card, running CUDA 2.2 and 185.85 driver, even on work units that have had failures previously.

Profile Zydor
Send message
Joined: 8 Feb 09
Posts: 252
Credit: 1,309,451
RAC: 0
Level
Ala
Scientific publications
watwatwatwat
Message 10004 - Posted: 20 May 2009 | 15:16:35 UTC - in response to Message 9991.

There is another way .....

Tap your well heeled benefactor you have tucked away for a mere $400,000 worth of vouchers to upgrade crunchers' pre-200s to 300GTXs - a snip at the price....

And ..... added value!! ..... You'd also solve cruncher recruitment for a while, they'd queue round to the next street, let alone the next block, for that one .....

*sigh* ........ always nice to dream once in a while :)

Regards
Zy

rbpeake
Send message
Joined: 30 Jul 08
Posts: 17
Credit: 72,993,188
RAC: 0
Level
Thr
Scientific publications
watwatwatwatwat
Message 10005 - Posted: 20 May 2009 | 15:24:44 UTC - in response to Message 10004.

There is another way .....

Tap your well heeled benefactor....

Regards
Zy

Believe me, I am tapping my heels that my card continues to function as well as it has! I just bought it, so it would be a big disappointment if there were issues so soon....but the issue fix would appear to be possible without a card upgrade, hopefully....(although NVIDIA I would guess is tapping its heels that many will upgrade...ouch! ;)

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Send message
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 10020 - Posted: 21 May 2009 | 10:09:42 UTC - in response to Message 9995.

How come the G200 based cards also get failures?


Let's see what the fix can do and who still gets failures afterwards. Mind you, there's also the "regular failure rate", some kind of "noise floor" which affects all cards.

Will there be an updated app for the non-G200 machines, or perhaps all machines? Will this be a CUDA 2.2 app or stick with the old version for the time being?
Can we use the 185.85 drivers now or with the new app (assuming there will be one)?


Not speaking officially, but I wouldn't rush to introduce another variable in the current situation. Wait until the dust settles and we're confident that the problems have been solved. 185.66 has been running fine for me with non-troublemaker WUs, so I'll keep using it until I see problems.

I do have a WU issued today and it appears to use client 6.64, so it looks like there is no new app for now. But this could be tied to an old type of WU as well.

MrS
____________
Scanning for our furry friends since Jan 2002

Profile Zydor
Send message
Joined: 8 Feb 09
Posts: 252
Credit: 1,309,451
RAC: 0
Level
Ala
Scientific publications
watwatwatwat
Message 10071 - Posted: 22 May 2009 | 21:31:39 UTC - in response to Message 10020.

Just had a Kashif go bang

ERROR: c:\cygwin\home\speechserver\gpumd2\src\pme\CPME_cufft.cu, line 104: cufftExecC2R (gridcalc3)

http://www.gpugrid.net/result.php?resultid=699822

I have had a problem on that PC re Office, and got it back online 5 hours ago. However, I don't think it was that issue. I think it looks like the old WU problem surfacing, maybe in one of the older WUs still in the system?

Regards
Zy

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Send message
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 10088 - Posted: 23 May 2009 | 14:38:09 UTC - in response to Message 10071.

Looks like the old problem, and the WU was created after 20 May 16:44 CEST, when the fix was applied. I think it would be better to post such observations in the new thread, so they don't get lost.

MrS
____________
Scanning for our furry friends since Jan 2002
