Message boards : Number crunching : GPU Errors - It's not just me

DigitalDingus
Joined: 2 Jun 09
Posts: 10
Credit: 21,969,126
RAC: 0
Message 13826 - Posted: 8 Dec 2009 | 4:29:45 UTC
Last modified: 8 Dec 2009 | 4:30:24 UTC

http://www.gpugrid.net/workunit.php?wuid=1005808

If you take a look at these errors, it's not just my own PC. It's everyone else's. This is not a PC problem. It's a GPU problem, and I'd appreciate somebody looking into this.
____________

DigitalDingus
Joined: 2 Jun 09
Posts: 10
Credit: 21,969,126
RAC: 0
Message 13879 - Posted: 11 Dec 2009 | 14:22:17 UTC - in response to Message 13826.
Last modified: 11 Dec 2009 | 14:22:37 UTC

Another example:

http://www.gpugrid.net/workunit.php?wuid=1009407
____________

ENDYMION IV
Joined: 26 Nov 09
Posts: 3
Credit: 317,354
RAC: 0
Message 13880 - Posted: 11 Dec 2009 | 15:28:13 UTC

I have had problems too, on two PCs: sometimes after 6-10 s of crunching, sometimes after 40 h, which is very disagreeable ...

Please do something. You should ...

Regards
____________

Barraud Denis
Joined: 2 Sep 08
Posts: 15
Credit: 36,207,656
RAC: 0
Message 13888 - Posted: 12 Dec 2009 | 2:06:20 UTC

Sparkle GTX 250 1 GB, no overclock, fan 90%, 54°C
BOINC 6.1.21 / 195.62
Q9550, 4 GB DDR2 @ 333.3 1:1, XP 32-bit SP2

example: http://www.gpugrid.net/result.php?resultid=1611786

A lot of WUs end in error after less than 10 seconds...

Cuda error: Kernel [shake_step_1] failed in file 'shake.cu' in line 79 : unspecified launch failure.
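
When comparing many failed results, it can help to pull the kernel name, source file, line number and reason out of messages like the one above. The following is a minimal sketch, assuming other errors follow the same layout as this single quoted line; the pattern and the `parse_cuda_error` helper are hypothetical and not part of BOINC or the GPUGRID application.

```python
import re

# Pattern inferred from the one message quoted in this post (an assumption);
# other GPUGRID error lines may be formatted differently.
CUDA_ERROR = re.compile(
    r"Cuda error: Kernel \[(?P<kernel>[^\]]+)\] failed in file "
    r"'(?P<file>[^']+)' in line (?P<line>\d+) : (?P<reason>.+?)\.?$"
)


def parse_cuda_error(text):
    """Return kernel, file, line and reason as a dict, or None if no match."""
    match = CUDA_ERROR.search(text.strip())
    if match is None:
        return None
    info = match.groupdict()
    info["line"] = int(info["line"])  # line number as an integer
    return info


message = ("Cuda error: Kernel [shake_step_1] failed in file 'shake.cu' "
           "in line 79 : unspecified launch failure.")
print(parse_cuda_error(message))
```

Lines that do not match, such as `called boinc_finish`, return `None`, so they can be skipped when scanning a longer stderr listing.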


!!!!!! Damned GTX 250
____________

GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 14 Mar 07
Posts: 1957
Credit: 629,356
RAC: 0
Message 13932 - Posted: 14 Dec 2009 | 21:45:10 UTC - in response to Message 13888.

You should update to the latest drivers.
This will let you receive the cuda23 application, which should solve your problems.

gdf

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 13958 - Posted: 15 Dec 2009 | 21:14:12 UTC - in response to Message 13932.
Last modified: 15 Dec 2009 | 21:16:03 UTC

You should get your video card drivers directly from NVidia rather than using a Microsoft update service!

http://www.nvidia.co.uk/Download/index.aspx?lang=en-uk
Or similar, for different regions.

jphelan
Joined: 20 Jul 08
Posts: 4
Credit: 4,082,270
RAC: 0
Message 13977 - Posted: 18 Dec 2009 | 4:00:59 UTC - in response to Message 13932.

Boy, do I have a flash for you! I've used "cuda23". It still doesn't work! When I use it with SETI I have no problems. I've discontinued crunching numbers for GPUGRID until you guys get your act together.

jphelan1242@hotmail.com
____________

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 13996 - Posted: 19 Dec 2009 | 17:42:42 UTC - in response to Message 13977.
Last modified: 19 Dec 2009 | 17:44:28 UTC

Your 9800 GTX+ card uses a G92 core, and there is a CUDA bug that causes problems on G92 cores with CUDA 2.3.
As you are using driver version 19107, this bug will be exposed more than with more recent drivers. If you do decide to try to run that card, you should update the driver and set your BOINC preferences so that the card does not crunch GPUGrid tasks when you are using the computer.
I found that crunching and playing videos don't go together on the G92 cores. This might also include web surfing, with all the online media content these days.

ENDYMION IV
Joined: 26 Nov 09
Posts: 3
Credit: 317,354
RAC: 0
Message 13998 - Posted: 19 Dec 2009 | 23:23:20 UTC - in response to Message 13996.

I have two PCs with GT130 GPU cards, updated with the latest driver version from NVIDIA (and not from Microsoft). One of my PCs runs Vista, the other Windows 7 (much better!). They are not overclocked, and GPUGrid computes only when the PC is idle.

Nevertheless, two WUs out of three run into an annoying compute error.

Doctor, is this normal? Which one is sick: GPUGRID or my two computers?

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 14002 - Posted: 20 Dec 2009 | 13:23:30 UTC - in response to Message 13998.

Your GT130 cards use a 65nm GPU core (G96M). Relative to other cards of this level, they will be more prone to heat and cooling problems. They may also struggle to complete tasks in time, as they have only 32 shaders. Even if the system were on 24/7, some tasks would take 3 days to complete. It is asking a lot of any program to run for 3 days without a glitch, so running one set of calculations for that long is always likely to be error prone.
That said, most problems are being caused by a CUDA bug. When GPUGrid moved to CUDA 2.3 this new bug seemed to raise its head, and was at first difficult to identify. It seems to primarily affect G92 cores, but is known to cause problems with the GT200 (not GT200b) and obviously the G96M cores to some extent too.
After checking dozens of tasks from many people, I found that the TONI_HERG tasks tend to fail more than others, but this is not to say that other tasks will not also fail.
I would suggest that you keep an eye on the tasks arriving at your system and abort any TONI_HERG tasks that come in. It would also be a good idea to make sure you do not receive more than one task at a time. By the time you finish one task, the other's deadline will be rapidly approaching!
I see from your message that you have already implemented the other good suggestions (don't run tasks when the system is in use). I managed to improve my GTS250 performance from 25% lost time to about 11.5% lost time. It is still improving. The techs also looked at reducing some task lengths, and were at least asked to look into an allocation system based on the cards people have.
One last thing to watch out for is updates; these tend to force applications to close and restart your system. Forcing applications to close crashes tasks!

So a bit of PC management might improve things.
Good luck,

ENDYMION IV
Joined: 26 Nov 09
Posts: 3
Credit: 317,354
RAC: 0
Message 14006 - Posted: 20 Dec 2009 | 18:39:59 UTC - in response to Message 14002.

Thanks for your very detailed answer.

I already keep an eye on the deadline of each WU, and I allow new tasks only when the current WU is near its end.

Following your advice, I'll try setting the BOINC Manager to "Leave application in memory while suspended", because I suppose that pausing and restarting could have the same effect as closing an application.


____________

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 14008 - Posted: 20 Dec 2009 | 19:18:43 UTC - in response to Message 14006.

Leaving the application in memory while suspended is a good idea. If you close BOINC and then open it again, tasks resume from their last saved positions. This could be 1 second ago or, more likely, several minutes ago. So if someone kept stopping and starting, and did not keep tasks in memory, they might not get through any tasks before the deadline.
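
The effect of repeated stopping and starting can be illustrated with a small model. This is only a sketch under stated assumptions: checkpoints are taken at a fixed interval of accumulated progress (real applications choose their own schedule), the task is unloaded after every run segment unless it is kept in memory, and `progress_after_stops` is a hypothetical helper, not a BOINC function.

```python
def progress_after_stops(run_segments, checkpoint_interval, keep_in_memory):
    """Seconds of useful progress retained after a series of run segments.

    run_segments: seconds of crunching before each suspend (assumed values).
    checkpoint_interval: seconds of progress between saved positions
        (an assumption; the real interval depends on the application).
    keep_in_memory: True if a suspended task stays loaded and loses nothing.
    """
    progress = 0
    for run in run_segments:
        progress += run
        if not keep_in_memory:
            # An unloaded task restarts from its last saved position,
            # so anything computed since that checkpoint is discarded.
            progress -= progress % checkpoint_interval
    return progress


segments = [500] * 6  # six runs of 500 s, each interrupted before a checkpoint
print(progress_after_stops(segments, 600, keep_in_memory=True))   # 3000
print(progress_after_stops(segments, 600, keep_in_memory=False))  # 0
```

In this model, run segments shorter than the checkpoint interval never retain anything when the task is unloaded each time, which is the situation described above.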

Hydropower
Joined: 3 Apr 09
Posts: 70
Credit: 6,003,024
RAC: 0
Message 14015 - Posted: 21 Dec 2009 | 15:46:44 UTC - in response to Message 13932.
Last modified: 21 Dec 2009 | 15:48:49 UTC

My copy of cuda23 fails with this error:

ERROR: mdsim.cu, line 101: Failed to parse input file
called boinc_finish

Is this a known issue?

http://www.gpugrid.net/workunit.php?wuid=1037695
http://www.gpugrid.net/workunit.php?wuid=1037822
http://www.gpugrid.net/workunit.php?wuid=1037814
etc.
____________
Join team Bletchley Park, the innovators.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1620
Credit: 8,894,376,317
RAC: 19,767,827
Message 14016 - Posted: 21 Dec 2009 | 16:30:48 UTC - in response to Message 14015.

You had six in a row on host 34464 - all IBUCH_reverse1_pYEEI.

I just got one from that sequence on host 45218, but the next was a GIANNI_BIND which is happily crunching.

I think it must be a bad batch of WUs - if it wasn't a known issue before, I hope it is now.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1620
Credit: 8,894,376,317
RAC: 19,767,827
Message 14017 - Posted: 21 Dec 2009 | 19:06:04 UTC

And similarly on host 43404.

IBUCH_reverse1_pYEEI failed, following KASHIF_HIVPR running fine.
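
The "bad batch" reasoning can be made explicit by tallying outcomes per batch name across hosts. A rough sketch, assuming the rows below are a fair transcription of the reports in this thread (tasks described as running fine are recorded as "ok", meaning not failed at the time of posting); `failures_by_batch` is a hypothetical helper.

```python
from collections import Counter

# (host, batch, outcome) rows transcribed by hand from the posts above.
reports = (
    [(34464, "IBUCH_reverse1_pYEEI", "failed")] * 6
    + [
        (45218, "IBUCH_reverse1_pYEEI", "failed"),
        (45218, "GIANNI_BIND", "ok"),
        (43404, "IBUCH_reverse1_pYEEI", "failed"),
        (43404, "KASHIF_HIVPR", "ok"),
    ]
)


def failures_by_batch(rows):
    """Map batch name -> (failed count, total count, distinct failing hosts)."""
    total = Counter(batch for _, batch, _ in rows)
    failed = Counter(batch for _, batch, outcome in rows if outcome == "failed")
    hosts = {}
    for host, batch, outcome in rows:
        if outcome == "failed":
            hosts.setdefault(batch, set()).add(host)
    return {b: (failed[b], total[b], len(hosts.get(b, ()))) for b in total}


for batch, (bad, seen, n_hosts) in sorted(failures_by_batch(reports).items()):
    print(f"{batch}: {bad}/{seen} failed on {n_hosts} host(s)")
```

A batch that fails on several unrelated hosts while other batches run on the same hosts points at the work units rather than at any one machine.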

Hydropower
Joined: 3 Apr 09
Posts: 70
Credit: 6,003,024
RAC: 0
Message 14019 - Posted: 21 Dec 2009 | 20:16:39 UTC - in response to Message 14017.

Thank you for that clarification. I did indeed get a new set of files to crunch; they seem to work fine now.

ignasi
Joined: 10 Apr 08
Posts: 254
Credit: 16,836,000
RAC: 0
Message 14027 - Posted: 22 Dec 2009 | 10:42:24 UTC - in response to Message 14017.

Noticed.
They should be getting cancelled.

thanks,
i
