Message boards : Graphics cards (GPUs) : Unspecified kernel launch failures

Sentynel
Joined: 13 Feb 10
Posts: 7
Credit: 5,369
RAC: 0
Message 15217 - Posted: 14 Feb 2010 | 17:41:57 UTC

I've now had two work units in as many days error out with some variation on the above error message, one GPUGRID and one PrimeGrid, both times a significant proportion of the way through the WU.

GPUGRID gave me:
Cuda error: Kernel [pme_fill_charges_accumulate] failed in file 'fillcharges.cu' in line 73 : unspecified launch failure.
PrimeGrid gave:
Cuda error: kernel invocation: unspecified launch failure

The card is a 9600GT, apparently very slightly factory overclocked (to 675MHz, from 650MHz stock), running under 64-bit Ubuntu Linux with 190.53 drivers and BOINC 6.10.17. According to the NVIDIA settings app, the temperature hasn't gone above 70C under GPUGRID or 65C under PrimeGrid, with the fan at about 50%.

I figure it can't be a project-specific bug, and those temperatures are a long way under the throttling limits. Could it be the factory overclock, and if so how would I reverse that? Or do I just have a dodgy card and I'll have to put up with it?

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 15222 - Posted: 14 Feb 2010 | 23:51:41 UTC - in response to Message 15217.

May I suggest you stick to one GPU project at a time. You may also want to upgrade that driver and then restart your system; finish whatever tasks you are running first and set the projects to fetch no new tasks.
I would also suggest that you configure BOINC so that your GPU stops crunching when you are using your system, just in case your use causes the crash, say when you are watching online video or a movie.

GL,

Sentynel
Joined: 13 Feb 10
Posts: 7
Credit: 5,369
RAC: 0
Message 15223 - Posted: 14 Feb 2010 | 23:58:05 UTC
Last modified: 15 Feb 2010 | 0:10:44 UTC

190.53 is the latest stable driver. Both crashes were while I wasn't using the computer.
Is there a problem with running multiple GPU projects? I mostly enabled GPU WUs from PrimeGrid as a test after the GPUGRID unit errored out, but I can easily turn them off again.
Edit: worked out how to enable clock speed alteration in the nVIDIA settings program and turned the core frequency down to stock speeds; will see if it keeps crashing.

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 15225 - Posted: 15 Feb 2010 | 10:19:14 UTC - in response to Message 15223.

There is a fair chance your problems were caused by overclocking. Even if a system runs stably for ages, a new application or task could stress the card in different ways, and a couple of degrees' change in room temperature could tip your configuration towards instability (someone shut the room door, or it just got slightly warmer outside and inside).

In the past I ran more than one GPU project and found that it caused an increase in errors, so I stopped. I can't say if this is still the case, or a factor in your problems, but if you stopped doing it and found that there were no more errors, it would suggest that was the problem.

I hope you stop getting errors after going back to factory settings.

Sentynel
Joined: 13 Feb 10
Posts: 7
Credit: 5,369
RAC: 0
Message 15228 - Posted: 15 Feb 2010 | 10:33:20 UTC - in response to Message 15225.
Last modified: 15 Feb 2010 | 10:36:34 UTC

Just had another errored PrimeGrid unit, having clocked down to stock settings and disabled new GPUGRID units until this gets sorted or otherwise (less lost credit from an error in an hour-long unit than in a day-long unit).

Any other suggestions?

Edit: There are 195.36.03 beta drivers available, might they be worth a try, or are they likely to be less stable?

GPUGRID Role account
Joined: 15 Feb 07
Posts: 134
Credit: 1,349,535,983
RAC: 0
Message 15231 - Posted: 15 Feb 2010 | 11:39:52 UTC - in response to Message 15228.

Down-clock the GPU (and any other overclocked part of the system) to reference speeds. Let us know if the problem persists thereafter.

MJH

Sentynel
Joined: 13 Feb 10
Posts: 7
Credit: 5,369
RAC: 0
Message 15233 - Posted: 15 Feb 2010 | 11:50:48 UTC - in response to Message 15231.
Last modified: 15 Feb 2010 | 11:52:58 UTC

The GPU is now running at reference settings (650MHz), yes, and nothing else has been overclocked. I've had one computation failure so far in about 12 hours since I did that, which is the same sort of rate I was experiencing before downclocking.

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 15235 - Posted: 15 Feb 2010 | 13:19:28 UTC - in response to Message 15233.

I am more convinced that the problems are related to running 2 projects.
Just run one, to see if there is an improvement.

Sentynel
Joined: 13 Feb 10
Posts: 7
Credit: 5,369
RAC: 0
Message 15236 - Posted: 15 Feb 2010 | 16:58:32 UTC - in response to Message 15235.

As I said, I'm not accepting any new GPUGRID units, only PrimeGrid's shorter units. I haven't run a GPUGRID unit since the one that crashed. (I can, of course, turn off PrimeGrid's and turn on GPUGRID's, but since both are giving the same error and PrimeGrid's short units lose me less credit when they crash, I'd prefer to keep it this way around.)

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 15239 - Posted: 15 Feb 2010 | 18:33:13 UTC - in response to Message 15236.
Last modified: 15 Feb 2010 | 18:37:06 UTC

You were correct to disable GPUGrid tasks and stick to the shorter tasks, at least to see what happens. However you are still getting the odd error. So you have identified that the error is not limited to when 2 different projects are running, but you have not found the cause of the errors. I know it made sense to run shorter tasks (because any error will have less impact on your contribution and points). However, if you were running both projects at the same time and got GPUGrid errors, these errors could have happened as a result of a bug from the PrimeGrid project (for example when switching between tasks - speculation of course).

So if you stop running PrimeGrid and run GPUGRID exclusively, you will be able to confirm whether the problem is caused solely by PrimeGrid or occurs with any GPU application. The latter would indicate a hardware or driver issue rather than an application or task issue. You could then try driver updates (or previous versions) or look for a hardware issue.

If you are looking for a hardware problem, eliminate the most likely causes first: the generic ones. So test your RAM, do a disk check, make sure your card is firmly seated, and check that the temperatures are not getting out of hand. It is also a good idea to disable any unnecessary start-up processes and uninstall any programs and devices you don't need.

As for PrimeGrid, I don't run it, so I can't make any specific suggestions. Although some others here might be able to help you, the best place for help with any PrimeGrid errors would be the PrimeGrid forums.

Good Luck,

Sentynel
Joined: 13 Feb 10
Posts: 7
Credit: 5,369
RAC: 0
Message 15240 - Posted: 15 Feb 2010 | 18:48:58 UTC - in response to Message 15239.

The GPUGRID task was the only GPU WU on the system when it crashed; PrimeGrid's tasks were only fetched after that one errored out, according to the message log. I'm pretty sure we can rule out project-specific bugs. (I'm asking here rather than PrimeGrid because, given the error isn't limited to either project, it makes sense to ask the heavily GPU-centric project, rather than the one with a single GPU app).

I doubt it's a general system hardware problem as I've had no CPU computing errors, but I will run further tests and check things are seated properly etc. Temperatures are at maximum 70C, well below the area I'd expect to see problems, and cooler with PrimeGrid (max 65C).

Assuming I don't find anything, I'll try the beta drivers and report back.

GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 14 Mar 07
Posts: 1957
Credit: 629,356
RAC: 0
Message 15263 - Posted: 16 Feb 2010 | 22:08:26 UTC - in response to Message 15240.

Try to lower your shader clock to below 1600 and see if it works.

gdf

Sentynel
Joined: 13 Feb 10
Posts: 7
Credit: 5,369
RAC: 0
Message 15294 - Posted: 18 Feb 2010 | 12:40:42 UTC - in response to Message 15240.

I've had no errors since switching to the beta drivers, and I've completed both PrimeGrid and GPUGRID units. Given an approximate rate of about 1 error every 12 hours before the driver switch, the chances of having no errors in the 48 hours or so I've been running GPU units since switching drivers are about 2%, so I'm pretty confident this is sorted. Thanks for the help, folks.
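
For anyone wanting to check the 2% figure: the post does not say how it was worked out, but it is consistent with treating errors as a Poisson process, i.e. assuming they occur independently at a constant average rate. Under that assumption, the chance of seeing zero errors in a window is exp(-rate * time). A minimal sketch (the function name is made up for illustration):

```python
import math

def prob_no_errors(rate_per_hour: float, hours: float) -> float:
    """Probability of zero events in `hours`, assuming a Poisson process
    with a constant average rate (an assumption, not something measured)."""
    return math.exp(-rate_per_hour * hours)

# Figures taken from the post: about 1 error per 12 hours, 48 hours of running.
rate = 1 / 12
hours = 48

expected_errors = rate * hours
p = prob_no_errors(rate, hours)

print(f"Expected errors at the old rate: {expected_errors:.1f}")
print(f"Chance of seeing none by luck:   {p:.1%}")
```

At the old error rate you would expect about 4 errors in 48 hours, and the chance of seeing none by luck comes out at just under 2%, matching the estimate above. The error rate itself is only a rough estimate from a handful of failures, so the result is a ballpark figure.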

Michael Goetz
Joined: 2 Mar 09
Posts: 124
Credit: 46,573,744
RAC: 837,894
Message 15295 - Posted: 18 Feb 2010 | 14:35:48 UTC - in response to Message 15294.

Glad to hear your problem seems to be solved!

For future reference, here's another thing you can try when you start getting unexplained errors:

Kill the rabbits! Specifically, the dust bunnies in the computer. Over time, dust accumulates all over, but it's really bad when it starts clogging the cooling fins underneath heatsink fans. The system starts losing a lot of cooling ability when that happens, and unpleasant things follow, including errors.

I prefer to use a vacuum rather than compressed air for this, and to reach under the fan I tape a straw onto the end of the flat vacuum attachment. It works wonders.
