Message boards : Graphics cards (GPUs) : unspecified launch failure

STE\/E
Joined: 18 Sep 08
Posts: 368
Credit: 318,222,298
RAC: 287,491
Message 3821 - Posted: 13 Nov 2008 | 10:53:22 UTC

I get the following error every so often on this box. It's a BFG 8800GT OC running at the speed it had when I bought it ...

Cuda error: Kernel [frc_sum_kernel_dihed] failed in file 'force.cu' in line 252 : unspecified launch failure.

Profile K1atOdessa
Joined: 25 Feb 08
Posts: 249
Credit: 370,320,941
RAC: 0
Message 3824 - Posted: 13 Nov 2008 | 14:35:20 UTC - in response to Message 3821.

Cuda error: Kernel [frc_sum_kernel_dihed] failed in file 'force.cu' in line 252 : unspecified launch failure.



I've hit the same issue on a single task recently, and I had never seen it before. I do have both 8800GTs OC'd some, but I haven't changed that in well over a month, so I wouldn't think it is related. I've since completed a couple of WUs fine, so I just chalked it up to something strange happening at one point in time. If it happens again, I'll have more reason to be concerned.

http://www.gpugrid.net/result.php?resultid=115911

Profile rebirther
Joined: 7 Jul 07
Posts: 53
Credit: 3,048,781
RAC: 0
Message 3870 - Posted: 17 Nov 2008 | 18:32:13 UTC

My first error with this log on 8800GT 1GB:

http://www.ps3grid.net/result.php?resultid=124548

Any solution or info about this error yet?

Profile DoctorNow
Joined: 18 Aug 07
Posts: 83
Credit: 122,995,082
RAC: 0
Message 3871 - Posted: 17 Nov 2008 | 18:37:48 UTC
Last modified: 17 Nov 2008 | 18:40:17 UTC

Just found out that my WU which crashed this morning (shortly before it was finished!) had the same error:

http://www.gpugrid.net/result.php?resultid=122585

My card is a 9600GT.
And it seems it crashed my Windows too! When I came back some hours later, I found that my computer had rebooted into Linux (I have a dual-boot setup and Linux is the default).
____________
Member of BOINC@Heidelberg and ATA!

Profile DoctorNow
Message 3916 - Posted: 21 Nov 2008 | 13:04:25 UTC
Last modified: 21 Nov 2008 | 13:05:30 UTC

Another one killed itself with such a message.

What's wrong with them?
It's getting really annoying; it costs me almost an entire day of crunching every time...
>:-\
____________
Member of BOINC@Heidelberg and ATA!

Profile GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 14 Mar 07
Posts: 1957
Credit: 629,356
RAC: 0
Message 3917 - Posted: 21 Nov 2008 | 14:07:22 UTC - in response to Message 3916.

These are the same WUs as before. Have you updated the drivers? Which drivers do you have?

gdf

Profile DoctorNow
Message 3918 - Posted: 21 Nov 2008 | 14:23:29 UTC
Last modified: 21 Nov 2008 | 14:30:01 UTC

It's driver version 177.84, no change since I started crunching here.
I have no clue what could be wrong, crunched two other WUs right before without any problems:
http://www.gpugrid.net/result.php?resultid=126657
http://www.gpugrid.net/result.php?resultid=125288

Edit:
Just found out on the NVidia page that version 180.48 is now recommended for my card.
I will install and try it out, maybe it fixes the problem...
Will take some days to discover that. ;-)
____________
Member of BOINC@Heidelberg and ATA!

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 3924 - Posted: 22 Nov 2008 | 12:09:51 UTC
Last modified: 22 Nov 2008 | 12:11:30 UTC

2 observations:

1. WUs which give this error generally run fine on other machines

2. All who reported this error in this thread are running (factory) overclocked GPUs.

I think it's worth testing if there's a link between 1. and 2. DrNow, you seem to get the errors most frequently. Could you take the core and shader clock of your card back a bit?

On G92 the core clock can only be adjusted in 27 MHz steps and the shader clock in 54 MHz steps. GPU-Z and other tools do not show you the real clock speed, but RivaTuner's hardware monitor does. So I suggest you either check the clocks with RivaTuner or back off enough to be in a safe range, where you really do change the clocks: say 54 MHz for the core and 108 MHz for the shader. Then let it run for some time, and if the error happens again we know clock speed was not the cause.
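
To illustrate the step behaviour described above, here is a minimal Python sketch. The step sizes (27 MHz core, 54 MHz shader) are taken from the post; the function name `effective_clock` and the round-down rule are assumptions made purely for illustration, since the thread does not say how the hardware actually picks a step.

```python
CORE_STEP_MHZ = 27     # from the post: G92 core clock granularity
SHADER_STEP_MHZ = 54   # from the post: G92 shader clock granularity


def effective_clock(requested_mhz: int, step_mhz: int) -> int:
    """Snap a requested clock to the step grid.

    Assumption: the hardware rounds DOWN to the nearest multiple of the
    step. The real rounding rule is not stated in the thread.
    """
    if step_mhz <= 0:
        raise ValueError("step_mhz must be positive")
    return (requested_mhz // step_mhz) * step_mhz


if __name__ == "__main__":
    # Two different requested shader clocks can land on the same real clock,
    # which is why a tool that only shows the requested value can mislead.
    for requested in (1700, 1680, 1700 - 2 * SHADER_STEP_MHZ):
        real = effective_clock(requested, SHADER_STEP_MHZ)
        print(f"requested {requested} MHz -> effective {real} MHz")
```

Under this assumed model, a small reduction may not change the real clock at all, whereas backing off by two full steps (108 MHz for the shader) always does, which matches the reasoning behind the suggested 54/108 MHz reduction.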

Oh, and it might be a good idea to do a complete restart of your machine before the clock speed experiment. That means switch it off, take the power cord off the power supply for >15 min and switch on again.

BTW: driver 177.84 has been fine before, so I doubt it causes the errors. Could be possible, though, since the application code has changed since the time when most people ran 177.84.

MrS
____________
Scanning for our furry friends since Jan 2002

Profile DoctorNow
Message 3929 - Posted: 22 Nov 2008 | 15:03:45 UTC - in response to Message 3924.
Last modified: 22 Nov 2008 | 15:04:52 UTC

2 observations:

1. WUs which give this error generally run fine on other machines

2. All who reported this error in this thread are running (factory) overclocked GPUs.

I think it's worth testing if there's a link between 1. and 2. DrNow, you seem to get the errors most frequently. Could you take the core and shader clock of your card back a bit?

Well, you could be right.
The first WU after the driver change has run fine so far. I will crunch two or three more WUs to see if the error appears again.
If it does, I will take the shader rate down a bit.

As you may have read in one of the other threads, RivaTuner accidentally lowered my shader rate without my knowledge; the WUs took much longer, but all finished without problems...
____________
Member of BOINC@Heidelberg and ATA!

Profile rebirther
Message 3935 - Posted: 23 Nov 2008 | 9:30:14 UTC
Last modified: 23 Nov 2008 | 9:38:24 UTC

Next one:
http://www.ps3grid.net/result.php?resultid=131339

driver 178.24 WinXP, RivaTuner 2.20

I cannot explain why?! And why does this error only come after losing many hours, rather than at the start? ^^

Before that I ran an older version of RivaTuner to decrease the fan speed; the 8800GT sits at around 57°C, so no problem there. In my first week all WUs were fine, but after 3-4 days I got the same error twice. Something must be wrong somewhere, but where?

Edit:
Checked all results: the error came with 6.3.21; before that, with 6.3.19, all were OK.

If anyone runs >6.4 or <6.3.21, please let me know whether you get the same error or not!

Profile [BOINC@Poland]AiDec
Joined: 2 Sep 08
Posts: 53
Credit: 9,213,937
RAC: 0
Message 3936 - Posted: 23 Nov 2008 | 9:58:55 UTC
Last modified: 23 Nov 2008 | 10:01:07 UTC

As I have read and heard many times, Riva is not the best software and can cause problems with Nvidia graphics cards, especially with the newest GPUs. I had similar problems for as long as I used Riva. I would suggest you use nTune, which can give you a better chance of a `correct` OC. This software keeps my GPUs really stable after a hard OC (3x280GTX 600@702MHz).
____________

Profile rebirther
Message 3937 - Posted: 23 Nov 2008 | 10:19:03 UTC - in response to Message 3936.

As I have read and heard many times Riva is not best software and can make problems with Nvidia graphic cards. Specially with newest GPU. I`ve get similar problems as long as I`ve used Riva. I would like to suggest you to use nTune which can give you bigger chance for `correct` OC. This software makes my GPUs really stable after hard OC (3x280GTX 600@702MHz).


I haven't OC'd my card. I will try nTune next time and uninstall RivaTuner to see if the issue is still present, but I have read that some people on Linux got this error too.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Message 3941 - Posted: 23 Nov 2008 | 12:26:04 UTC

- people got the error before 6.3.21

- Rebirther, your card is factory overclocked (shader at 1.67 GHz instead of 1.50 GHz)

I cannot explain me why?! And after loosing many hours, why this error is not coming on start ^^


It is a temporary error on your machine. That means normally your machine is fine and the WUs are (normally) fine for others. That the error occurs after many hours of crunching tells you that probably something goes wrong during the calculations. It's not a permanent error, it's a "transient" one.

Such errors may be caused by really weird software constellations, bit-flips in the chip due to cosmic rays, hardware design faults which only occur in rare, exceptional situations (e.g. for CPUs several interrupts at the same time etc.) or by a chip which is just borderline to become unstable in the balance between clock frequency, voltage and operating temperature.

Saying "but it was stable for ..." does not really help. It could be that a few transistors are worse than the others (or have degraded more over time) and fail every 10^15 cycles or so, leading to a "mean time between failures" of days.
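
As a rough check of the arithmetic in the paragraph above, the following sketch converts "one failure every 10^15 cycles" into a mean time between failures. The 10^15 figure and the 1.50 / 1.67 GHz shader clocks come from this thread; the function name is made up for the example, and treating every clock cycle as an equal opportunity for failure is a simplifying assumption.

```python
SECONDS_PER_DAY = 86_400


def mean_days_between_failures(cycles_between_failures: float,
                               clock_hz: float) -> float:
    """Convert a failure interval in clock cycles into days at a given clock."""
    if clock_hz <= 0:
        raise ValueError("clock_hz must be positive")
    seconds = cycles_between_failures / clock_hz
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    for clock_ghz in (1.50, 1.67):
        days = mean_days_between_failures(1e15, clock_ghz * 1e9)
        print(f"{clock_ghz:.2f} GHz: roughly {days:.1f} days between failures")
```

At these clocks the result is on the order of a week, which is consistent with a card that finishes several WUs cleanly and then fails one.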

- And I don't think the mere presence of RivaTuner causes these errors. I mean, it's not even running all the time, is it? Also Rebirther's GPU is *old* enough (G92) to be supported properly.

I`ve get similar problems as long as I`ve used Riva.


Which problems do you mean exactly? The "unspecified launch failure"?

I would like to suggest you to use nTune which can give you bigger chance for `correct` OC


Well, RivaTuner and (I think) Everest are the only tools which can show you the real clock of your NV card, all others only show you the clock which you request from the system. The real clock is adjusted in steps. So if you can clock higher using nTune it may be that you're just below the next step, where it would become unstable. The internal clocks would be the same, but the number shown to you would be higher, hence it seems to be a higher OC.

MrS
____________
Scanning for our furry friends since Jan 2002

Profile rebirther
Message 3947 - Posted: 23 Nov 2008 | 15:33:59 UTC - in response to Message 3941.
Last modified: 23 Nov 2008 | 16:17:55 UTC

- people got the error before 6.3.21

- Rebirther, your card is factory overclocked (shader at 1.67 GHz instead of 1.50 GHz)

I cannot explain me why?! And after loosing many hours, why this error is not coming on start ^^


It is a temporary error on your machine. That means normally your machine is fine and the WUs are (normally) fine for others. That the error occurs after many hours of crunching tells you that probably something goes wrong during the calculations. It's not a permanent error, it's a "transient" one.

Such errors may be caused by really weird software constellations, bit-flips in the chip due to cosmic rays, hardware design faults which only occur in rare, exceptional situations (e.g. for CPUs several interrupts at the same time etc.) or by a chip which is just borderline to become unstable in the balance between clock frequency, voltage and operating temperature.

Saying "but it was stable for ..." does not really help. It could be that a few transistors are worse than the others (or have degraded more over time) and fail every 10^15 cycles or so, leading to a "mean time between failures" of days.

- And I don't think the mere presence of RivaTuner causes these errors. I mean, it's not even running all the time, is it? Also Rebirthers GPU is *old* enough (G92) to be supported properly.

I`ve get similar problems as long as I`ve used Riva.


Which problems do you mean exactly? The "unspecified launch failure"?

I would like to suggest you to use nTune which can give you bigger chance for `correct` OC


Well, RivaTuner and (I think) Everest are the only tools which can show you the real clock of your NV card, all others only show you the clock which you request from the system. The real clock is adjusted in steps. So if you can clock higher using nTune it may be that you're just below the next step, where it would become unstable. The internal clocks would be the same, but the number shown to you would be higher, hence it seems to be a higher OC.

MrS


Factory OC, yes, but this is not a problem. You got this error too, as did many others, on newer cards and old ones alike. I don't think this is a hardware failure in all models of cards?! I have asked on the alpha mailing list about this issue to narrow down the error and am still waiting for an answer. So is it the hardware, the BOINC client or the project application? Drivers and other programs can be excluded.

The GPU waiting for the CPU could be an issue, so that the WU aborts itself with this error because it cannot continue crunching from the last point.

Update:
Thanks to Nicolas for pointing out that it's not the BOINC client: app 6.48 had a 0% error rate, 6.52 a 20% error rate.

@GDF: can you check the application code to find out what's wrong? Or can you switch back to the old app?

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Message 3973 - Posted: 23 Nov 2008 | 23:25:13 UTC - in response to Message 3947.

Factory oc, yes, but this is not a problem


How can you be sure? Hardware errors can pop up quite rarely. These are actually the hardest to detect, because you can never be sure if
(i) your test software can reproduce the error at all and
(ii) you tested long enough.

you got also this error as many others too


Yeah, I also noticed this one yesterday.. and guess what, I'm also running OC'ed.

I dont think this is a hardware failure in all models of cards?!


Not every OC'ed card produces these errors, does it?

I have asked on alpha mailing list about this issue to limit the error, still waiting for an answer, so is it the hardware, boinc client or the project application? Drivers and other programs can be excluded.


I agree that we can exclude drivers and other programs. However, I'd also suspect that the BOINC client has absolutely nothing to do with this. It just launches the aecmd_.exe and all further CUDA related launches are done by the science app.

The GPU waiting for CPU could be an issue so the WU abort by itself with this error because it can not crunch furthermore from the last point.


Sounds somewhat improbable. The GPU cannot talk to BOINC, so if the CPU app stops working then "no one" would tell BOINC that an error happened. BOINC would likely detect after a short time that the app has quit and restart it. This is the point where some trouble may be caused, when the GPU / driver is in a strange state because the CUDA app was not terminated properly. Is this just a guess on your side or do you have anything hinting at such a scenario?

Update:
thx to nicolas to pointed out its not the boinc client, app 6.48 with 0% error rate, 6.52 with 20% error rate.

@GDF: can you check the application code to find out whats wrong? Or can you switch back to the old app?


- Where do you get that 20% error rate from?
- I also had another one of these "unspecified launch failure" errors - with app 6.45.
- Switching back to the old app is probably not feasible, since there were changes in the science code.
- Oh, and who's Nicolas?

MrS
____________
Scanning for our furry friends since Jan 2002

Profile rebirther
Message 3977 - Posted: 23 Nov 2008 | 23:42:30 UTC - in response to Message 3973.



- Where do you get that 20% error rate from?
- I also had another one of these "unspecified launch failure" errors - with app 6.45.
- Switching back to the old app is probably not feasible, since there were changes in the science code.
- Oh, and who's Nicolas?

MrS


- 20% is my error rate estimated from last calculation
- Nicolas Alvarez, also a developer of BOINC/Primegrid/IMP/Renderfarm
- we must sort out what was changed in code and causes this error
- cannot find any scenario yet (removed rivatuner, installed ntune), will see what happens... (2 cores running vmware with ubuntu linux 64bit + ABC, other 2 cores BOINC in windows with GPU + Milkyway, RCN, yoyo evo)

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Message 3981 - Posted: 23 Nov 2008 | 23:52:20 UTC

Yeah, let's get some new hard facts. But by saying

- we must sort out what was changed in code and causes this error


you imply that you already know it's the science app's fault. We cannot know that yet. I think it's not the app, because these errors happen with different clients and the WUs run fine on other machines.

.. gotta go to bed for today ;)

MrS
____________
Scanning for our furry friends since Jan 2002

Profile DoctorNow
Message 4022 - Posted: 25 Nov 2008 | 6:46:32 UTC - in response to Message 3929.

Well, you could be right.
First WU after the driver change did run fine so far. I will crunch two, three other WUs to see if the error appears again.
If yes, I will take the shader rate down a bit.

Well, after having finished three WUs without a problem (see here, here and here) now I have the error again with this WU, fortunately very early during the crunching.

Looking at my host list, it seems the error keeps recurring and is not caused by anything special.

Okay, I will reduce my shader clock now to see if it breaks the rule then. ;-)
____________
Member of BOINC@Heidelberg and ATA!

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Message 4024 - Posted: 25 Nov 2008 | 9:04:39 UTC

Well, the period of successful WUs between failures is anything between 2 and 6.. I'd rather call that a guideline ;)

MrS
____________
Scanning for our furry friends since Jan 2002

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Message 4080 - Posted: 29 Nov 2008 | 13:57:46 UTC

I had another one, luckily in the beginning of the WU. I scaled back the OC and will see what I get.

MrS
____________
Scanning for our furry friends since Jan 2002

Scott Brown
Joined: 21 Oct 08
Posts: 144
Credit: 2,973,555
RAC: 0
Message 4108 - Posted: 1 Dec 2008 | 13:14:56 UTC

Well, looks like I just had the first of these... http://www.gpugrid.net/result.php?resultid=140145

9600GSO (G92) with factory OC, WinXP Pro 32-bit, 6.3.21 default install. Only one so far from numerous workunits. Only change in the machine was my participation in the PrimeGrid challenge with the 2 CPU cores on this machine. Since the LLR app. from PrimeGrid is extremely CPU intensive, I wonder if this could have caused some issues?

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Message 4119 - Posted: 1 Dec 2008 | 22:03:01 UTC
Last modified: 1 Dec 2008 | 22:04:07 UTC

I highly doubt it, and I can tell you for sure that I'm not running PrimeGrid (I run QMC & Milkyway).

MrS
____________
Scanning for our furry friends since Jan 2002

Profile DoctorNow
Message 4128 - Posted: 2 Dec 2008 | 15:39:15 UTC
Last modified: 2 Dec 2008 | 15:40:14 UTC

Well, this time it took a little longer until it appeared again for me (to be precise, there were 4 successful WUs in between), but now I have one again.
All the time they ran at 1700 MHz, with some short breaks in between when I wanted to game.
I guess I'll try taking it down by another 100 MHz, but more than that doesn't make much sense in my eyes; I want to complete at least 1 WU per day. ;-)

Anyway, from the regularity with which it appears, I now have the feeling that it's a small hardware problem with my GPU, but nothing really spectacular, and not worth worrying about unless it gets worse.

Btw:
I'm happy that you guys don't purge the database and every result is still viewable. :-)
That gives a pretty good overview of the work done.
____________
Member of BOINC@Heidelberg and ATA!

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Message 4145 - Posted: 4 Dec 2008 | 20:48:30 UTC

Yes, 100 MHz should do. I'd also take the core down a bit (like I probably wrote somewhere above) .. approximately 50 MHz.

MrS
____________
Scanning for our furry friends since Jan 2002

Profile rebirther
Message 4248 - Posted: 11 Dec 2008 | 11:32:30 UTC

After a while and some finished WUs with RivaTuner uninstalled and nTune + BOINC 6.4.2 installed, I have not seen any more errors so far. So RivaTuner could be a possible cause of the error.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Message 4436 - Posted: 17 Dec 2008 | 19:38:21 UTC - in response to Message 4248.

No, I still have RivaTuner installed and got the error. No other ones since I clocked down, but then they were not frequent to begin with (2 in a couple of months).

MrS
____________
Scanning for our furry friends since Jan 2002

Profile Wassertropfen
Joined: 14 Aug 08
Posts: 15
Credit: 13,774,919
RAC: 0
Message 4443 - Posted: 17 Dec 2008 | 22:29:13 UTC

Do any of you also crunch the SETI@home beta CUDA app?

My system crashed. The screen was covered with white and black dots. Then the GPUGrid WUs crashed. :( 16 WUs dead. No more WUs for 24h. :(

Now I know: it's either SETI or GPUGrid.
____________
Constant dripping wears away the stone. :)

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Message 4995 - Posted: 28 Dec 2008 | 12:57:48 UTC

DrNow,

you're still getting many errors and it does not seem to matter much if you run 1.8 or 1.7 GHz shader clock. Did you also downclock core and memory for this test? What's the temperature of your chip while crunching?

And an interesting side note: the error happens at different lines in the force.cu file, so I really don't think it's a software issue.
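
For anyone who wants to compare failure locations across results, here is a minimal sketch that extracts kernel, file and line from error messages and tallies them per location. The message format is the one quoted in the first post of this thread; the function names and the idea of feeding in a list of stderr lines are assumptions for illustration, and the pattern may need adjusting if other app versions word the message differently.

```python
import re
from collections import Counter

# Pattern modelled on the single message quoted in this thread, e.g.
# Cuda error: Kernel [name] failed in file 'force.cu' in line 252 : reason.
ERROR_RE = re.compile(
    r"Cuda error: Kernel \[(?P<kernel>[^\]]+)\] failed in file "
    r"'(?P<file>[^']+)' in line (?P<line>\d+) : (?P<reason>.+?)\.?$",
    re.MULTILINE,
)


def parse_cuda_error(text):
    """Return (kernel, file, line, reason) or None if no error line is found."""
    match = ERROR_RE.search(text)
    if match is None:
        return None
    return (match["kernel"], match["file"], int(match["line"]), match["reason"])


def tally_locations(messages):
    """Count how often each (file, line) pair appears in a list of messages."""
    counts = Counter()
    for message in messages:
        parsed = parse_cuda_error(message)
        if parsed is not None:
            _kernel, file_name, line_no, _reason = parsed
            counts[(file_name, line_no)] += 1
    return counts


if __name__ == "__main__":
    sample = ("Cuda error: Kernel [frc_sum_kernel_dihed] failed in file "
              "'force.cu' in line 252 : unspecified launch failure.")
    print(parse_cuda_error(sample))
    print(tally_locations([sample]))
```

If the tally spreads over many different lines rather than clustering on one, that supports the reading above that a single faulty code path is unlikely to be the cause.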

MrS
____________
Scanning for our furry friends since Jan 2002

Profile DoctorNow
Message 5019 - Posted: 28 Dec 2008 | 17:27:52 UTC - in response to Message 4995.
Last modified: 28 Dec 2008 | 17:30:08 UTC

you're still getting many errors and it does not seem to matter much if you run 1.8 or 1.7 GHz shader clock. Did you also downclock core and memory for this test? What's the temperature of your chip while crunching?

And an interesting side note: the error happens at different lines in the force.cu file, so I really don't think it's a software issue.

Hi ETA.

Well, as I said last time, I also think it may be a hardware problem with my graphics card, but I have to stick with it for a while. I can't buy a new, bigger one as my current case isn't suitable for that (the power supply is directly behind the card!).

To the WUs:
The last "unspecified launch failure" is from 12-22-2008.
I don't know why this one later failed, it's no "ULF" as you can see.
And this from 23rd has another failure message, obviously a WU-error.

Besides that, I didn't have much time over the Christmas days to continue the tests. Strangely enough, the shader clock went back to 1.8 GHz some days ago without any action on my side. Maybe switching from Windows to Linux and back did that without my knowledge.
(Under Linux I haven't configured BOINC for CUDA yet, but I guess I will try it out in the next few days, as openSUSE 11 is now better supported)

To your questions:
I didn't change GPU or memory clock during my tests. And the temperature while crunching is 70° to 75°C, a good value I think. Previous app versions took my 9600GT up to 90 and 100 degrees.
____________
Member of BOINC@Heidelberg and ATA!

Profile Kokomiko
Joined: 18 Jul 08
Posts: 190
Credit: 24,093,690
RAC: 0
Message 5021 - Posted: 28 Dec 2008 | 18:02:43 UTC - in response to Message 5019.


And the temp while crunching lies at 70° to 75°C, a good value I think. Previous app versions did take my 9600GT up to 90 and 100 degrees.


Alas :(, the new application doesn't use the full power of the graphics cards on a Windows system. I had to go back to 3+1 to keep my card from running much longer (up to 30%) than before.

GPUGrid needs a lot of computing power, so let's not make the rules on the basis of the small graphics cards. There are too many big cars on this speedway to make the rules on the basis of an infant's tricycle ...

____________

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Message 5024 - Posted: 28 Dec 2008 | 22:28:28 UTC - in response to Message 5019.

I didn't change GPU or memory clock during my tests.


You'd have to do that as well to get some meaningful results.

And the temp while crunching lies at 70° to 75°C, a good value I think. Previous app versions did take my 9600GT up to 90 and 100 degrees.


Yes, 70 - 75°C has to be fine. I could imagine one getting errors at 90 - 100°C if the chip is not very good, but at 70°C the card should be rather tolerant to higher clock speeds.

If I were you I'd lower shader/core/mem clock by 100/50/50 MHz and let it run for at least 10 WUs. If you don't get errors during this time we could be getting somewhere.
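
Why "at least 10 WUs"? A quick sketch of the odds, under loudly stated assumptions: each WU fails independently with a fixed probability, and the 20% per-WU figure is simply the estimate one poster gave for his own host earlier in this thread, used here only as an illustrative value. The function name is made up for the example.

```python
def prob_no_failures(p_fail_per_wu: float, n_wus: int) -> float:
    """Probability of n consecutive clean WUs if nothing was actually fixed.

    Assumes independent WUs with a constant failure probability, which is
    a simplification of whatever really drives these errors.
    """
    if not 0.0 <= p_fail_per_wu <= 1.0:
        raise ValueError("p_fail_per_wu must be between 0 and 1")
    if n_wus < 0:
        raise ValueError("n_wus must be non-negative")
    return (1.0 - p_fail_per_wu) ** n_wus


if __name__ == "__main__":
    for n in (3, 7, 10, 20):
        chance = prob_no_failures(0.20, n)
        print(f"{n:2d} clean WUs by luck alone: {chance:.1%}")
```

With a 20% failure rate, three clean WUs in a row would happen by chance about half the time, while ten in a row would happen only about one time in nine. So ten clean WUs is decent evidence that something changed, though still not proof.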

MrS
____________
Scanning for our furry friends since Jan 2002

Profile DoctorNow
Message 5030 - Posted: 29 Dec 2008 | 6:39:13 UTC - in response to Message 5024.

If I were you I'd lower shader/core/mem clock by 100/50/50 MHz and let it run for at least 10 WUs. If you don't get errors during this time we could be getting somewhere.

Okay, with the next WU I will start another experiment and try that out.
____________
Member of BOINC@Heidelberg and ATA!

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Message 5035 - Posted: 29 Dec 2008 | 10:20:55 UTC

Nice sig-pic, btw ;)

MrS
____________
Scanning for our furry friends since Jan 2002

Profile DoctorNow
Message 5052 - Posted: 29 Dec 2008 | 18:47:17 UTC - in response to Message 5035.

Thanx, ETA. :-)
Thought there was a change necessary because of the coming events. ;-)

And I've just finished my newest WU.
I've now lowered core/shader/mem as you said and downloaded another WU. Let's see how far it gets me...
____________
Member of BOINC@Heidelberg and ATA!

Profile DoctorNow
Message 5194 - Posted: 3 Jan 2009 | 14:28:38 UTC

Well, still getting the ULF, no difference from my first test without adjusting memory and GPU clock.
Fortunately it happened very early in this WU.
I'm going down another 100/50/50 now...
____________
Member of BOINC@Heidelberg and ATA!

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Message 5216 - Posted: 3 Jan 2009 | 19:37:15 UTC

Mhh, it seems like the time between failures got a bit longer.. but it's too uncertain. Well, certainly not the clear-cut situation I had hoped for.

MrS
____________
Scanning for our furry friends since Jan 2002

Profile DoctorNow
Message 5707 - Posted: 17 Jan 2009 | 7:29:32 UTC

Wow, already 14 days ago... :-)
Looks like I found the right settings now for my 9600GT, didn't get the error again in the last 7 WUs.
Hope this thread now really can wander into the archive. ;-)
____________
Member of BOINC@Heidelberg and ATA!

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Message 5714 - Posted: 17 Jan 2009 | 10:41:11 UTC - in response to Message 5707.

So it seems that too high a clock speed really is the cause of (some of) the unspecified launch failures. Let your card run a bit longer with these settings. Eventually you should increase the clocks to the initial values again and see if the errors return, as a double check. But don't hurry with that.

MrS
____________
Scanning for our furry friends since Jan 2002

MarkJ
Volunteer moderator
Volunteer tester
Joined: 24 Dec 08
Posts: 738
Credit: 200,909,904
RAC: 0
Level
Leu
Message 5718 - Posted: 17 Jan 2009 | 12:07:59 UTC
Last modified: 17 Jan 2009 | 12:11:35 UTC

I got this one. Neither the machine nor the graphics card is overclocked. It popped up the "application error has occurred" dialogue box asking if I wanted to send a report to Microsoft, as if they'd know what to do with it.

Running under XP, card hasn't had any errors previously. All my other errors are ones I've aborted (seeing as GPUgrid gives out too much work with such short deadlines). Driver is 180.48.

After that I shut down BOINC, but being 6.5.0 it didn't terminate the science apps. I then rebooted the machine, which produced a flurry of other compute errors for all the projects it was crunching when it shut down (Einstein, Seti and GPUgrid).

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Message 5722 - Posted: 17 Jan 2009 | 13:36:42 UTC - in response to Message 5718.

In your case the WU failed with an unspecified launch failure after 1.3s and at a quite different file / line than in most other cases. I'd tend to say it's a similar symptom for a different cause.
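For anyone who wants to compare the file / line of their own failures against others: a minimal Python sketch, assuming the task's stderr output uses the same error line format quoted at the top of this thread. The function name parse_cuda_error and the pattern are only an illustration, not part of the GPUGRID application.

```python
import re

# Matches the error line format quoted earlier in this thread, e.g.
# Cuda error: Kernel [frc_sum_kernel_dihed] failed in file 'force.cu' in line 252 : unspecified launch failure.
ULF_PATTERN = re.compile(
    r"Cuda error: Kernel \[(?P<kernel>[^\]]+)\] failed in file "
    r"'(?P<file>[^']+)' in line (?P<line>\d+) : (?P<reason>.+?)\.?\s*$"
)


def parse_cuda_error(text):
    """Return (kernel, file, line, reason) for the first matching line, or None."""
    for raw in text.splitlines():
        match = ULF_PATTERN.search(raw)
        if match:
            return (
                match.group("kernel"),
                match.group("file"),
                int(match.group("line")),
                match.group("reason"),
            )
    return None
```

Pasting the stderr text of two failed results through this makes it easy to see whether they died in the same kernel and line or somewhere quite different.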

And regarding "After that I shut down BOINC, but being 6.5.0 it didn't terminate the science apps." .. if I remember correctly there has not been any feedback in the thread where this was discussed. If I remember correctly, I asked which BOINC version people who experienced this problem were running, and posted that I don't see this behaviour with 6.5.0. So I'm not convinced 6.5.0 is to blame for your current problems ;)
Just to make sure.. did you just shut down the BOINC manager or did you choose "advanced/shut down connected client"?

MrS
____________
Scanning for our furry friends since Jan 2002

Lazydude2
Joined: 24 Dec 08
Posts: 3
Credit: 519,408
RAC: 0
Level
Gly
Message 5724 - Posted: 17 Jan 2009 | 14:21:37 UTC - in response to Message 5722.


And regarding "After that I shut down BOINC, but being 6.5.0 it didn't terminate the science apps." .. if I remember correctly there has not been any feedback in the thread where this was discussed. If I remember correctly, I asked which BOINC version people who experienced this problem were running, and posted that I don't see this behaviour with 6.5.0. So I'm not convinced 6.5.0 is to blame for your current problems ;)
Just to make sure.. did you just shut down the BOINC manager or did you choose "advanced/shut down connected client"?

MrS


I'm running BOINC 6.5.0 on Vista x64 and I have the same issue:
If I shut down BOINC Manager, either from the icon or via File/Exit,
it does not stop the apps (GPUgrid, Seti or Prime) that I'm running atm.
The window shuts down but is still active in TaskMan (running 100% on one core).
Must stop it manually in TaskMan.
They stop only through "advanced/shut down connected client".
It's the same behavior as it was with one of 6.2.16-18.

So a reboot may crash the apps that are not properly shut down.


BTW I have not got 6.6.0 to work for me at all. It starts but can't connect/start the apps.
B. Regards
Lazy

MarkJ
Volunteer moderator
Volunteer tester
Joined: 24 Dec 08
Posts: 738
Credit: 200,909,904
RAC: 0
Level
Leu
Message 5725 - Posted: 17 Jan 2009 | 14:31:28 UTC - in response to Message 5722.

In your case the WU failed with an unspecified launch failure after 1.3s and at a quite different file / line than in most other cases. I'd tend to say it's a similar symptom for a different cause.

And regarding "After that I shut down BOINC, but being 6.5.0 it didn't terminate the science apps." .. if I remember correctly there has not been any feedback in the thread where this was discussed. If I remember correctly, I asked which BOINC version people who experienced this problem were running, and posted that I don't see this behaviour with 6.5.0. So I'm not convinced 6.5.0 is to blame for your current problems ;)
Just to make sure.. did you just shut down the BOINC manager or did you choose "advanced/shut down connected client"?

MrS


I was exiting BOINC and have the 'terminate science applications' check-box ticked, so it's supposed to work, but then it is a development version. I logged this issue in Trac a couple of weeks ago, with some pretty screenshots for the BOINC developers.

I do not blame it for the launch failure - just the mess when I shut it down.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level
Met
Message 5726 - Posted: 17 Jan 2009 | 14:31:53 UTC - in response to Message 5724.

Must stop it manually in TaskMan.


No, because, just as you said yourself:

They stop only through "advanced/shut down connected client"


This is an intentional change, decoupling BOINC manager and client. It's been in BOINC for some time. Whether it's a good choice or not is a different question.

MrS
____________
Scanning for our furry friends since Jan 2002

MarkJ
Volunteer moderator
Volunteer tester
Joined: 24 Dec 08
Posts: 738
Credit: 200,909,904
RAC: 0
Level
Leu
Message 5727 - Posted: 17 Jan 2009 | 14:47:25 UTC - in response to Message 5724.
Last modified: 17 Jan 2009 | 14:51:38 UTC

I'm running BOINC 6.5.0 on Vista x64 and I have the same issue:
If I shut down BOINC Manager, either from the icon or via File/Exit,
it does not stop the apps (GPUgrid, Seti or Prime) that I'm running atm.
The window shuts down but is still active in TaskMan (running 100% on one core).
Must stop it manually in TaskMan.
They stop only through "advanced/shut down connected client".
It's the same behavior as it was with one of 6.2.16-18.

So a reboot may crash the apps that are not properly shut down.


Try Advanced -> Shutdown running science applications. Wait a few seconds and see what happens. In my case it starts them all up again! Best if you also have Task Manager open at the same time so you can see if it does terminate them or not.

BTW I have not got 6.6.0 to work for me at all. It starts but can't connect/start the apps.
B. Regards
Lazy


I haven't tried 6.6.0 yet. Is it only GPUgrid or does the "can't connect/start" issue also affect other projects?

The BOINC developers have said they will be releasing 6.6.2 fairly soon with the changes to the work-fetch logic for (supposedly) better gpu work fetching.
____________
BOINC blog

Lazydude2
Joined: 24 Dec 08
Posts: 3
Credit: 519,408
RAC: 0
Level
Gly
Message 5730 - Posted: 17 Jan 2009 | 15:42:34 UTC - in response to Message 5727.


Try Advanced -> Shutdown running science applications. Wait a few seconds and see what happens. In my case it starts them all up again! Best if you also have Task Manager open at the same time so you can see if it does terminate them or not.

It opens up a window:
Yes - this happens when I press "OK"
No - it does not happen when I press "Cancel"


I haven't tried 6.6.0 yet. Is it only GPUgrid or does the "can't connect/start" issue also affect other projects?

Yes, it happens with all projects I tried.


The window shuts down but is still active in TaskMan (running 100% on one core).
Must stop it manually in TaskMan.


Sorry - I was not clear on this:

When I have successfully shut down the apps (Advanced -> Shutdown running science applications) and I then want to shut down BOINC Manager, the window disappears and the icon in the notification area disappears, but it is still active in Task Manager.

So the only way I know to stop BOINC Manager is to kill BM in TaskMan.



Maybe this is off topic here, but ...

Nice Crunching
Lazy
