Message boards : Number crunching : NOELIA tasks - when suspended or exited, often crash drivers
Devs,
ID: 29318
Thanks.
ID: 29319
"... GTX 660 Ti (usually runs 2 GPUGrid tasks)"
How do you get a GTX 660 Ti to run TWO GPUGrid tasks??
ID: 29629
"How do you get a GTX 660 TI to run TWO GPUGrid tasks??"
You use an app_config.xml file. I'd recommend doing plenty of research beforehand, though, using the following links:
http://boinc.berkeley.edu/wiki/Client_configuration#Application_configuration
http://www.gpugrid.net/forum_thread.php?id=3319
http://www.gpugrid.net/forum_thread.php?id=3331
And if you happen to notice any tasks completing immediately while still granting credit, which is a bug we're still tracking down, then please discontinue use of the app_config.xml file and post your results/info here:
http://www.gpugrid.net/forum_thread.php?id=3332
Regards,
Jacob
ID: 29630
"How do you get a GTX 660 TI to run TWO GPUGrid tasks??"
Blimey!! I'm on the case!!!
Tom
ID: 29631
It often crashes the NVIDIA driver and leads to Computation Errors on tasks running across all GPUs, causing me to lose work, even from other projects.
ID: 29676
GPUGrid.net Devs:
ID: 30033
I have this exact same problem. My specs are
ID: 30035
Today I replaced my GTX 460 with a GTX 660. My first WU is a Noelia, which looks like it will complete in 12 hours; 25% done in three hours. Much better!
ID: 30290
"Today I replaced my GTX 460 with a GTX 660. My first WU is a Noelia, which looks like it will complete in 12 hours; 25% done in three hours. Much better!"
You didn't say what you have in your app_config, just "posted here". I don't see it in this thread at least. Apparently you're trying to run 2 WUs concurrently. If so, they won't make the 24 hour deadline. The new NATHANs are even longer. Are you trying to increase your credit? Even if they run without problems, you will end up with lower credit than running 1x on a GTX 660.
ID: 30292
"You didn't say what you have in your app config, just "posted here". I don't see it in this thread at least."
Thank you for responding. You're right. In this thread there is only a pointer to another thread. Sorry for the confusion.
"Apparently you're trying to run 2 WUs concurrently. If so, they won't make the 24 hour deadline. The new NATHANS are even longer. Are you trying to increase your credit? Even if they run without problem, you will end up with lower credit than running 1X on a GTX 660."
Ah! That's not what I had understood: that 50% + 50% = 100%, but no bonuses... I just wonder why there has been so much kerfuffle here over a 'feature' (2x) that benefits no-one. Whatever, if only for the challenge I'd like to give 2x a try. Can you tell me why it does not work with the .xml file below? Thanks.

<app_config>
  <app>
    <name>acemdlong</name>
    <max_concurrent>9999</max_concurrent>
    <gpu_versions>
      <gpu_usage>1</gpu_usage>
      <cpu_usage>0.001</cpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>acemd2</name>
    <max_concurrent>9999</max_concurrent>
    <gpu_versions>
      <gpu_usage>1</gpu_usage>
      <cpu_usage>0.001</cpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>acemdshort</name>
    <max_concurrent>9999</max_concurrent>
    <gpu_versions>
      <gpu_usage>1</gpu_usage>
      <cpu_usage>0.001</cpu_usage>
    </gpu_versions>
  </app>
</app_config>

ID: 30299
"Apparently you're trying to run 2 WUs concurrently. If so, they won't make the 24 hour deadline. The new NATHANS are even longer. Are you trying to increase your credit? Even if they run without problem, you will end up with lower credit than running 1X on a GTX 660."
Jacob was running 2x on his 660 Ti with 3GB on the MUCH shorter NATHAN WUs that are now unfortunately gone. If your GPU won't make the 24hr deadline (including DL, UL & reporting time), then you will miss the 24hr bonus and your credit will take a significant hit. That's even if everything runs optimally: errors are likely to be more frequent. Running 1x should be better for the project too, as the time from WU generation to WU completion will most likely be less, an issue here.
ID: 30302
tomba,
The <gpu_usage>1</gpu_usage> statement tells BOINC what fraction of a GPU to use for each task. It is currently set to use 1 full GPU per task, so only 1 task will run on each GPU at a time. If you set it to <gpu_usage>0.5</gpu_usage>, that tells BOINC to use half of a GPU for each task, which allows 2 tasks to run on each GPU. Be sure to post your test results so we can see if it helped.
ID: 30304
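Applied to tomba's file above, the suggested change would read like this for the acemdlong entry (a sketch only; the same <gpu_usage> edit goes into the acemd2 and acemdshort entries, and BOINC only rereads app_config.xml on a client restart or via "Read config files"):

```xml
<app_config>
  <app>
    <name>acemdlong</name>
    <max_concurrent>9999</max_concurrent>
    <gpu_versions>
      <!-- half a GPU per task: two tasks share one GPU -->
      <gpu_usage>0.5</gpu_usage>
      <cpu_usage>0.001</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
```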
"tomba,"
Wow! That works! Thank you!! I'm now running two Noelias, and below are the pre- and post-2x results from my GPU Monitor gadget. I will certainly report back on results! Many thanks.
ID: 30305
"I will certainly report back on results!"
It's very early days but, with both running Noelia WUs, the "Remaining (estimated)" time is counting down much faster than one second per second...
ID: 30307
tomba,
ID: 30310
Devs,
ID: 30375
I will forward it. From a quick forum search it seems to be W7/W8 and driver related. So there might not be much we can do. Are you certain it only happens with Noelias and no other WUs?
ID: 30405
"I will forward it. From a quick forum search it seems to be W7/W8 and driver related. So there might not be much we can do. Are you certain it only happens with Noelias and no other WUs?"
The issue happens sometimes when GPU tasks are suspended, which means it will hopefully be easy for you guys to reproduce. I believe I've only seen the problem on NOELIA tasks. For reference, I'm using Windows 8 x64, with the new v320.18 WHQL drivers. It should be a matter of letting the task run for some time (15 seconds), then suspending it... then just keep doing that several times, and hopefully you'll see the problem after a few tries. I'd be curious to know if you (or anyone at GPUGrid) can reproduce it?
Thanks,
Jacob
ID: 30406
Guys,
ID: 30418
"Guys,"
Hello Matt,
The crash is on suspend. I've seen it happen when:
- I click "Activity -> Suspend GPU"
- I right-click the tray to choose "Snooze GPU"
- I manually suspend the task by clicking the task "suspend" button in BOINC
- BOINC suspends work because I start an app that is configured as an <exclusive_app> in my cc_config.xml file.
I do use the "Leave applications in memory while suspended" setting, so I never lose my CPU tasks' work, and I don't believe that option affects GPU tasks. However, next time I get a NOELIA task, I will try testing with that option off. Have you been able to reproduce the issue?
ID: 30419
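For reference, the <exclusive_app> option mentioned above lives in the options section of cc_config.xml; a minimal sketch (the executable name is just an example):

```xml
<cc_config>
  <options>
    <!-- suspend BOINC computing whenever this program is running -->
    <exclusive_app>somegame.exe</exclusive_app>
  </options>
</cc_config>
```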
Starting to get a few funky NOELIA tasks as well. GTX 580 + GTX 670, win7 x64,
ID: 30439
JugNut,
ID: 30444
For me the driver restart happens every time with the current Noelias (or rather: I have not observed it not happening) but not with other WUs. Win 8, drivers 320.18 and 314.22 (the last 2 WHQLs), "leave apps in memory" active (though I read it doesn't apply to GPUs, as it would be far too risky to leave something dirty or to run out of memory), and a GTX 660 Ti (Kepler).
ID: 30497
"For me driver restart happens everytime with the current Noelias (or better: I have not observed it not happening) but not with other WUs."
Try a test with the watchdog disabled. Seems to be working for me. I'm also running XP, but I don't know if that has anything to do with it.
ID: 30516
Driver restart capability was introduced to mainstream desktops with the release of Vista, through the Windows Display Driver Model (WDDM). Driver restarting is unlikely to be an issue in XP, as the display driver architecture was very different.
ID: 30519
To answer your questions from earlier, Jacob: we have not been able to reproduce the error, unfortunately. We only have one box running right now testing Windows 7, and we have not received a NOELIA, since those tasks are now dwindling in number. We of course don't doubt that it is real, considering how many people have confirmed it; we just haven't been able to troubleshoot it yet. Even so, if it is being caused by the driver watchdog or some other Windows bug, we might not be able to do much about it. It will be interesting to see if it still occurs with the watchdog disabled. That should tell us a lot.
ID: 30522
Thank you for responding, Nate.
ID: 30523
Yes, we do test the WUs, but unfortunately (at the moment) we test them locally on our Linux machines, not running GPUGrid. So we don't catch such problems.
ID: 30524
I'd also suggest setting up a test box with several OSes. Of course you can't test everything in advance, but if problems are reported under specific configurations you could react more quickly by just booting into an affected OS.
ID: 30549
NOELIA tasks are still very much an issue for me.
ID: 31557
Do not suspend, don't exit BOINC, haha. ;)
ID: 31672
Since I cannot trust any GPUGrid.net units to shut down gracefully anymore, here is my workaround:
ID: 31849
Unfortunately no :( There is really no time right now to focus on this. I understand it is quite a big problem and we are aware of it.
ID: 31862
"This is VERY UNFORTUNATE that I have to do this tedious workaround any time I want to use my GPU."
It's a Windows problem, not a GPUGrid problem.
ID: 31863
Hm, good to know.
ID: 31864
"Hm, good to know."
It can be a problem with any GPU project and many games:
http://msdn.microsoft.com/en-us/library/windows/hardware/ff553893%28v=vs.85%29.aspx
I've posted a fix that worked for me in a couple of different places on this site, but I'll post it here again for anyone to try. My fix goes a little farther than the Windows suggestion, and it works for me. I can suspend and restart tasks, reboot the computer with tasks running, even do a hard shutdown and restart. The tasks always restart from where they were, with no errors or driver timeouts/restarts. YMMV, but this has worked well for me on ATI and Nvidia cards.
Copy and paste the entire code below (including the "Windows Registry Editor Version 5.00" part) into Notepad. Name it "timeout fix.reg", or something else if you'd like, as long as it ends with the .reg extension. After saving it, right-click on it and open it with Registry Editor. You'll get warnings about editing the registry; just click Yes and the entries will be added to your registry. Reboot and you should be good to go. This should stop the "driver has stopped responding" messages, and the errors to the WUs when the driver restarts. It will not affect anything else in the registry if it doesn't work.

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Watchdog]
"DisableBugCheck"="1"

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Watchdog\Display]
"EaRecovery"="0"

ID: 31872
It's a kind-of-a-Windows-problem which gets triggered by Noelia tasks; certainly not by the "good old" trouble-free Nathans, and I don't think by the Santi SRs either, which I'm running now.
ID: 31883
Interesting: TJ is reporting no driver reset problems with the current Noelias over there.
ID: 31920
We have turned down the priority on Noelia tasks. You should see fewer and fewer until she gets back.
ID: 31978
I greatly appreciate the stability my machine has had over the past couple weeks, due to not suspending any NOELIA tasks.
ID: 32317
Did you try nanoprobe's suggested fix?
ID: 32327
His suggested fix is to disable TDR, which I use for games and for other GPU applications. I rely on it. So, no, I didn't try it.
ID: 32329
Even with a 20-second registry-configured delay, Noelia WUs still trigger a driver restart when suspended, when changing apps, on GPU/BOINC snooze, or when closing BOINC.
ID: 32334
Strange, that's never happened to me and I've suspended them dozens of times.
ID: 32340
"His suggested fix is to disable TDR, which I use for games and for other GPU applications. I rely on it. So, no, I didn't try it."
Same here: the watchdog saves me from real GPU errors often enough that I don't want to disable it.
MrS
____________
Scanning for our furry friends since Jan 2002
ID: 32408
I don't disable it either; I use a 20-second delay (but I don't game). I've had numerous experiences where the mouse arrow freezes for a few seconds and then everything is as it was (without a driver restart and without WUs crashing). Prior to using it I had numerous crash-the-driver experiences!
ID: 32416
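For anyone who wants the delay rather than disabling TDR outright, a .reg sketch based on Microsoft's TDR registry-key documentation (linked earlier in this thread) would look like the following; the dword value is hexadecimal, so 0x14 sets a 20-second timeout, and a reboot is required:

```reg
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
; TdrDelay: seconds the GPU may remain unresponsive before the
; watchdog resets the driver (the default is 2)
"TdrDelay"=dword:00000014
```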
The next beta will have additional critical section locking that will hopefully mitigate this problem.
ID: 32458
Thank you a million times over for setting aside some time to solve this.
ID: 32463
Try out 8.02. Give it a damn good suspending.
ID: 32474
"Try out 8.02. Give it a damn good suspending."
Awesome - initial testing looks very promising! I cannot immediately make it crash. I will do more testing (especially with the exclusive-app logic that suspends tasks) later tonight.
Edit: I may have been able to make it still crash. Will test more later. What did you change/fix? I'm a developer, and am very curious about what the change was. Also, is it a change that could improve exit logic for non-NOELIA tasks?
ID: 32478
The problem stems from BOINC killing off the process while a GPU operation is underway. The fix is to add BOINC critical-section assertions around GPU operations. In the old app, not all GPU operations were so locked:
http://boinc.berkeley.edu/trac/wiki/BasicApi
There may be other circumstances under which a driver hang can be induced, but this should substantially reduce the incidence rate. It'll be good for all WUs. Indeed, it's not obvious why those poor NOELIAs always took the brunt of it.
MJH
ID: 32480
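In the BOINC API (per the BasicApi page linked above) this pattern uses boinc_begin_critical_section()/boinc_end_critical_section(); while the count is non-zero, the client defers killing or suspending the app. A minimal runnable sketch of the idea, with Python stand-ins for the C API calls and a placeholder for the GPU work:

```python
# Stand-ins for boinc_begin_critical_section() / boinc_end_critical_section()
# from the BOINC C API; a real science app calls the C functions, this just
# models the counting behaviour so the pattern can be run standalone.
critical_depth = 0

def begin_critical_section():
    global critical_depth
    critical_depth += 1      # client must not kill the app while depth > 0

def end_critical_section():
    global critical_depth
    critical_depth -= 1      # safe point: suspend/exit may proceed

def gpu_operation(step):
    # Placeholder for a CUDA kernel launch or device memcpy. Killing the
    # process mid-operation is what wedged the driver on suspend.
    assert critical_depth > 0, "GPU op outside critical section"
    return f"step {step} done"

results = []
for i in range(3):
    begin_critical_section()
    results.append(gpu_operation(i))
    end_critical_section()

print(results)  # → ['step 0 done', 'step 1 done', 'step 2 done']
```

The design point is simply that every GPU call is bracketed, so the only moments the client may act are the known-safe gaps between operations.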
Hey MJH, glad to have you back! The project feels alive again.. thanks!
ID: 32494
"Try out 8.02. Give it a damn good suspending."
8.04 KLEBE tasks are still causing driver resets :( My scenario is that I have 2 of them running - 1 on my GTX 460 and 1 on my GTX 660 Ti - and I'm choosing "Suspend GPU" from the system tray. Can you please see if you need to add any more critical-section mutexes?
Thanks,
Jacob
ID: 32573
As frequently as before?
MJH
ID: 32584
Frequency is quite hard to conclusively prove.
ID: 32585
I've got 2 WUs which I can't start anymore, because as soon as I resume them the NVIDIA driver crashes.
ID: 33577
Hype,
ID: 33578
Unclear if these are crashing the driver; there is no message saying so.
ID: 33596
More of the same type of problem, on a different NOELIA task.
ID: 33598
Could be a driver issue, try updating to the latest (beta) driver.
ID: 33601
"Unclear if these are crashing the driver; there is no message saying that it has."
I couldn't find the beta test drivers. However, when the 331.58 and 331.65 drivers came out, I installed them. No such crashes under either of these.
ID: 33655
"Unclear if these are crashing the driver; there is no message saying that it has."
Correction: the 331.65 driver makes such crashes less frequent (perhaps every other day), but does not stop them completely.
ID: 33684
Crashes less frequent with the 331.65 driver, but not gone. GPU workunits for other BOINC projects still running properly.
ID: 33686
"some update that looks likely to fix this problem."
??? Explain please ::grabs popcorn::
ID: 33687
"some update that looks likely to fix this problem."
I'm waiting for an Nvidia driver release, a BOINC release, or a GPUGRID application release before I enable GPUGRID workunits on that computer again.
ID: 33785
"some update that looks likely to fix this problem."
I've now installed BOINC 7.2.28 on the computer with the problem. I tried installing the 331.82 Nvidia driver a few times; it never installed correctly. I'm back on the 331.65 driver. GPUGRID workunits have run properly on that computer for the last few days, with no more driver crashes.
ID: 33977
"some update that looks likely to fix this problem."
This wasn't enough to fully fix the problem; however, the driver crashes no longer crash Windows as well. I'll watch to see if the new crashes cause enough problems that I need to put GPUGRID back on "No new tasks" on that computer.
ID: 33993
I've been trying to crash NOELIA tasks on my 660 Ti by repeatedly (20x) suspending and resuming them, but I can't. This seems like a Windows bug to me. Suggest installing Linux to fix it, or a script that peeks at the names of the GPUGrid tasks in your cache every 60 seconds and aborts them if they're NOELIA. Either way would be a path of lesser resistance leading to greater productivity and oneness with the Buddha ;)
ID: 34002
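The name-filtering half of that script idea could be sketched as below (pure matching logic; the task names are hypothetical examples, and in practice you would feed it the names reported by boinccmd and issue the aborts through boinccmd as well):

```python
def tasks_to_abort(task_names):
    # Flag any task whose name marks it as a NOELIA batch WU,
    # case-insensitively, so it can be aborted before it runs.
    return [name for name in task_names if "NOELIA" in name.upper()]

# Hypothetical task names, for illustration only.
cache = ["I24R6-NATHAN_dhfr36_5", "061px27x2-NOELIA_061p-1", "SR42-SANTI_RAP74"]
print(tasks_to_abort(cache))  # → ['061px27x2-NOELIA_061p-1']
```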
The 8.14 application version fixed the problem that this thread described. It has been resolved for a while now.
ID: 34003
But I couldn't crash them with the previous version either.
ID: 34004
It was a Windows-only fix.
ID: 34013