Message boards : Number crunching : Abrupt computer restart - Tasks stuck - Kernel not found
Author | Message |
---|---|
I recently had a power outage here, where the computer lost power while it had been working on BOINC. Name I66R8-NATHAN_KIDKIXc22_6-9-50-RND7714_1 Workunit 4795185 Created 29 Sep 2013 | 9:39:42 UTC Sent 29 Sep 2013 | 9:56:59 UTC Received 30 Sep 2013 | 4:01:08 UTC Server state Over Outcome Computation error Client state Aborted by user Exit status 203 (0xcb) EXIT_ABORTED_VIA_GUI Computer ID 153764 Report deadline 4 Oct 2013 | 9:56:59 UTC Run time 48,589.21 CPU time 48,108.94 Validate state Invalid Credit 0.00 Application version Long runs (8-12 hours on fastest card) v8.14 (cuda55) Stderr output <core_client_version>7.2.16</core_client_version> <![CDATA[ <message> aborted by user </message> ]]> Name 17x6-SANTI_RAP74wtCUBIC-13-34-RND0681_0 Workunit 4807187 Created 29 Sep 2013 | 13:06:23 UTC Sent 29 Sep 2013 | 17:32:54 UTC Received 30 Sep 2013 | 4:01:08 UTC Server state Over Outcome Computation error Client state Aborted by user Exit status 203 (0xcb) EXIT_ABORTED_VIA_GUI Computer ID 153764 Report deadline 4 Oct 2013 | 17:32:54 UTC Run time 17,822.88 CPU time 3,669.02 Validate state Invalid Credit 0.00 Application version Long runs (8-12 hours on fastest card) v8.14 (cuda55) Stderr output <core_client_version>7.2.16</core_client_version> <![CDATA[ <message> aborted by user </message> ]]> Name 112-MJHARVEY_CRASH3-14-25-RND0090_2 Workunit 4807215 Created 29 Sep 2013 | 17:32:12 UTC Sent 29 Sep 2013 | 17:32:54 UTC Received 29 Sep 2013 | 19:04:42 UTC Server state Over Outcome Computation error Client state Compute error Exit status -226 (0xffffffffffffff1e) ERR_TOO_MANY_EXITS Computer ID 153764 Report deadline 4 Oct 2013 | 17:32:54 UTC Run time 4,020.13 CPU time 1,062.94 Validate state Invalid Credit 0.00 Application version ACEMD beta version v8.14 (cuda55) Stderr output <core_client_version>7.2.16</core_client_version> <![CDATA[ <message> too many exit(0)s </message> <stderr_txt> # GPU [GeForce GTX 460] Platform [Windows] Rev [3203] VERSION [55] # SWAN Device 1 : # Name : GeForce GTX 460 # ECC : Disabled # Global mem : 1024MB # Capability : 2.1 # PCI ID : 0000:08:00.0 # Device clock : 1526MHz # Memory clock : 1900MHz # Memory width : 256bit # Driver version : r325_00 : 32723 # GPU 0 : 68C # GPU 1 : 61C # GPU 2 : 83C # GPU 1 : 63C # GPU 1 : 64C # GPU 1 : 65C # GPU 1 : 66C # GPU 1 : 67C # GPU 1 : 68C # GPU 0 : 69C # GPU 1 : 69C # GPU 1 : 70C # GPU 0 : 70C # GPU 1 : 71C # GPU 0 : 71C # GPU [GeForce GTX 460] Platform [Windows] Rev [3203] VERSION [55] # SWAN Device 1 : # Name : GeForce GTX 460 # ECC : Disabled # Global mem : 1024MB # Capability : 2.1 # PCI ID : 0000:08:00.0 # Device clock : 1526MHz # Memory clock : 1900MHz # Memory width : 256bit # Driver version : r325_00 : 32723 Kernel not found# SWAN swan_assert 0 14:56:38 (1696): Can't acquire lockfile (32) - waiting 35s # GPU [GeForce GTX 460] Platform [Windows] Rev [3203] VERSION [55] # SWAN Device 1 : # Name : GeForce GTX 460 # ECC : Disabled # Global mem : 1024MB # Capability : 2.1 # PCI ID : 0000:08:00.0 # Device clock : 1526MHz # Memory clock : 1900MHz # Memory width : 256bit # Driver version : r325_00 : 32723 Kernel not found# SWAN swan_assert 0 # GPU [GeForce GTX 460] Platform [Windows] Rev [3203] VERSION [55] # SWAN Device 1 : # Name : GeForce GTX 460 # ECC : Disabled # Global mem : 1024MB # Capability : 2.1 # PCI ID : 0000:08:00.0 # Device clock : 1526MHz # Memory clock : 1900MHz # Memory width : 256bit # Driver version : r325_00 : 32723 Kernel not found# SWAN swan_assert 0 ... 
# GPU [GeForce GTX 460] Platform [Windows] Rev [3203] VERSION [55] # SWAN Device 1 : # Name : GeForce GTX 460 # ECC : Disabled # Global mem : 1024MB # Capability : 2.1 # PCI ID : 0000:08:00.0 # Device clock : 1526MHz # Memory clock : 1900MHz # Memory width : 256bit # Driver version : r325_00 : 32723 Kernel not found# SWAN swan_assert 0 </stderr_txt> ]]> | |
ID: 33274 | Rating: 0 | rate: / | |
Interesting that you and Matt - no, not Matt Harvey, the guy in GPUGrid Start Up/Recovery Issues - should both post about similar issues on the same day. In that situation you can add <suspended_via_gui/> to the <result> block for the suspect task. As the name suggests, that's the same as clicking 'suspend' for the task while BOINC is running, and it gets control of the machine back so you can investigate on the next normal restart. By convention, the line goes just under <plan_class> in client_state, but I think anywhere at the first indent level will do. Interesting point about stderr.txt - I hadn't looked that far into it. The process for stderr is:
1. It gets written as a file in the slot directory.
2. On task completion, the contents of the file get copied into that same <result> block in client_state.xml.
3. The <result> data is copied into a sched_request file for the project's server.
4. The scheduler result handler copies it into the database for display on the web.
So, which of those gets skipped if a task gets aborted? Next time it happens, I'll follow the process through and see where it goes missing. Any which way, it's probably a BOINC problem, and I agree it would be better if partial information were available for aborted tasks. You and I both know where and how to get that changed once we've narrowed down the problem ;) | |
ID: 33277 | Rating: 0 | rate: / | |
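For anyone following the recipe above, here is a minimal sketch of what the edited <result> block in client_state.xml might look like. Edit the file only while the BOINC client is shut down; the elided fields and the task name are placeholders, and the cuda55 plan class is assumed from the tasks quoted in this thread:

```
<result>
    <name>NAME_OF_SUSPECT_TASK</name>
    ...
    <plan_class>cuda55</plan_class>
    <suspended_via_gui/>
    ...
</result>
```

On the next client start, the task should then show as suspended by the user instead of being launched straight into the driver-crash loop.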
I have the same problem with Nathan units on a GTX460 but I didn't have power outages. | |
ID: 33278 | Rating: 0 | rate: / | |
Also had my GTX660TI throw a wobbly on a Noelia WU here | |
ID: 33279 | Rating: 0 | rate: / | |
I recently had a power outage here, where the computer lost power while it had been working on BOINC. Jacob, this has been my life with my GTX 590 box for the last month. I usually just end up resetting the whole project because the apps will not continue. It may run for a day or two, or it may run for just two hours before a BSOD. I'm fighting the nvlddmkm.sys thing right now and will probably end up reinstalling as a last-ditch effort. This system does not normally crash unless BOINC is running GPUGrid WUs. It is not overclocked and is water cooled. All timings and specs are as from the Dell factory for this T7500. But yeah... I completely understand what you're going through. Operator ____________ | |
ID: 33284 | Rating: 0 | rate: / | |
MJH: | |
ID: 33327 | Rating: 0 | rate: / | |
I did reinstall the OS on my GTX 590 box and have not installed any updates. | |
ID: 33337 | Rating: 0 | rate: / | |
I had the same problem with this WU on my GTX 580 machine - http://www.gpugrid.net/workunit.php?wuid=4819239 | |
ID: 33367 | Rating: 0 | rate: / | |
MJH: Any response? | |
ID: 33368 | Rating: 0 | rate: / | |
Jacob, | |
ID: 33369 | Rating: 0 | rate: / | |
Ha! Considering it seems like it should be easy to reproduce (turn off the PC via the switch, not via a normal shutdown, in the middle of a GPUGrid task)... Challenge accepted. | |
ID: 33373 | Rating: 0 | rate: / | |
MJH: | |
ID: 33374 | Rating: 0 | rate: / | |
MJH: | |
ID: 33375 | Rating: 0 | rate: / | |
Matt, | |
ID: 33376 | Rating: 0 | rate: / | |
MJH: FWIW - My GTX 460 machine finished the task that I posted about. Although it took longer than 24 hours, it was a SANTI-baxbim task - http://www.gpugrid.net/workunit.php?wuid=4818983 Also, I have to say that I somewhat agree with the above post about people who run this project really needing to know what they are doing. I'm a software developer / computer scientist by trade, and I build my own PCs when I need them. One reason that I think many people might leave this project is that it takes so long to run a WU, and that they must be returned in 5 days. Some people might turn their machines off, and thus would not be able to return the WU in 5 days. Personally, I only run this project on weekends. In general, I have found this project to be relatively stable, with this being perhaps the only serious fault I have encountered so far. However, when faults like this arise, it would almost certainly take skilled people to get out of the situation created. Unfortunately, though, this and other similar projects, at least as I see it, are on the bleeding edge. As with the software I work on in my job (a custom FEA program), which is also on the bleeding edge, it is sometimes extraordinarily difficult to catch a bug like this, since it occurs only under limited circumstances that may not be exercised in testing unless, in this case, the PC were shut down abnormally. Just my $0.02. ____________ | |
ID: 33377 | Rating: 0 | rate: / | |
One reason that I think many people might leave this project is that it takes so long to run a WU, and that they must be returned in 5 days. Some people might turn their machines off, and thus would not be able to return the WU in 5 days. Personally, I only run this project on weekends. That's another thing that is a "trap" and confusing. Although the deadline is 5 days, if you don't return the WU within 2 days it is resent to another host, and if that host returns first (likely) any computing time you spend is wasted, because the first valid result returned is canonical and yours is binned. ____________ Radio Caroline, the world's most famous offshore pirate radio station. Great music since April 1964. Support Radio Caroline Team - Radio Caroline | |
ID: 33391 | Rating: 0 | rate: / | |
One reason that I think many people might leave this project is that it takes so long to run a WU, and that they must be returned in 5 days. Some people might turn their machines off, and thus would not be able to return the WU in 5 days. Personally, I only run this project on weekends. The 2-day resend was discontinued long ago. | |
ID: 33392 | Rating: 0 | rate: / | |
One reason that I think many people might leave this project is that it takes so long to run a WU, and that they must be returned in 5 days. Some people might turn their machines off, and thus would not be able to return the WU in 5 days. Personally, I only run this project on weekends. I type corrected :-) ____________ Radio Caroline, the world's most famous offshore pirate radio station. Great music since April 1964. Support Radio Caroline Team - Radio Caroline | |
ID: 33393 | Rating: 0 | rate: / | |
Alright, back on topic here... | |
ID: 33394 | Rating: 0 | rate: / | |
Alright, back on topic here... Jacob: are you talking about how, when one of the GPUs TDRs, it screws up all the other tasks running on the other GPUs as well? That happens to me on my GTX590 box all the time (mostly power outages). If one messes up and ends up causing a TDR or a complete dump and reboot, then when I start BOINC again all the remaining WUs in progress on the other GPUs also cause more TDRs unless I abort them. Sometimes even that doesn't help and I have to completely reset the project. Example: I had a TDR the other day. Three WUs were uploading at the time. Only one was actually processing. Fine. So I reboot, catch BOINC before it starts processing the problem WU, and suspend processing so the three that did complete can upload for credit. Now, I abort the problem WU and let the system download 4 new WUs. As soon as processing starts, Wham! Another TDR. So I do a reset of the project, and 4 more WUs download and start processing without any problem at all. So the point is, unless I reset the project when I get a TDR I'm just wasting my time downloading new WUs, because they are all going to continue to crash until I do a complete reset. So I'm not sure what leftover file in the BOINC or GPUGrid project folder(s) is causing the TDRs after the original event. Is that the same issue you are talking about here or am I way off? Operator ____________ | |
ID: 33397 | Rating: 0 | rate: / | |
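A hedged sketch (Windows cmd, assuming the default BOINC install and data directories) of the "catch BOINC before it starts processing" routine described in the post above, using boinccmd instead of racing the Manager GUI. The task name is a placeholder:

```
rem Run from the BOINC data directory so boinccmd can read the GUI RPC
rem password from gui_rpc_auth.cfg.
cd /d "C:\ProgramData\BOINC"
set BC="C:\Program Files\BOINC\boinccmd.exe"

rem 1. Stop GPU work immediately so nothing starts crunching (and crashing).
rem    The trailing 0 means "no time limit" for the mode change.
%BC% --set_gpu_mode never 0

rem 2. Find and abort the task that was running when the machine went down.
%BC% --get_tasks
%BC% --task http://www.gpugrid.net/ NAME_OF_SUSPECT_TASK abort

rem 3. Report the finished results, then let GPU work resume.
%BC% --project http://www.gpugrid.net/ update
%BC% --set_gpu_mode auto 0
```

If aborting the one suspect task isn't enough, the same tool can do the full reset described above with `--project http://www.gpugrid.net/ reset`.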
I believe your issue is a separate one. | |
ID: 33398 | Rating: 0 | rate: / | |
Was there a resolution to this? | |
ID: 33501 | Rating: 0 | rate: / | |
There has been no recent contact from MJH, and so no resolution. | |
ID: 33502 | Rating: 0 | rate: / | |
Thanks. I had no problems this past weekend. However, I did not experience any abnormal shutdowns or freezes. | |
ID: 33504 | Rating: 0 | rate: / | |
Are we still collecting these? I had a sticking task - multiple driver restarts after a forced reboot - with 23x6-SANTI_RAP74wtCUBIC-18-34-RND6543_0 # GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55] # SWAN Device 0 : # Name : GeForce GTX 670 # ECC : Disabled # Global mem : 2048MB # Capability : 3.0 # PCI ID : 0000:07:00.0 # Device clock : 1084MHz # Memory clock : 3054MHz # Memory width : 256bit # Driver version : r331_00 : 33140 # GPU 0 : 74C # GPU 1 : 55C # GPU 0 : 75C # GPU 1 : 56C # GPU 0 : 76C # GPU 0 : 77C # GPU 0 : 78C # GPU 0 : 79C # GPU 1 : 57C # GPU 0 : 80C # GPU 0 : 81C # GPU 1 : 58C # GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55] # SWAN Device 0 : # Name : GeForce GTX 670 # ECC : Disabled # Global mem : 2048MB # Capability : 3.0 # PCI ID : 0000:07:00.0 # Device clock : 1084MHz # Memory clock : 3054MHz # Memory width : 256bit # Driver version : r331_00 : 33140 SWAN : FATAL : Cuda driver error 702 in file 'swanlibnv2.cpp' in line 1963. # SWAN swan_assert 0 # GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55] # SWAN Device 0 : # Name : GeForce GTX 670 # ECC : Disabled # Global mem : 2048MB # Capability : 3.0 # PCI ID : 0000:07:00.0 # Device clock : 1084MHz # Memory clock : 3054MHz # Memory width : 256bit # Driver version : r331_00 : 33140 SWAN : FATAL : Cuda driver error 702 in file 'swanlibnv2.cpp' in line 1963. # SWAN swan_assert 0 # GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55] # SWAN Device 0 : # Name : GeForce GTX 670 # ECC : Disabled # Global mem : 2048MB # Capability : 3.0 # PCI ID : 0000:07:00.0 # Device clock : 1084MHz # Memory clock : 3054MHz # Memory width : 256bit # Driver version : r331_00 : 33140 SWAN : FATAL : Cuda driver error 702 in file 'swanlibnv2.cpp' in line 1963. # SWAN swan_assert 0 # GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55] # SWAN Device 0 : # Name : GeForce GTX 670 # ECC : Disabled # Global mem : 2048MB # Capability : 3.0 # PCI ID : 0000:07:00.0 # Device clock : 1084MHz # Memory clock : 3054MHz # Memory width : 256bit # Driver version : r331_00 : 33140 SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1963. # SWAN swan_assert 0 # GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55] # SWAN Device 0 : # Name : GeForce GTX 670 # ECC : Disabled # Global mem : 2048MB # Capability : 3.0 # PCI ID : 0000:07:00.0 # Device clock : 1084MHz # Memory clock : 3054MHz # Memory width : 256bit # Driver version : r331_00 : 33140 | |
ID: 33520 | Rating: 0 | rate: / | |
I sent MJH some files, but haven't heard from him :/ | |
ID: 33521 | Rating: 0 | rate: / | |
And it's just happened again, this time with potx108-NOELIA_INS1P-0-14-RND5839_0 # GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55] # SWAN Device 1 : # Name : GeForce GTX 670 # ECC : Disabled # Global mem : 2048MB # Capability : 3.0 # PCI ID : 0000:08:00.0 # Device clock : 1084MHz # Memory clock : 3054MHz # Memory width : 256bit # Driver version : r331_00 : 33140 # GPU 0 : 76C # GPU 1 : 56C # GPU 1 : 57C # GPU 1 : 58C # GPU 1 : 59C # GPU 1 : 60C # GPU 1 : 61C # GPU 0 : 77C # GPU 1 : 62C # GPU 1 : 63C # GPU 0 : 78C # GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55] # SWAN Device 0 : # Name : GeForce GTX 670 # ECC : Disabled # Global mem : 2048MB # Capability : 3.0 # PCI ID : 0000:07:00.0 # Device clock : 1084MHz # Memory clock : 3054MHz # Memory width : 256bit # Driver version : r331_00 : 33140 SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1963. # SWAN swan_assert 0 # GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55] # SWAN Device 0 : # Name : GeForce GTX 670 # ECC : Disabled # Global mem : 2048MB # Capability : 3.0 # PCI ID : 0000:07:00.0 # Device clock : 1084MHz # Memory clock : 3054MHz # Memory width : 256bit # Driver version : r331_00 : 33140 SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1963. # SWAN swan_assert 0 # GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55] # SWAN Device 0 : # Name : GeForce GTX 670 # ECC : Disabled # Global mem : 2048MB # Capability : 3.0 # PCI ID : 0000:07:00.0 # Device clock : 1084MHz # Memory clock : 3054MHz # Memory width : 256bit # Driver version : r331_00 : 33140 SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1963. # SWAN swan_assert 0 # GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55] # SWAN Device 0 : # Name : GeForce GTX 670 # ECC : Disabled # Global mem : 2048MB # Capability : 3.0 # PCI ID : 0000:07:00.0 # Device clock : 1084MHz # Memory clock : 3054MHz # Memory width : 256bit # Driver version : r331_00 : 33140 SWAN : FATAL : Cuda driver error 702 in file 'swanlibnv2.cpp' in line 1963. # SWAN swan_assert 0 22:21:16 (5824): Can't acquire lockfile (32) - waiting 35s # GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55] # SWAN Device 0 : # Name : GeForce GTX 670 # ECC : Disabled # Global mem : 2048MB # Capability : 3.0 # PCI ID : 0000:07:00.0 # Device clock : 1084MHz # Memory clock : 3054MHz # Memory width : 256bit # Driver version : r331_00 : 33140 SWAN : FATAL : Cuda driver error 702 in file 'swanlibnv2.cpp' in line 1963. # SWAN swan_assert 0 # GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55] # SWAN Device 0 : # Name : GeForce GTX 670 # ECC : Disabled # Global mem : 2048MB # Capability : 3.0 # PCI ID : 0000:07:00.0 # Device clock : 1084MHz # Memory clock : 3054MHz # Memory width : 256bit # Driver version : r331_00 : 33140 SWAN : FATAL : Cuda driver error 702 in file 'swanlibnv2.cpp' in line 1963. # SWAN swan_assert 0 # GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55] # SWAN Device 0 : # Name : GeForce GTX 670 # ECC : Disabled # Global mem : 2048MB # Capability : 3.0 # PCI ID : 0000:07:00.0 # Device clock : 1084MHz # Memory clock : 3054MHz # Memory width : 256bit # Driver version : r331_00 : 33140 SWAN : FATAL : Cuda driver error 702 in file 'swanlibnv2.cpp' in line 1963. 
# SWAN swan_assert 0 # GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55] # SWAN Device 0 : # Name : GeForce GTX 670 # ECC : Disabled # Global mem : 2048MB # Capability : 3.0 # PCI ID : 0000:07:00.0 # Device clock : 1084MHz # Memory clock : 3054MHz # Memory width : 256bit # Driver version : r331_00 : 33140 SWAN : FATAL : Cuda driver error 702 in file 'swanlibnv2.cpp' in line 1963. # SWAN swan_assert 0 # GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55] # SWAN Device 0 : # Name : GeForce GTX 670 # ECC : Disabled # Global mem : 2048MB # Capability : 3.0 # PCI ID : 0000:07:00.0 # Device clock : 1084MHz # Memory clock : 3054MHz # Memory width : 256bit # Driver version : r331_00 : 33140 SWAN : FATAL : Cuda driver error 702 in file 'swanlibnv2.cpp' in line 1963. # SWAN swan_assert 0 # GPU [GeForce GTX 670] Platform [Windows] Rev [3203] VERSION [55] # SWAN Device 0 : # Name : GeForce GTX 670 # ECC : Disabled # Global mem : 2048MB # Capability : 3.0 # PCI ID : 0000:07:00.0 # Device clock : 1084MHz # Memory clock : 3054MHz # Memory width : 256bit # Driver version : r331_00 : 33140 I seem to see similarities in SWAN : FATAL : Cuda driver error 702 in file 'swanlibnv2.cpp' in line 1963. SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1963. in both reports. And in both cases, the first error occurs after the first restart. Interestingly, this was running in the same slot directory as the previous one (slot 4), and part of my bug report to BOINC (apart from the non-report of stderr_txt) was that the slot directory wasn't cleaned after an abort. I'll make sure that's done properly before I risk another one. | |
ID: 33524 | Rating: 0 | rate: / | |
Sorry guys, I've been (and still am) very busy. Jacob, thanks for the files - they were useful, and I know how to fix the problem. Unfortunately, I'll not have the opportunity to do any more work on the application for a while. Will keep you posted. | |
ID: 33527 | Rating: 0 | rate: / | |
I have also been experiencing this problem, over the past several weeks at least. Glad to see the cause has been identified by the project; now just waiting for a fix. | |
ID: 33543 | Rating: 0 | rate: / | |
I suspect I had the same problem: the driver kept resetting in a loop, eventually ending in a blue screen and memory dump. I managed to stop the GPU and spotted an MD5 checksum error message associated with a GPUGrid logo PNG file. There is probably more to it than a bad logo-file download, so I reset the project and stopped new work. The problem disappeared on this GTX 570 system. Other systems are running GPUGrid OK. | |
ID: 33621 | Rating: 0 | rate: / | |
The cause: | |
ID: 33623 | Rating: 0 | rate: / | |
I have edited the cc_config file to include the startup delay, and now that delay (60 seconds in my case) kicks in every time I start BOINC up, whether or not I had a problem before it was shut down. So now I don't have to try and 'catch' BOINC to abort tasks, or go into safe mode or anything else. I can just abort tasks that I know will fail due to the power interruption issues I occasionally have to deal with here (mostly on my GTX590 box). Operator ____________ | |
ID: 33624 | Rating: 0 | rate: / | |
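For reference, a minimal cc_config.xml of the kind described in the post above. The 60-second value is the one the poster uses; the file lives in the BOINC data directory and is read at client startup (or via the Manager's 'Read config file(s)' option):

```
<cc_config>
    <options>
        <!-- Wait this many seconds after client startup before starting any
             tasks, leaving time to suspend or abort a suspect task. -->
        <start_delay>60</start_delay>
    </options>
</cc_config>
```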
My computer abruptly restarted a couple times today, and I had to deal with this problem again. | |
ID: 33705 | Rating: 0 | rate: / | |
Jacob, | |
ID: 33769 | Rating: 0 | rate: / | |
Jacob, Thanks. I'm looking forward to the fix. And testing the fix should be fun too muaahahahahaha (don't get to yank power cord out of this machine very often!) | |
ID: 33772 | Rating: 0 | rate: / | |
Please hurry MJH. | |
ID: 33784 | Rating: 0 | rate: / | |
And testing the fix should be fun too muaahahahahaha (don't get to yank power cord out of this machine very often!) LOL! I recommend using a switch instead (power switch or at the PSU) as these are "debounced" (not sure this is the correct electrical engineering term... sounds wrong). It could also work to just kill BOINC via Task Manager - maybe try this before the fix is out :) MrS ____________ Scanning for our furry friends since Jan 2002 | |
ID: 33820 | Rating: 0 | rate: / | |
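If anyone wants to try the "kill BOINC" test from a command prompt rather than Task Manager, a rough sketch follows. The acemd image name varies between app versions, so the name below is only a placeholder:

```
rem Force-kill the running science app and then the client, with no clean shutdown.
rem /F terminates immediately, which is closer to a power cut than a normal exit
rem (though unlike a real outage, the OS still gets to flush its file buffers).
taskkill /F /IM ACEMD_IMAGE_NAME.exe
taskkill /F /IM boinc.exe
```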
And testing the fix should be fun too muaahahahahaha (don't get to yank power cord out of this machine very often!) You got it right. ____________ | |
ID: 33841 | Rating: 0 | rate: / | |
Exactly the same for me. | |
ID: 33845 | Rating: 0 | rate: / | |
I didn't have a power outage, but the computer did restart (the WU's caused the system to restart). | |
ID: 33903 | Rating: 0 | rate: / | |
(the WU's caused the system to restart). That's a bold statement. Have you opted IN to the current beta test of the v8.15 application designed to prevent the endless driver crash loop on restart, however the original problem came about? | |
ID: 33904 | Rating: 0 | rate: / | |
I also had an error with a SANTI_RAP after three "stop and starts". | |
ID: 33905 | Rating: 0 | rate: / | |
Have you opted IN to the current beta test of the v8.15 application... I didn't bother yesterday, as I saw that only ~10 test WUs had been released and none were available. Since selecting Betas today, none have come my way so far:
16/11/2013 14:32:50 | GPUGRID | No tasks are available for ACEMD beta version
16/11/2013 14:48:03 | GPUGRID | No tasks are available for ACEMD beta version
16/11/2013 15:01:34 | GPUGRID | No tasks are available for ACEMD beta version
16/11/2013 15:10:52 | GPUGRID | No tasks are available for ACEMD beta version
GPUGRID ACEMD beta version 1140.53 (5.24%)
GPUGRID ACEMD beta version 47.20 (1.12%)
GPUGRID ACEMD beta version 544.78 (5.51%)
| |
ID: 33908 | Rating: 0 | rate: / | |
Right. | |
ID: 33909 | Rating: 0 | rate: / | |
Right. And looking at the stderr for the individual task in question, I could see no sign that GPUGrid had crashed or otherwise caused the initial problem, only that it had entered the 'looping driver' state on the first restart. There seem to be more Beta tasks available for testing this afternoon - I have some flagged 'KLAUDE' which look to be heading towards 6-7 hours on my GTX 670s. That should be long enough to trigger a crash for testing purposes :) | |
ID: 33923 | Rating: 0 | rate: / | |
The Betas might fix the driver restarts, but that doesn't address the cause of the system crash/restart - if it is related to the task/app. This seemed to be happening in the past with certain types of WU: if you ran those WUs, the system crashed and the drivers restarted on reboot; if you didn't run those tasks, there weren't any restarts or driver failures. There probably wouldn't be anything in the BOINC logs if the app/WU triggered an immediate system stop. | |
ID: 33924 | Rating: 0 | rate: / | |
MJH: | |
ID: 33932 | Rating: 0 | rate: / | |
Jacob - | |
ID: 33935 | Rating: 0 | rate: / | |
So, for testing the current app, should I have waited several checkpoints between Tree Kills? | |
ID: 33936 | Rating: 0 | rate: / | |
I hope this gets fixed, because the cold weather here is causing more frequent power outages and I have a farm of GPUGrid systems. | |
ID: 34087 | Rating: 0 | rate: / | |
| |
ID: 34103 | Rating: 0 | rate: / | |
Actually the subjective error rate has decreased a lot since the trouble was resolved a few months ago, when Matt developed the app to 8.14. What's left are occasional glitches (like sending WUs to the wrong queue) and from what I'm seeing more isolated and/or special errors. | |
ID: 34132 | Rating: 0 | rate: / | |
Actually the subjective error rate has decreased a lot since the trouble was resolved a few months ago, when Matt developed the app to 8.14. What's left are occasional glitches (like sending WUs to the wrong queue) and from what I'm seeing more isolated and/or special errors. And I suspect it will be even better when they have enough confidence to promote the restart-fix v8.15 from Beta to stock application. | |
ID: 34133 | Rating: 0 | rate: / | |
Actually the subjective error rate has decreased a lot since the trouble was resolved a few months ago, when Matt developed the app to 8.14. What's left are occasional glitches (like sending WUs to the wrong queue) and from what I'm seeing more isolated and/or special errors. Yeah, maybe, but my computer also had a few BSODs yesterday with multiple long-run WUs. Nothing worked after that; I had to delete the whole BOINC folder and the driver, clean-install everything, and after that it finally worked again. A lot of work for a few long runs, in my opinion - I'm glad I only have one PC. haha :P The BSOD error had something to do with kernel issues and corrupted the installed NVIDIA driver. So when the computer booted, the screens froze, the driver crashed within a few seconds, and after that I had a BSOD again, again and again. I don't know whether it's a coincidence, or whether more people have had the same kind of problem. | |
ID: 34137 | Rating: 0 | rate: / | |
FoldingNator: | |
ID: 34138 | Rating: 0 | rate: / | |
I again had this problem today and yesterday. It impacts Windows systems only. | |
ID: 34144 | Rating: 0 | rate: / | |
Hi Jacob/skgiven, thanks for your messages. It sounds the same, but after the driver crash the Windows logfiles said that a part of the driver was corrupted. Though I also doubt it.
Hmmm, sounds like a great idea. ;-) | |
ID: 34146 | Rating: 0 | rate: / | |
This morning I again found that my computer had restarted (3 days in a row now), and when I logged in the NVIDIA driver repeatedly restarted. One GPUGrid task had completed and wanted to upload (which I also saw yesterday). So it is likely that the new task is causing this. | |
ID: 34148 | Rating: 0 | rate: / | |
I have never had the reboot problem on my dedicated BOINC PC with two GTX 660s running the longs (331.65 drivers, Win7 64-bit). But that PC has an uninterruptible power supply (UPS), and never suffers from power outages. Also, the cards are now stable, after some effort as explained elsewhere, to the point where they no longer have the "The simulation has become unstable. Terminating to avoid lock-up" problem. | |
ID: 34150 | Rating: 0 | rate: / | |
Also have a massive problem with reboots now which might be related. The GTX Titan (Win 7 SP1 64bit, driver 331.40) received the following two long runs and then a short run (which I did load on purpose for testing) today: <core_client_version>7.2.31</core_client_version> <![CDATA[ <message> (unknown error) - exit code -52 (0xffffffcc) </message> <stderr_txt> # GPU [GeForce GTX TITAN] Platform [Windows] Rev [3203] VERSION [55] # SWAN Device 0 : # Name : GeForce GTX TITAN # ECC : Disabled # Global mem : 4095MB # Capability : 3.5 # PCI ID : 0000:01:00.0 # Device clock : 875MHz # Memory clock : 3004MHz # Memory width : 384bit # Driver version : r331_00 : 33140 # GPU 0 : 49C (...) # GPU 0 : 81C # GPU [GeForce GTX TITAN] Platform [Windows] Rev [3203] VERSION [55] # SWAN Device 0 : # Name : GeForce GTX TITAN (see above) # Driver version : r331_00 : 33140 SWAN : FATAL Unable to load module .mshake_kernel.cu. (999) </stderr_txt> ]]> The card ran fine the last week with Milkyway in DP mode. I will try to run some other projects now in SP mode on the Titan to see if the card and the nvidia driver installation is still fine. I will also test if the same problem occurs with the GT 650M card. ____________ Mark my words and remember me. - 11th Hour, Lamb of God | |
ID: 34153 | Rating: 0 | rate: / | |
I will try to run some other projects now in SP mode on the Titan to see if the card and the nvidia driver installation is still fine. Collatz seems to run fine on the Titan with heavy load through config file. Nothing validated yet, but no obvious errors. Will try to catch a new v8.15 short run. EDIT: Got one. Looks good so far, now at 25%. http://www.gpugrid.net/workunit.php?wuid=4978432 I will also test if the same problem occurs with the GT 650M card. Not yet. A v8.14 short runs fine and is at 25% now. | |
ID: 34155 | Rating: 0 | rate: / | |
Will try to catch a new v8.15 short run. I699-SANTI_baxbimSPW2-12-62-RND9134 Nope, another failure. Sudden reboot at 43%. After the restart some POEM OpenCL work kicked in, so the NVIDIA driver and the GPUGRID workunit had no chance to crash this time, and I could suspend the workunit in question. The WUProp workunit was killed again, too. This shows at least that WUProp is killed by the sudden reboot, not by the video driver crashing (makes sense to me). If you would like to receive the content of the slot, pls. PM an email address. EDIT: The two POEM workunits finished OK. No indication of a hardware fault or a faulty driver yet. | |
ID: 34156 | Rating: 0 | rate: / | |
I'm also using 331.40 (which is a Beta). Probably worth updating to 331.82 (the most recent WHQL driver), but for me this wasn't happening at the beginning of last week or before that (same drivers). | |
ID: 34159 | Rating: 0 | rate: / | |
but for me this wasn't happening at the beginning of last week or before that (same drivers). Yes, same here. Might consider an update, though... | |
ID: 34161 | Rating: 0 | rate: / | |
I did the suggested update and I'm still getting stung. | |
ID: 34163 | Rating: 0 | rate: / | |
Something else, or maybe the same... Two entries from the Windows event log:
"Cannot find the description of Event ID 1 from source NvStreamSvc. The component that started the event may not be installed on the local computer, or the installation is corrupted. You can install the component on the local computer, or restore it."
"The computer has restarted after a bug check. The bug check was: 0x00000116 (0xfffffa801270d010, 0xfffff88006940010, 0x0000000000000000, 0x000000000000000d). A dump was saved in: C:\Windows\Minidump\120513-6286-01.dmp. Report ID: 120513-6286-01." | |
ID: 34164 | Rating: 0 | rate: / | |
Thanks, skgiven, for sharing the info. I guess I will refrain from it then, at least for the time being. "after the BSOD/install older driver all new tasks do have a shorter CPU time." Is it v314.22 you are using now, FoldingNator? Unfortunately, we Titan/GTX 780 users have to stick with v331.40 or higher, I'm afraid. | |
ID: 34166 | Rating: 0 | rate: / | |
Yes you're right! | |
ID: 34169 | Rating: 0 | rate: / | |
Just suffered the effects of this bug today after a power failure, but it only affected my Win 7 Pro 64 machine; my Win XP Home 32 machine restarted the tasks OK. | |
ID: 34172 | Rating: 0 | rate: / | |
A power glitch caused fatal GPUGrid restarts on 5 systems a few hours ago. This is a PITA. These systems are headless, and when I bring a monitor over to fix the problem (a reset of GPUGrid), the BOINC Manager window can be off the edge of the screen, and by the time I get it down to where I can select GPUGrid and reset the project, the damn display has reset 3 times and hung up, and I have to start the process all over again. | |
ID: 34216 | Rating: 0 | rate: / | |
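For headless boxes like these, one hedged alternative to carrying a monitor around is to reset the project over BOINC's GUI RPC from another machine, assuming remote access has been enabled on each host (an entry in remote_hosts.cfg plus the gui_rpc_auth.cfg password). The host name and password below are placeholders, and the path assumes a default Windows install; on Linux, boinccmd is normally already on the PATH:

```
"C:\Program Files\BOINC\boinccmd.exe" --host HEADLESS_BOX --passwd RPC_PASSWORD --project http://www.gpugrid.net/ reset
```

BOINC Manager's 'Select computer...' option can do the same thing through the GUI, without fighting the resetting display on the crashing machine.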
Beemer: | |
ID: 34217 | Rating: 0 | rate: / | |
If you put a not-too-expensive UPS behind it, then you can safely switch the rigs down on a power outage (if you are near the PCs)? | |
ID: 34219 | Rating: 0 | rate: / | |
If you put a not-too-expensive UPS behind it, then you can safely switch the rigs down on a power outage (if you are near the PCs)? If you put a quality UPS behind it, it will come with software that can switch the rigs down safely whether you are nearby or not. | |
ID: 34220 | Rating: 0 | rate: / | |
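To make the "switch the rigs down safely" part concrete, here is a minimal sketch of a batch file that UPS software could be configured to run when the power fails. Whether and how a given UPS package runs such a script is an assumption, and the paths are the BOINC defaults:

```
@echo off
rem Tell the BOINC client to exit; it asks the science apps to quit, so they
rem stop at a known point instead of being cut off mid-write.
cd /d "C:\ProgramData\BOINC"
"C:\Program Files\BOINC\boinccmd.exe" --quit

rem Give the apps a minute to wind down, then shut Windows down cleanly.
timeout /t 60
shutdown /s /t 0
```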
Check out Cyber Power brand UPS. They're more reasonably priced than the APC brand and they even provide an app for Linux that has options to (I assume their Windows app is even better): | |
ID: 34222 | Rating: 0 | rate: / | |
Check out Cyber Power brand UPS. They're more reasonably priced than the APC brand and they even provide an app for Linux that has options to (I assume their Windows app is even better): I like the CyberPower "Pure sine wave" series, which gives a better sine wave output than the others, which give only a stepped-sine wave approximation. The latter can cause trouble with some of the new high-efficiency PC power supplies that have power-factor correction (PFC). I have just replaced an APC UPS 750 with a "CyberPower CP1350PFCLCD UPS 1350VA/810W PFC compatible Pure sine wave" (my second one), since the APC causes an occasional alarm fault with the 90% efficient power supply in that PC. | |
ID: 34228 | Rating: 0 | rate: / | |
I have a small APC that runs the cable modem, switch and WiFi, but it cannot be used with the AC powerline Ethernet adapter as it filters out the Ethernet signal. My systems are not all in one place where they could be serviced by a single backup unit. | |
ID: 34236 | Rating: 0 | rate: / | |
This thread was regarding a specific problem, as detailed in the first post. | |
ID: 36445 | Rating: 0 | rate: / | |
Message boards : Number crunching : Abrupt computer restart - Tasks stuck - Kernel not found