Message boards : Number crunching : All Gerard WUs erroring
Author | Message |
---|---|
Hi, | |
ID: 42537 | Rating: 0 | rate: / Reply Quote | |
Also for windows, error message | |
ID: 42538 | Rating: 0 | rate: / Reply Quote | |
Hi, Yes, there seems to be a batch of WUs that are failing on previously reliable Linux machines and on some mostly bad Windows hosts, but they are running fine on my Windows computers. One has already completed successfully at this time. See links: https://www.gpugrid.net/workunit.php?wuid=11397999 https://www.gpugrid.net/workunit.php?wuid=11398213 https://www.gpugrid.net/workunit.php?wuid=11398820 https://www.gpugrid.net/workunit.php?wuid=11398294 | |
ID: 42539 | Rating: 0 | rate: / Reply Quote | |
On my Windows 7 machine (i7-3770, GTX 980) I recently had ~10 GERARD WUs (more in the queue and still coming in) that were running at less than 1% GPU usage (according to GPU-Z), while the progress in the BOINC Manager appeared to be normal or a little slow (~15 hour estimate per WU). All these WUs suddenly disappeared from the BOINC Manager without any error message and without showing up in my results in my GPUGRID stats. Certainly there is something flawed with these WUs! | |
ID: 42540 | Rating: 0 | rate: / Reply Quote | |
I missed the other WUs, but right now this happened to the WU: | |
ID: 42541 | Rating: 0 | rate: / Reply Quote | |
I have both kind of these WUs: | |
ID: 42542 | Rating: 0 | rate: / Reply Quote | |
I have both kind of these WUs: So how many errors did you get recently? If it's a small number, you could attribute that to running into the occasional bad WU. If you have a lot more, then it's more than just a Linux problem. For the record, I have 2 errors since the new year. All WUs on my machines are currently running okay and I hope it stays that way! So, I would say that I ran into 2 bad WUs. | |
ID: 42546 | Rating: 0 | rate: / Reply Quote | |
I've found the same behavior in my linux hosts, in WUs received since Jan-02-2016 past midday. | |
ID: 42547 | Rating: 0 | rate: / Reply Quote | |
So how many errors did you get recently? If it's a small number, you could attribute that to running into the occasional bad WU. If you have a lot more, then it's more than just a Linux problem. I have four errors recently. It's a bit more than usual. The two aborted WUs are my fault. | |
ID: 42549 | Rating: 0 | rate: / Reply Quote | |
I haven't seen the problem yet on a pair of GTX 960s. | |
ID: 42550 | Rating: 0 | rate: / Reply Quote | |
Task 14814161, work unit 11399908, computer 286919: sent 3 Jan 2016 16:38:22 UTC, reported 3 Jan 2016 16:39:05 UTC, status "Error while computing", run time 0.00, CPU time 0.00, credit ---, application "Long runs" | |
ID: 42551 | Rating: 0 | rate: / Reply Quote | |
I have seen the same thing on my linux host - all work units since the one below error out the same way | |
ID: 42553 | Rating: 0 | rate: / Reply Quote | |
Hello, I'm using Ubuntu 14.04 LTS. Gerard WUs stopped after 1 second and were uploaded. "Output file was absent" for four files at a time. I did some Collatz Conjecture earlier today, but I guess that didn't mess up my computer, as others are having problems too. | |
ID: 42554 | Rating: 0 | rate: / Reply Quote | |
Not sure if it's related, but I too just had an error with a Gerard unit, which is a rare thing to happen for me. Name e3s31_e2s25p1f424-GERARD_CXCL12_CHALC4_DIM1-0-1-RND7047_1 | |
ID: 42556 | Rating: 0 | rate: / Reply Quote | |
Since 16:48 UTC on the second of January, my Linux host (206986) has failed all WUs it has received. My 2 Windows hosts are working as usual. A quick look through the task lists for the top 10 users shows the same pattern. Has anyone come up with a theory as to what is happening? In the meantime I have set that host to NNT (No New Tasks) to avoid causing any congestion on the server side. | |
ID: 42561 | Rating: 0 | rate: / Reply Quote | |
This issue continues occurring on all my hosts (Linux). | |
ID: 42562 | Rating: 0 | rate: / Reply Quote | |
The Linux app binary has expired and needs to be updated. I'll get that done tomorrow, hopefully. | |
ID: 42563 | Rating: 0 | rate: / Reply Quote | |
Thanks Matt. Hope the update will improve its performance. | |
ID: 42565 | Rating: 0 | rate: / Reply Quote | |
WU e15s19_e14s24p1f286-GERARD_CXCL12_DIMPROTO3-0-1-RND3500_2 has been stuck at '85% "progress" ' for some 12 hours now. | |
ID: 42566 | Rating: 0 | rate: / Reply Quote | |
I'd suggest restarting the PC. And if the problem still persists, then abort the task. | |
ID: 42567 | Rating: 0 | rate: / Reply Quote | |
Hi, | |
ID: 42573 | Rating: 0 | rate: / Reply Quote | |
Yep, the GPUGRID tasks are still not processing on my Linux boxes. | |
ID: 42574 | Rating: 0 | rate: / Reply Quote | |
Okay, thanks for the update. I guess I'll just add a backup project myself until the issue is resolved. | |
ID: 42576 | Rating: 0 | rate: / Reply Quote | |
Same here, no joy for Linux hosts five days in a row. We don't seem to count for much with the project. | |
ID: 42577 | Rating: 0 | rate: / Reply Quote | |
The tasks were doing OK on my XP box. I moved the cards to a Win7 box and now they all error out in 2 seconds. Looks like moving the cards was a mistake but I can't move them back ATM. | |
ID: 42583 | Rating: 0 | rate: / Reply Quote | |
I managed to download 1 task that didn't error out in 2 seconds. *fingers crossed* | |
ID: 42585 | Rating: 0 | rate: / Reply Quote | |
Same here on Linux. All WUs error out, even after a reset of the project. | |
ID: 42587 | Rating: 0 | rate: / Reply Quote | |
Hi nanoprobe, | |
ID: 42591 | Rating: 0 | rate: / Reply Quote | |
Still not running under linux ... | |
ID: 42597 | Rating: 0 | rate: / Reply Quote | |
Hi nanoprobe, Yes. It had previously errored out on a Linux machine with 0 runtime and a Windows machine after about 60 minutes of run time. I have also received 6 more since that one that have completed and currently have 2 more in progress. For me all the version 8.4.1 tasks error out. Version 8.4.7 tasks seem to run fine with only an occasional error and unfortunately they run for hours before they go south. | |
ID: 42598 | Rating: 0 | rate: / Reply Quote | |
One day more without Linux crunching and without status info...who cares? | |
ID: 42600 | Rating: 0 | rate: / Reply Quote | |
One day more without Linux crunching and without status info...who cares? Someone didn't get their nap today. | |
ID: 42602 | Rating: 0 | rate: / Reply Quote | |
One day more without Linux crunching and without status info...who cares? No, don't be a jerk. This has been a known problem with a known cause for a week now and no one has bothered to fix it. For many years there was a significant performance boost when crunching with Linux at this project. The developers actually recommended that you crunch with Linux. Many of us have dedicated Linux hosts to this project due to that fact. Now my Linux hosts are having to crunch mathematics crap and look for pulsars to keep my house warm. Could someone please fix this? | |
ID: 42615 | Rating: 0 | rate: / Reply Quote | |
One day more without Linux crunching and without status info...who cares? No nap and lost your sense of humor? Go look in a mirror and take a chill pill man. This ain't life or death and GPUGrid doesn't revolve around you. | |
ID: 42616 | Rating: 0 | rate: / Reply Quote | |
That was weird. Triple post.????? | |
ID: 42617 | Rating: 0 | rate: / Reply Quote | |
Can't explain the triple post. | |
ID: 42618 | Rating: 0 | rate: / Reply Quote | |
Can't explain the triple post. You missed your nap? :) | |
ID: 42619 | Rating: 0 | rate: / Reply Quote | |
Guys! Matt is trying to fix it, see https://www.gpugrid.net/forum_thread.php?id=4235 . Apparently the solution must not be trivial. Please be patient! | |
ID: 42620 | Rating: 0 | rate: / Reply Quote | |
Now I am getting this same "Linux" error on both my Windows machines. | |
ID: 42621 | Rating: 0 | rate: / Reply Quote | |
Can't explain the triple post. Or I fell asleep at the keyboard. ;-) FWIW most of the tasks I'm getting are resends that have failed at least once on a Linux host. So far they have all run to completion on my host. Win7, Xeon E5 2683, twin GTX 970. Along with GPUGrid tasks I'm also running a full load of CPU tasks minus 2 threads each for the cards if that means anything. | |
ID: 42622 | Rating: 0 | rate: / Reply Quote | |
Six errors in the last day on Windows, so it has nothing to do with the Linux app. | |
ID: 42625 | Rating: 0 | rate: / Reply Quote | |
New app 848. | |
ID: 42626 | Rating: 0 | rate: / Reply Quote | |
New app 848. Matt, Is it just me or are others having download issues? From logs: 13444 GPUGRID 1/15/2016 10:05:43 AM Temporarily failed download of e1s22_2-GERARD_A2ARFX_luf6806_b1-1-e1s22_2-GERARD_A2ARFX_luf6806_b1-0-2-RND6779_1: transient HTTP error 13445 GPUGRID 1/15/2016 10:05:43 AM Backing off 00:02:20 on download of e1s22_2-GERARD_A2ARFX_luf6806_b1-1-e1s22_2-GERARD_A2ARFX_luf6806_b1-0-2-RND6779_1 13446 GPUGRID 1/15/2016 10:05:44 AM Temporarily failed download of e1s22_2-GERARD_A2ARFX_luf6806_b1-1-psf_file: transient HTTP error I'm only having this issue here. Sometimes it takes hours to get all the files for 1 task to run. | |
ID: 42627 | Rating: 0 | rate: / Reply Quote | |
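For anyone trying to quantify the download problem from their own event log, here is a minimal sketch. It assumes the log has been copied out as plain text with the same two message formats quoted above ("Temporarily failed download of ...: transient HTTP error" and "Backing off HH:MM:SS on download of ..."). The function name `summarize_downloads` is made up for illustration and is not part of BOINC.

```python
import re
from collections import Counter
from datetime import timedelta

# Patterns for the two event-log messages quoted in the post above.
FAILED = re.compile(r"Temporarily failed download of (\S+): transient HTTP error")
BACKOFF = re.compile(r"Backing off (\d+):(\d+):(\d+) on download of (\S+)")


def summarize_downloads(log_text):
    """Count transient download failures and total backoff time per file.

    Hypothetical helper: works on pasted log text, whether the entries
    are on separate lines or run together on one line.
    """
    failures = Counter(FAILED.findall(log_text))
    backoff = {}
    for hours, minutes, seconds, name in BACKOFF.findall(log_text):
        wait = timedelta(hours=int(hours), minutes=int(minutes), seconds=int(seconds))
        backoff[name] = backoff.get(name, timedelta()) + wait
    return failures, backoff


if __name__ == "__main__":
    sample = (
        "13445 GPUGRID 1/15/2016 10:05:43 AM Backing off 00:02:20 on download of "
        "e1s22_2-GERARD_A2ARFX_luf6806_b1-1-e1s22_2-GERARD_A2ARFX_luf6806_b1-0-2-RND6779_1 "
        "13446 GPUGRID 1/15/2016 10:05:44 AM Temporarily failed download of "
        "e1s22_2-GERARD_A2ARFX_luf6806_b1-1-psf_file: transient HTTP error"
    )
    failures, backoff = summarize_downloads(sample)
    for name, count in failures.items():
        print(f"{count} failed attempt(s): {name}")
    for name, wait in backoff.items():
        print(f"backed off {wait} in total: {name}")
```

A file that shows many failed attempts or a large total backoff is the kind of evidence that is useful to post here, since stuck downloads never appear in the task list on the website.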
Across 5 computers, all windows ranging from 7, 8.1, and 10, I have had about 130 errored out WUs in the past 24-30 hours. Over 100 of them are 0.00 second errors and the other 30 or so are with time put in. One of the computers crashed to an unrecoverable crash and needed manual assistance in the BIOS then the OS to get it back to a good state and running again. This caused aborted WUs or was caused by WUs. I didn't get a chance to check logs and drivers and recent updates and stuff, I just got it running again and walked away frustrated. | |
ID: 42628 | Rating: 0 | rate: / Reply Quote | |
New app 848. working OK in my host | |
ID: 42647 | Rating: 0 | rate: / Reply Quote | |
New app 848. Same here on my Linux hosts. | |
ID: 42650 | Rating: 0 | rate: / Reply Quote | |
Across 5 computers, all windows ranging from 7, 8.1, and 10, I have had about 130 errored out WUs in the past 24-30 hours. Over 100 of them are 0.00 second errors and the other 30 or so are with time put in. One of the computers crashed to an unrecoverable crash and needed manual assistance in the BIOS then the OS to get it back to a good state and running again. This caused aborted WUs or was caused by WUs. I didn't get a chance to check logs and drivers and recent updates and stuff, I just got it running again and walked away frustrated. Maybe a huge coincidence and maybe not, but another one of my computers died and would not reboot to anything but the BIOS. The one mentioned above runs Windows 7 and this one runs Windows 10, but both have the same processor (i7-4790K) and motherboard (Asus Z97-AR). Both are just about 11 months old. On the first one I was able to reset the BIOS to defaults, then got it to start in Safe Mode, and then without making changes it was able to reboot back to full Windows with no problems. On this one a stick of memory went bad, but even with the memory completely replaced, it needed BOINC completely uninstalled and all traces deleted to get it running outside of Safe Mode. I had to disable BOINC from starting with Sysinternals Autoruns, then uninstall and delete in full mode. After that I reinstalled BOINC and added the projects again. It simply would not start without freezing until the full deletion. I suspect it was the actual WUs it was crunching, which were marked as abandoned when I deleted the Program Data BOINC folder. At first I didn't post here about it because when I found it was a bad memory stick keeping it from starting, I suspected the power issue we had at the house that day. And the memory failure may very well have come from the power issue. 
But the machine would not start until I removed BOINC: it restarted several times with BOINC disabled, then froze as soon as I started BOINC manually, several times over. Even after uninstalling BOINC and all the NVIDIA drivers and services, I narrowed the freezing down to BOINC. Beyond that, I can only guess at the exact reason. I know that memory crashes can cause programs to 'break' if part of the program is still in memory at the time of the crash and not recoverable, but with BOINC's checkpoints I would not expect this to be a problem after a reinstall: even if the program itself was damaged, the WU it was crunching should go back to the most recent checkpoint and continue on. As this is an old subject now, related to a specific batch of WUs, this may be a moot point, but I thought it worth a late mention. Just in case some of these units are still roaming around, and some computers ran into problems that are only found late, this may serve as an answer and possibly a solution (with Autoruns in Safe Mode [With Networking if you need to download it, found with a google search "Sysinternals Autoruns"]). And as always, any feedback is appreciated, whether this helps fix an issue or you can help me avoid them. And if I can confirm or rule out any suspicions about something I forgot to mention, please ask. ____________ 1 Corinthians 9:16 "For though I preach the gospel, I have nothing to glory of: for necessity is laid upon me; yea, woe is unto me, if I preach not the gospel!" Ephesians 6:18-20, please ;-) http://tbc-pa.org | |
ID: 42678 | Rating: 0 | rate: / Reply Quote | |
Now I am the one who can't explain the double post. lol | |
ID: 42680 | Rating: 0 | rate: / Reply Quote | |
Matt: | |
ID: 42760 | Rating: 0 | rate: / Reply Quote | |
I haven't had any issues since the new app was released. | |
ID: 42792 | Rating: 0 | rate: / Reply Quote | |
I have, on both machines running here. Can someone at least look into this and try to fix it? | |
ID: 42809 | Rating: 0 | rate: / Reply Quote | |
I have on both machines running here. Can someone at least look into this and try to fix it? Non-stop successful work all month for me. I can't see your machines to see if there is a useful error message since you have them hidden. Have you tried resetting the project? That sometimes clears up HTTP errors. | |
ID: 42810 | Rating: 0 | rate: / Reply Quote | |
I have on both machines running here. Can someone at least look into this and try to fix it? Non-stop successful work all month for me. I can't see your machines to see if there is a useful error message since you have them hidden. I'll unhide my machines but I don't see how that will help. The error message is always the same. Rebooting/resetting doesn't help. Here are the latest 2. 138 GPUGRID 2/13/2016 3:59:44 PM Temporarily failed download of e19s5_e17s11p1f237-GERARD_CXCL12_CHLKDER_mol01-0-psf_file: transient HTTP error 139 GPUGRID 2/13/2016 3:59:44 PM Backing off 00:02:33 on download of e19s5_e17s11p1f237-GERARD_CXCL12_CHLKDER_mol01-0-psf_file 140 GPUGRID 2/13/2016 3:59:47 PM Temporarily failed download of e19s6_e17s11p1f311-GERARD_CXCL12_CHLKDER_mol01-0-psf_file: transient HTTP error 141 GPUGRID 2/13/2016 3:59:47 PM Backing off 00:02:29 on download of e19s6_e17s11p1f311-GERARD_CXCL12_CHLKDER_mol01-0-psf_file 142 2/13/2016 3:59:48 PM Project communication failed: attempting access to reference site 143 2/13/2016 3:59:49 PM Internet access OK - project servers may be temporarily down. | |
ID: 42814 | Rating: 0 | rate: / Reply Quote | |
I've seen this occasionally, but as far as I know have not had files stuck trying to download for hours as was mentioned in another thread. When I've seen it before it resolved in a few minutes. Tonight I observed a couple tasks stuck (probably around ten files in total), with a backoff time of one hour and 45 minutes. I manually tried downloading again and most of them completed, but a couple hung with the http transient error. I still have one file left which I guess will eventually finish. All my cards are busy so it's not a problem right now. | |
ID: 42815 | Rating: 0 | rate: / Reply Quote | |
You are right. There is nothing I can see from your tasks that would explain the errors. All your task list shows are the successful tasks, not the ones that it could not download. | |
ID: 42816 | Rating: 0 | rate: / Reply Quote | |
Looked in the log. Can't find any recent stuck files that stick for hours thankfully, like I've got in the past. However to show that it's still happening, here's some recent ones that were stuck for a shorter time: | |
ID: 42817 | Rating: 0 | rate: / Reply Quote | |
You are right. There is nothing I can see from your tasks that would explain the errors. All your task list shows are the successful tasks, not the ones that it could not download. They all eventually download and run to completion. It's waiting for hours while the downloads are stuck that is the issue. I can't offer any suggestions to find out where the issue is in the network. I can only tell you that everything in my path from my machine to GPUGrid does not exhibit that behavior any more. It did for a while a few weeks back, but it seems to have resolved itself with no action on my part. I wish the issue would "resolve itself" but so far that has not happened. | |
ID: 42819 | Rating: 0 | rate: / Reply Quote | |