Message boards : Number crunching : BAD PABLO_p53 WUs
So far I've had 23 of these bad PABLO_p53 WUs today. Think maybe they should be cancelled? | |
ID: 46695 | Rating: 0 | rate: / Reply Quote | |
Same for me. | |
ID: 46696 | Rating: 0 | rate: / Reply Quote | |
I also had 2 bad units from this bunch. | |
ID: 46697 | Rating: 0 | rate: / Reply Quote | |
The problem with canceling these units yourself is that the error will stay with you forever, but if you let one run its course until it gets 8 errors and becomes a "too many errors (may have a bug)" unit, it should disappear in time. That will eliminate the work unit, but I think Beyond is saying that they should cancel the entire series until they fix it. That will save a lot of people some time, though fortunately they error out quickly. | |
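For readers unfamiliar with the mechanism, here is a minimal sketch of the retirement rule described above, where a workunit is only removed after enough replicas have errored out. The threshold of 8 is taken from this post; the field names and return strings are assumptions for illustration, not GPUGRID's actual server code.

```python
# Not GPUGRID's actual server code: a minimal sketch of the retirement rule
# where a workunit is only removed after enough replicas have errored out.
# The threshold of 8 is taken from the post above; names are assumptions.

MAX_ERROR_RESULTS = 8  # assumed limit, per the "8 errors" mentioned above

def workunit_state(error_count: int, success_count: int) -> str:
    """Classify a workunit from the outcomes of its replicas so far."""
    if success_count > 0:
        return "completed"
    if error_count >= MAX_ERROR_RESULTS:
        # At this point the server stops resending it and it is flagged
        # "too many errors (may have a bug)"; it later drops off the pages.
        return "too many errors (may have a bug)"
    return "still being resent to other hosts"

if __name__ == "__main__":
    for errors in (3, 8):
        print(errors, "errors ->", workunit_state(errors, 0))
```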
ID: 46698 | Rating: 0 | rate: / Reply Quote | |
Same here under Linux: | |
ID: 46699 | Rating: 0 | rate: / Reply Quote | |
My main host got its daily quota of long workunits reduced to 1 because it had too many failures (caused by this bad batch). | |
ID: 46701 | Rating: 0 | rate: / Reply Quote | |
"That will eliminate the work unit, but I think Beyond is saying that they should cancel the entire series until they fix it. That will save a lot of people some time, though fortunately they error out quickly." They do error quickly, but it kicked me into my daily quota limit right at the beginning of a new day. $#%@^! | |
ID: 46702 | Rating: 0 | rate: / Reply Quote | |
also bad here: | |
ID: 46703 | Rating: 0 | rate: / Reply Quote | |
"also bad here:" This means that no new tasks can be downloaded before March 22, right? The incident early this morning shows that the daily quota policy should be revisited quickly. In this specific case it results in total nonsense. No idea how many people (I think: many) are now unable to download any new tasks for the whole of March 21. A rather bad thing. | |
ID: 46704 | Rating: 0 | rate: / Reply Quote | |
This is a generic error: all long workunits failed on all of my hosts overnight too, so all of my hosts are processing short runs now, but the short queue has run dry already. "The incident early this morning shows that the policy of the daily quota should be revisited quickly." The daily quota would decrease to 1 in every case if the project supplied only failing workunits; there's no problem with that. The problem is the waaay too high ratio of bad workunits in the queue. | |
ID: 46705 | Rating: 0 | rate: / Reply Quote | |
Since yesterday I have 10 Pablo long work units with an "unknown" error, and now I don't have any work unit to process. | |
ID: 46706 | Rating: 0 | rate: / Reply Quote | |
"... so all of my hosts are processing short runs now, but the short queue has run dry already." I switched to short runs in the early morning when, according to the Project Status Page, some were still available. However, the download of those was again refused, citing the "daily quota of 3 tasks" :-( | |
ID: 46707 | Rating: 0 | rate: / Reply Quote | |
I've had 9 Pablos fail on me. I'm only receiving short runs now. Returning to Einstein until this is sorted. | |
ID: 46708 | Rating: 0 | rate: / Reply Quote | |
"Since yesterday I have 10 Pablo long work units with an 'unknown' error, and now I don't have any work unit to process." There is nothing you can do, just wait 24hrs. There is not a problem with the daily quota; there is a massive problem with the dumping of faulty WUs into the queue. The system does not appear to be monitored to any great effect; if it was, somebody would have noticed and cancelled the WUs before this problem occurred. | |
ID: 46709 | Rating: 0 | rate: / Reply Quote | |
"The system does not appear to be monitored to any great effect; if it was, somebody would have noticed and cancelled the WUs before this problem occurred." It seems that the workunits went wrong at a certain point, but it wasn't clear that it would affect every batch. It took a couple of hours for the error to spread widely; now the situation is clear. It's very easy to be wise in retrospect. BTW, I've sent a notification email to a member of the staff. | |
ID: 46711 | Rating: 0 | rate: / Reply Quote | |
"There is nothing you can do, just wait 24hrs." The bad thing about this is that the GPUGRID people most probably cannot "reset" this 24-hour lock. I guess quite a number of crunchers would be pleased if this were possible. | |
ID: 46714 | Rating: 0 | rate: / Reply Quote | |
"There is nothing you can do, just wait 24hrs." They have to cure the underlying problem first - that's the priority when something like this happens. | |
ID: 46716 | Rating: 0 | rate: / Reply Quote | |
Oh crappity crap. I think I messed up the adaptive. Will check now. Thanks for pointing it out! | |
ID: 46718 | Rating: 0 | rate: / Reply Quote | |
"They have to cure the underlying problem first - that's the priority when something like this happens." That's clear to me - first the problem must be fixed, and then, if possible, some kind of "reset" should be made so that all crunchers whose downloads were stopped can download new tasks again. Although I am afraid that this will not be possible. | |
ID: 46719 | Rating: 0 | rate: / Reply Quote | |
Well, these broken tasks will have to run their course. But they crash on start, so they should be gone very quickly now. I fixed the bugs and we will restart the adaptives in a moment. | |
ID: 46720 | Rating: 0 | rate: / Reply Quote | |
People who are having task requests rejected because their quota is exhausted may wish to set 'No New Tasks' until they read that the faulty tasks have been flushed, and these new tasks are running successfully. | |
ID: 46721 | Rating: 0 | rate: / Reply Quote | |
"People who are having task requests rejected because their quota is exhausted may wish to set 'No New Tasks' until they read that the faulty tasks have been flushed, and these new tasks are running successfully." Nobody on this project is restricted to one task a day, but they are restricted to 2 a day because of the way computers count: 0 = 1, 1 = 2, etc. | |
ID: 46722 | Rating: 0 | rate: / Reply Quote | |
"My main host got its daily quota of long workunits reduced to 1 because it had too many failures (caused by this bad batch)." It's beyond annoying. I now have 6 hosts that won't get tasks because of these bad WUs. Two of the hosts not getting tasks are the fastest ones, with 1060 GPUs. Irritating. | |
ID: 46723 | Rating: 0 | rate: / Reply Quote | |
"People who are having task requests rejected because their quota is exhausted may wish to set 'No New Tasks' until they read that the faulty tasks have been flushed, and these new tasks are running successfully." I have tried this, however, without success. The only difference from before is that the BOINC notice no longer refers to the task limit per day, but simply says:
21/03/2017 18:36:42 | GPUGRID | No tasks are available for Long runs (8-12 hours on fastest card)
Why so? | |
ID: 46725 | Rating: 0 | rate: / Reply Quote | |
Even worse, all my Linux hosts got a coproc error, which means the bad batch crashed the drivers. So tasks from other projects failed too. | |
ID: 46726 | Rating: 0 | rate: / Reply Quote | |
for some reason, 2 of my PCs now received new tasks, one of them was a | |
ID: 46727 | Rating: 0 | rate: / Reply Quote | |
It's ridiculous that these bad tasks weren't canceled. How many machines have been denied work because of this laziness on the admins' part? I've personally received 137 of these bad WUs so far and now have 7 machines not accepting long WUs. Multiply this by how many users? This kind of thing can also happen at other projects, but they cancel the bad WUs when informed of the problem. Why not here? | |
ID: 46729 | Rating: 0 | rate: / Reply Quote | |
And still, for hours, the junk keeps rolling out. | |
ID: 46730 | Rating: 0 | rate: / Reply Quote | |
They should eliminate the "daily quota" for this particular situation and let us crunch! | |
ID: 46731 | Rating: 0 | rate: / Reply Quote | |
"Well, these broken tasks will have to run their course." That will be a long and frustrating process, as every host can have only one workunit per day, but right now 9 out of 10 workunits are broken (so the daily quota of the hosts won't rise for a while), and every workunit has to fail 7 times before it's cleared from the queue. To speed this up, I've created dummy hosts with my inactive host, and I've "killed" about 100 of these broken workunits. I had to abort some working units, but those are in the minority right now. | |
ID: 46734 | Rating: 0 | rate: / Reply Quote | |
The situation here is still unchanged. | |
ID: 46735 | Rating: 0 | rate: / Reply Quote | |
Relax everyone, we are where we are. I'm sure the admins are as frustrated as we are and are working to correct the situation. | |
ID: 46736 | Rating: 0 | rate: / Reply Quote | |
"Short WUs just got a boost in computation." What good is that if they cannot be downloaded? | |
ID: 46737 | Rating: 0 | rate: / Reply Quote | |
They can be downloaded, actually. | |
ID: 46738 | Rating: 0 | rate: / Reply Quote | |
"They can be downloaded, actually." NOT on my machines. I get the same notice about the "daily quota of 1 task" ... :-( | |
ID: 46739 | Rating: 0 | rate: / Reply Quote | |
"They can be downloaded, actually." The quota is applied per task type. You are likely to be suffering from a quota of one long task per day: if you allow short tasks in your preferences, it is possible (but rare) to get short tasks allocated - I have two machines running them at the moment, because of that. Here are the log entries from one of the affected machines:
22/03/2017 09:51:04 | GPUGRID | This computer has finished a daily quota of 1 tasks
22/03/2017 10:13:27 | GPUGRID | Scheduler request completed: got 2 new tasks
22/03/2017 10:13:27 | GPUGRID | No tasks are available for the applications you have selected
22/03/2017 10:13:27 | GPUGRID | No tasks are available for Long runs (8-12 hours on fastest card)
22/03/2017 10:13:27 | GPUGRID | Your preferences allow tasks from applications other than those selected | |
ID: 46740 | Rating: 0 | rate: / Reply Quote | |
"They can be downloaded, actually." In addition to Richard's response: you have long WUs running on three out of four of your machines. What more exactly do you want? | |
ID: 46741 | Rating: 0 | rate: / Reply Quote | |
What is shown in my log is unfortunately wrong. | |
ID: 46742 | Rating: 0 | rate: / Reply Quote | |
Further, the log shows that a | |
ID: 46743 | Rating: 0 | rate: / Reply Quote | |
"You are likely to be suffering from a quota of one long task per day: if you allow short tasks in your preferences, it is possible (but rare) to get short tasks allocated." That's what BOINC is showing me:
22/03/2017 13:12:42 | GPUGRID | No tasks are available for Short runs (2-3 hours on fastest card)
22/03/2017 13:12:42 | GPUGRID | No tasks are available for Long runs (8-12 hours on fastest card)
22/03/2017 13:12:42 | GPUGRID | This computer has finished a daily quota of 1 tasks
So I doubt that I could get short runs. (Your assumption is correct: I should be suffering from a long-runs quota only, since no short runs were selected when the "accident" happened.) | |
ID: 46744 | Rating: 0 | rate: / Reply Quote | |
To my surprise, the faulty / working ratio is much better than I expected. | |
ID: 46745 | Rating: 0 | rate: / Reply Quote | |
"... so the daily quota will be recovered in a couple of days." Still, it's a shame that there is no other mechanism in place for cases like the present one :-( | |
ID: 46746 | Rating: 0 | rate: / Reply Quote | |
"You are likely to be suffering from a quota of one long task per day: if you allow short tasks in your preferences, it is possible (but rare) to get short tasks allocated." The short queue is empty, and the scheduler won't send you anything from the long queue because of the host's decreased daily quota. You should wait for a couple of hours. | |
ID: 46747 | Rating: 0 | rate: / Reply Quote | |
"... so the daily quota will be recovered in a couple of days." You can't prepare a system for every abnormal situation. BTW, you'll receive workunits even while your daily quota is lower than its maximum. The only important factor is that a host should not receive many faulty workunits in a row, because that will "blacklist" the host for a day. This is a pretty good automatism to minimize the effects of a faulty host, as such a host would exhaust the queues in a very short time if there were nothing to limit the work assigned to it. Too bad that this generic error, combined with this self-defense, got all of our hosts blacklisted, but there's no defense against this self-defense. I've realized that we are the "device" which could put this project into such regrettable situations. | |
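To make the "self-defense" concrete, here is a minimal sketch of a quota rule of this general shape: failures cut a host's daily quota quickly, successes rebuild it one task at a time. The specific numbers and update rules are assumptions chosen only to illustrate the effect on a host that received a long run of faulty workunits; BOINC's real scheduler logic differs in detail.

```python
# Minimal sketch of a per-host daily-quota rule of the kind described above.
# Numbers and update rules are assumptions, not BOINC's actual implementation.

MAX_DAILY_QUOTA = 64  # assumed per-host cap
MIN_DAILY_QUOTA = 1   # the "blacklisted" floor seen during this incident

def update_quota(quota: int, task_succeeded: bool) -> int:
    if task_succeeded:
        return min(quota + 1, MAX_DAILY_QUOTA)  # slow recovery
    return max(quota // 2, MIN_DAILY_QUOTA)     # rapid cut on errors

if __name__ == "__main__":
    quota = MAX_DAILY_QUOTA
    for _ in range(10):                   # a run of faulty workunits
        quota = update_quota(quota, False)
    print("after the bad batch:", quota)  # stuck at the floor of 1
    for day in range(1, 6):               # one good task per day afterwards
        quota = update_quota(quota, True)
        print("day", day, "quota:", quota)
```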
ID: 46748 | Rating: 0 | rate: / Reply Quote | |
When this is all over there should be a publication badge for participation in faulty Pablo WUs ;-) | |
ID: 46749 | Rating: 0 | rate: / Reply Quote | |
"When this is all over there should be a publication badge for participation in faulty Pablo WUs ;-)" Indeed. This should be a special one, with a special design. I think of a crashed bug. :) | |
ID: 46750 | Rating: 0 | rate: / Reply Quote | |
"The short queue is empty, and the scheduler won't send you anything from the long queue because of the host's decreased daily quota. You should wait for a couple of hours." Sometimes you get a working long task, sometimes you get a faulty long task, sometimes you get a short task - it's very much the luck of the draw at the moment. I've had all three outcomes within the last hour. | |
ID: 46751 | Rating: 0 | rate: / Reply Quote | |
"... sometimes you get a faulty long task" This leads me to repeat my question: why were/are the faulty ones not eliminated from the queue? | |
ID: 46752 | Rating: 0 | rate: / Reply Quote | |
"Why were/are the faulty ones not eliminated from the queue?" My guess - and it is only a guess - is that the currently-available staff are all biochemical researchers, rather than specialist database repairers. BOINC server code provides tools for researchers to submit jobs directly, but identifying faulty (and only faulty) workunits for cancellation is a tricky business. We've had cases in the past when batches of tasks have been cancelled en bloc, including tasks in the middle of an apparently viable run. That caused even more vociferous complaints (of wasted electricity) than the current forced diversion of BOINC resources to other backup projects. Amateur meddling in technical matters (anything outside your personal professional skill) can cause more problems than it's worth. Stefan has owned up to making a mistake in preparing the workunit parameters: he has corrected that error, but he seems to have decided - wisely, in my opinion - not to risk dabbling in areas where he doesn't feel comfortable about his own level of expertise. | |
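For illustration only, a hypothetical sketch of the selection problem described in the post above: picking out workunits from the suspect batch that could be cancelled without throwing away runs already in progress. The record layout, state names and batch-tag test are invented for this example; it is not the BOINC admin tooling.

```python
# Hypothetical sketch of "cancel only what is safe to cancel".
# Record layout, state names and workunit names are invented for illustration.

def safe_to_cancel(wu: dict, batch_tag: str = "PABLO_p53") -> bool:
    """True only if the workunit is from the suspect batch and cancelling
    it cannot throw away viable work already under way."""
    if batch_tag not in wu["name"]:
        return False
    # Cancelling tasks mid-run caused the "wasted electricity" complaints,
    # so anything in progress or already successful is left alone.
    return not any(state in ("in_progress", "success")
                   for state in wu["replica_states"])

workunits = [
    {"name": "e34s5-PABLO_p53_mut_7-0-1", "replica_states": ["error", "error", "unsent"]},
    {"name": "e34s5-PABLO_p53_mut_7-0-2", "replica_states": ["in_progress"]},
    {"name": "e1s9-ADRIA_FOLDGREED10-0-2", "replica_states": ["unsent"]},
]
print([wu["name"] for wu in workunits if safe_to_cancel(wu)])
```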
ID: 46753 | Rating: 0 | rate: / Reply Quote | |
@Richard: what you are saying sounds logical | |
ID: 46754 | Rating: 0 | rate: / Reply Quote | |
Though I once posted some good numbers to the project, I've been away for a while and lost track of how BOINC ought to work. | |
ID: 46755 | Rating: 0 | rate: / Reply Quote | |
I've just picked up a 4th replication from workunit e34s5_e17s62p0f449-PABLO_p53_mut_7_DIS-0-1-RND8386. From the PABLO_p53 and the _4 at the end of the task name, I assumed the worst - but it's running just fine. Don't assume that every failure - even multiple failures - comes from a faulty workunit. | |
ID: 46756 | Rating: 0 | rate: / Reply Quote | |
"I've just picked up a 4th replication from workunit e34s5_e17s62p0f449-PABLO_p53_mut_7_DIS-0-1-RND8386. From the PABLO_p53 and the _4 at the end of the task name, I assumed the worst - but it's running just fine. Don't assume that every failure - even multiple failures - comes from a faulty workunit." If there's the "ERROR: file mdioload.cpp line 81: Unable to read bincoordfile" message in many of the previous tasks' stderr.txt output files, then it's a faulty task. The one you've received has failed 4 times, for 3 different reasons (but none of them is the one above):
1st & 3rd:
<message>
process exited with code 201 (0xc9, -55)
</message>
<stderr_txt>
# Unable to initialise. Check permissions on /dev/nvidia* (err=100)
</stderr_txt>
2nd (that's the most mysterious):
<message>
process exited with code 212 (0xd4, -44)
</message>
<stderr_txt>
</stderr_txt>
4th:
<message>
(unknown error) - exit code -80 (0xffffffb0)
</message>
<stderr_txt>
...
# Access violation : progress made, try to restart
called boinc_finish
</stderr_txt>
BTW, things are now back to normal (almost); some faulty workunits are still floating around. | |
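An illustrative helper (not project tooling) for telling these cases apart from saved stderr.txt files. The signature strings come from this thread; the file handling and category labels are assumptions made for the example.

```python
# Classify saved stderr.txt output against the failure signatures quoted
# in this thread. Categories and file handling are assumptions.

import sys

SIGNATURES = {
    "Unable to read bincoordfile": "bad PABLO_p53 batch (faulty workunit)",
    "Unable to initialise. Check permissions on /dev/nvidia*": "host/driver problem",
    "Access violation": "application crash on this host",
}

def classify(stderr_text: str) -> str:
    for signature, verdict in SIGNATURES.items():
        if signature in stderr_text:
            return verdict
    return "no known signature (empty or unrecognised stderr)"

if __name__ == "__main__":
    for path in sys.argv[1:]:
        with open(path, errors="replace") as f:
            print(f"{path}: {classify(f.read())}")
```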
ID: 46757 | Rating: 0 | rate: / Reply Quote | |
Has the problem been fixed? | |
ID: 46760 | Rating: 0 | rate: / Reply Quote | |
"Has the problem been fixed?" Yes. There could still be some faulty workunits in the long queue, but those are not threatening the daily quota. | |
ID: 46761 | Rating: 0 | rate: / Reply Quote | |
These error units are starting to disappear from the tasks pages. Soon they will all be gone, nothing more than a memory. | |
ID: 46799 | Rating: 0 | rate: / Reply Quote | |
Trouble is, I'm starting to see a new bad batch, like | |
ID: 46803 | Rating: 0 | rate: / Reply Quote | |
I just got one an hour ago, that failed after two seconds. | |
ID: 46804 | Rating: 0 | rate: / Reply Quote | |
e1s9_ubiquitin_100ns_8-ADRIA_FOLDGREED10_crystal_ss_contacts_100_ubiquitin_4-0-2-RND2702 is running OK, so they're not all bad. | |
ID: 46805 | Rating: 0 | rate: / Reply Quote | |
Same here, 6 broken Adria WUs out of 8 in 12 hours so far. Failing immediately. | |
ID: 46806 | Rating: 0 | rate: / Reply Quote | |
... and here too | |
ID: 46811 | Rating: 0 | rate: / Reply Quote | |
That's more likely to have been your host's fault rather than the WU's. Maybe due to a power failure or hard shutdown. The other one says "(simulation unstable)" - maybe due to overclocking. | |
ID: 46812 | Rating: 0 | rate: / Reply Quote | |
During the past days, there are still WUs which fail after a few seconds. | |
ID: 46839 | Rating: 0 | rate: / Reply Quote | |
"During the past days, there are still WUs which fail after a few seconds." This is still happening; I just had 2 in a row. | |
ID: 46865 | Rating: 0 | rate: / Reply Quote | |
"This is still happening, just had 2 in a row" Yes, this new problem is being discussed here: http://www.gpugrid.net/forum_thread.php?id=4545 | |
ID: 46866 | Rating: 0 | rate: / Reply Quote | |