
Message boards : Number crunching : BAD PABLO_p53 WUs

Profile Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Message 46695 - Posted: 20 Mar 2017 | 20:42:47 UTC

So far I've had 23 of these bad PABLO_p53 WUs today. Think maybe they should be cancelled?

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 46696 - Posted: 20 Mar 2017 | 21:32:08 UTC - in response to Message 46695.

Same for me.
http://www.gpugrid.net/forum_thread.php?id=4513&nowrap=true#46692

Bedrich Hajek
Joined: 28 Mar 09
Posts: 485
Credit: 11,084,353,479
RAC: 15,568,836
Message 46697 - Posted: 20 Mar 2017 | 21:43:59 UTC

I also had 2 bad units from this bunch.

The problem with canceling these units is that the error will stay with you forever, but if you let a unit run its course until it gets 8 errors and becomes a "too many errors (may have a bug)" unit, it should in time disappear.
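For reference, the retirement mechanism described above works roughly like the minimal Python sketch below; the field names and the threshold of 8 are assumptions based on generic BOINC server behaviour, not necessarily GPUGRID's actual configuration.

# Minimal sketch of how a BOINC-style server retires a workunit after
# repeated errors. Names and the threshold are illustrative assumptions.
def on_result_reported(workunit, success):
    """Update a workunit's counters when one of its replicas is reported back."""
    if success:
        workunit["success_count"] += 1
        return
    workunit["error_count"] += 1
    # Once enough replicas have failed, the workunit is marked
    # "too many errors (may have a bug)" and no further copies are sent out.
    if workunit["error_count"] >= workunit["max_error_results"]:
        workunit["status"] = "too many errors (may have a bug)"

wu = {"success_count": 0, "error_count": 0, "max_error_results": 8, "status": "in progress"}
for _ in range(8):                    # eight hosts each fail the task within seconds
    on_result_reported(wu, success=False)
print(wu["status"])                   # -> too many errors (may have a bug)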



Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 46698 - Posted: 20 Mar 2017 | 21:47:27 UTC - in response to Message 46697.
Last modified: 20 Mar 2017 | 21:47:40 UTC

The problem with canceling these units is that the error will stay with you forever, but if you let a unit run its course until it gets 8 errors and becomes a "too many errors (may have a bug)" unit, it should in time disappear.

That will eliminate the work unit, but I think Beyond is saying that they should cancel the entire series until they fix it. That will save a lot of people some time, though fortunately they error out quickly.

Profile koschi
Joined: 14 Aug 08
Posts: 124
Credit: 792,979,198
RAC: 17,226
Message 46699 - Posted: 20 Mar 2017 | 22:43:27 UTC

Same here under Linux:

ERROR: file mdioload.cpp line 81: Unable to read bincoordfile

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,206,655,749
RAC: 261,147
Message 46701 - Posted: 21 Mar 2017 | 1:42:13 UTC

My main host got its daily quota of long workunits reduced to 1 because it had too many failures (caused by this bad batch).
Luckily there are short runs (and one other long run), so my main host is not completely shut out of this project.
This is really annoying.
This batch was working fine previously.

WPrion
Joined: 30 Apr 13
Posts: 96
Credit: 2,646,134,111
RAC: 20,072,923
Message 46702 - Posted: 21 Mar 2017 | 4:20:17 UTC - in response to Message 46698.

That will eliminate the work unit, but I think Beyond is saying that they should cancel the entire series until they fix it. That will save a lot of people some time, though fortunately they error out quickly.


They do error quickly, but it kicked me into my daily quota limit right at the beginning of a new day. $#%@^!

Erich56
Joined: 1 Jan 15
Posts: 1132
Credit: 10,210,882,676
RAC: 29,470,062
Message 46703 - Posted: 21 Mar 2017 | 4:23:57 UTC

Also bad here:
for a few hours now, BOINC has not downloaded any new tasks, telling me that "the computer has finished a daily quota of 3 tasks" :-(((

This means that no new tasks can be downloaded before March 22, right?

Erich56
Joined: 1 Jan 15
Posts: 1132
Credit: 10,210,882,676
RAC: 29,470,062
Message 46704 - Posted: 21 Mar 2017 | 7:27:43 UTC - in response to Message 46703.

Also bad here:
for a few hours now, BOINC has not downloaded any new tasks, telling me that "the computer has finished a daily quota of 3 tasks" :-(((

This means that no new tasks can be downloaded before March 22, right?


The incident early this morning shows that the policy of the daily quota should be revisited quickly.
In this specific case, it results in total nonsense.

No idea how many people (I think many) are now unable to download any new tasks for the whole of March 21. A rather bad thing.

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,206,655,749
RAC: 261,147
Message 46705 - Posted: 21 Mar 2017 | 9:49:54 UTC
Last modified: 21 Mar 2017 | 9:50:23 UTC

This is a generic error: all long workunits failed overnight on all of my hosts too, so all of my hosts are processing short runs now, but the short queue has already run dry.

The incident early this morning shows that the policy of the daily quota should be revisited quickly.
In this specific case, it results in total nonsense.
The daily quota would decrease to 1 in every case if the project supplied only failing workunits; there's no problem with that. The problem is the waaay too high ratio of bad workunits in the queue.

Lluis
Joined: 22 Feb 14
Posts: 26
Credit: 672,639,304
RAC: 0
Message 46706 - Posted: 21 Mar 2017 | 10:52:52 UTC - in response to Message 46705.

Since yesterday I have 10 Pablo long work units with an "unknown" error, and now I don't have any work unit to process.
Does anyone have an idea of what to do? Any advice (other than processing short units)?

Erich56
Joined: 1 Jan 15
Posts: 1132
Credit: 10,210,882,676
RAC: 29,470,062
Message 46707 - Posted: 21 Mar 2017 | 11:32:47 UTC - in response to Message 46705.

... so all of my hosts are processing short runs now, but the short queue has already run dry.

I switched to short runs in the early morning when, according to the Project Status Page, some were still available.
However, the download of those was again refused, citing the "daily quota of 3 tasks" :-(

Matt
Joined: 11 Jan 13
Posts: 216
Credit: 846,538,252
RAC: 0
Message 46708 - Posted: 21 Mar 2017 | 11:54:32 UTC

I've had 9 Pablos fail on me. I'm only receiving short runs now. Returning to Einstein until this is sorted.

Betting Slip
Joined: 5 Jan 09
Posts: 670
Credit: 2,498,095,550
RAC: 0
Message 46709 - Posted: 21 Mar 2017 | 11:54:40 UTC - in response to Message 46706.

Since yesterday I have 10 Pablo long work units with an "unknown" error, and now I don't have any work unit to process.
Does anyone have an idea of what to do? Any advice (other than processing short units)?


There is nothing you can do except wait 24 hrs. There is not a problem with the daily quota; there is a massive problem with the dumping of faulty WUs into the queue.

The system does not appear to be monitored to any great effect; if it were, somebody would have noticed and cancelled the WUs before this problem occurred.

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,206,655,749
RAC: 261,147
Message 46711 - Posted: 21 Mar 2017 | 12:09:04 UTC - in response to Message 46709.

The system does not appear to be monitored to any great effect; if it were, somebody would have noticed and cancelled the WUs before this problem occurred.
It seems that the workunits went wrong at a certain point, but it wasn't clear that it would affect every batch. It took a couple of hours for the error to spread widely; now the situation is clear. It's very easy to be wise in retrospect.
BTW I've sent a notification email to a member of the staff.

Erich56
Joined: 1 Jan 15
Posts: 1132
Credit: 10,210,882,676
RAC: 29,470,062
Message 46714 - Posted: 21 Mar 2017 | 12:42:22 UTC - in response to Message 46709.

There is nothing you can do except wait 24 hrs.

The bad thing about this is that the GPUGRID people most probably cannot "reset" this 24-hour lock.
I guess quite a number of crunchers would be pleased if this were possible.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1620
Credit: 8,832,016,430
RAC: 19,852,073
Message 46716 - Posted: 21 Mar 2017 | 12:50:43 UTC - in response to Message 46714.

There is nothing you can do except wait 24 hrs.

The bad thing about this is that the GPUGRID people most probably cannot "reset" this 24-hour lock.
I guess quite a number of crunchers would be pleased if this were possible.

They have to cure the underlying problem first - that's the priority when something like this happens.

Stefan
Project administrator
Project developer
Project tester
Project scientist
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Message 46718 - Posted: 21 Mar 2017 | 13:15:49 UTC

Oh crappity crap. I think I messed up the adaptive. Will check now. Thanks for pointing it out!

Erich56
Joined: 1 Jan 15
Posts: 1132
Credit: 10,210,882,676
RAC: 29,470,062
Message 46719 - Posted: 21 Mar 2017 | 14:13:04 UTC - in response to Message 46716.

They have to cure the underlying problem first - that's the priority when something like this happens.

That's clear to me - first the problem must be fixed, and then, if possible, some kind of "reset" should be made so that all crunchers for which the downloads were stopped could download new tasks again.

Although I am afraid that this will not be possible.

Stefan
Project administrator
Project developer
Project tester
Project scientist
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Message 46720 - Posted: 21 Mar 2017 | 14:21:55 UTC

Well, these broken tasks will have to run their course. But they crash on start, so they should be gone very quickly now. I fixed the bugs and we will restart the adaptives in a moment.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1620
Credit: 8,832,016,430
RAC: 19,852,073
Message 46721 - Posted: 21 Mar 2017 | 14:48:12 UTC

People who are having task requests rejected because their quota is exhausted may wish to set 'No New Tasks' until they read that the faulty tasks have been flushed, and these new tasks are running successfully.

BOINC rebuilds the quota quickly when tasks are returned successfully, but if you're restricted to one task per day, and that one turns out to be a faulty one, you're stuck for another 24 hours.
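A rough Python sketch of the quota behaviour described above follows; the doubling-on-success / decrement-on-error rule and the cap value are assumptions based on the generic BOINC scheduler, and GPUGRID's exact settings may differ.

# Illustrative sketch of BOINC's per-host daily quota adjustment (assumed
# generic behaviour: double on a good result, decrement on a bad one).
PROJECT_DAILY_QUOTA = 50               # hypothetical project-wide cap per host

def on_result_reported(host, valid):
    if valid:
        # A good result rebuilds the allowance quickly (doubling, capped).
        host["max_jobs_per_day"] = min(host["max_jobs_per_day"] * 2, PROJECT_DAILY_QUOTA)
    else:
        # Each error knocks the allowance down, to a floor of 1.
        host["max_jobs_per_day"] = max(host["max_jobs_per_day"] - 1, 1)

host = {"max_jobs_per_day": 32}
for _ in range(40):                    # a run of instantly failing tasks from a bad batch
    on_result_reported(host, valid=False)
print(host["max_jobs_per_day"])        # -> 1: the host is effectively locked out for the day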

Betting Slip
Joined: 5 Jan 09
Posts: 670
Credit: 2,498,095,550
RAC: 0
Message 46722 - Posted: 21 Mar 2017 | 15:10:19 UTC - in response to Message 46721.

People who are having task requests rejected because their quota is exhausted may wish to set 'No New Tasks' until they read that the faulty tasks have been flushed, and these new tasks are running successfully.

BOINC rebuilds the quota quickly when tasks are returned successfully, but if you're restricted to one task per day, and that one turns out to be a faulty one, you're stuck for another 24 hours.


Nobody on this project is restricted to one task a day, but they are restricted to 2 a day because of the way computers count: 0 = 1, 1 = 2, etc.

Profile Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Message 46723 - Posted: 21 Mar 2017 | 15:19:15 UTC - in response to Message 46701.
Last modified: 21 Mar 2017 | 15:19:43 UTC

My main host got its daily quota of long workunits reduced to 1 because it had too many failures (caused by this bad batch).
Luckily there are short runs (and one other long run), so my main host is not completely shut out of this project.
This is really annoying.

It's beyond annoying. I now have 6 hosts that won't get tasks because of these bad WUs. Two of the hosts not getting tasks are the fastest ones with 1060 GPUs. Irritating.

Erich56
Joined: 1 Jan 15
Posts: 1132
Credit: 10,210,882,676
RAC: 29,470,062
Message 46725 - Posted: 21 Mar 2017 | 17:38:16 UTC - in response to Message 46721.

People who are having task requests rejected because their quota is exhausted may wish to set 'No New Tasks' until they read that the faulty tasks have been flushed, and these new tasks are running successfully.

BOINC rebuilds the quota quickly when tasks are returned successfully, but if you're restricted to one task per day, and that one turns out to be a faulty one, you're stuck for another 24 hours.

I have tried this, however, without success.
The only difference from before is that the BOINC notice no longer refers to the task limit per day, but simply says

21/03/2017 18:36:42 | GPUGRID | No tasks are available for Long runs (8-12 hours on fastest card)

Why so?

Greger
Joined: 6 Jan 15
Posts: 76
Credit: 24,192,102,249
RAC: 13,992,829
Message 46726 - Posted: 21 Mar 2017 | 17:43:52 UTC - in response to Message 46723.

Even worse is that all my Linux hosts got a coproc error, which means the bad batch crashed the drivers, so other projects failed too.

A restart has now been done, and it might crash again if there are still bad tasks out there.

Erich56
Joined: 1 Jan 15
Posts: 1132
Credit: 10,210,882,676
RAC: 29,470,062
Message 46727 - Posted: 21 Mar 2017 | 17:48:55 UTC
Last modified: 21 Mar 2017 | 17:57:19 UTC

For some reason, 2 of my PCs have now received new tasks; one of them was a
PABLO_contact_goal_KIX_CMYB

and even this one failed after a few seconds.
Until now I thought that only PABLO_p53 tasks were affected.

Edit: Only now do I realize that others of my PCs had the same problem during the day with all kinds of different WUs, not only PABLO_p53.

Can it be that all recent WUs, regardless of the type, were faulty?

Profile Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Message 46729 - Posted: 21 Mar 2017 | 21:24:25 UTC

It's ridiculous that these bad tasks weren't canceled. How many machines have been denied work because of this laziness on the admins' part? I've personally received 137 of these bad WUs so far and now have 7 machines not accepting long WUs. Multiply this by how many users? This kind of thing can also happen at other projects, but they cancel the bad WUs when informed of the problem. Why not here?

Tom Miller
Joined: 21 Nov 14
Posts: 5
Credit: 1,081,640,766
RAC: 0
Message 46730 - Posted: 21 Mar 2017 | 23:00:49 UTC - in response to Message 46720.

And still, for hours, the junk keeps rolling out.

If we the volunteers are donating our GPUs and electrons to help in what we're to believe is real science, I would hope the people using our resources would have a somewhat better way of administering them.

Bedrich Hajek
Joined: 28 Mar 09
Posts: 485
Credit: 11,084,353,479
RAC: 15,568,836
Message 46731 - Posted: 22 Mar 2017 | 0:51:49 UTC

They should eliminate the "daily quota" for this particular situation, and let us crunch!



Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,206,655,749
RAC: 261,147
Message 46734 - Posted: 22 Mar 2017 | 4:00:31 UTC - in response to Message 46720.
Last modified: 22 Mar 2017 | 4:10:57 UTC

Well, these broken tasks will have to run their course.
That will be a long and frustrating process, as every host can have only one workunit per day, but right now 9 out of 10 workunits are broken (so the daily quota of the hosts won't rise for a while), and every workunit has to fail 7 times before it's cleared from the queue.
To speed this up, I've created dummy hosts with my inactive host, and I've "killed" about 100 of these broken workunits. I had to abort some working units, but those are the minority right now.
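To put rough numbers on why letting the batch burn itself out is slow, here is a back-of-the-envelope Python estimate; the batch size and host count are hypothetical, and only the "7 failures per workunit" figure comes from the post above.

# Hypothetical estimate of how long a bad batch takes to flush itself
# when most hosts are throttled to about one long task per day.
bad_workunits      = 1000    # assumed size of the broken batch
failures_to_clear  = 7       # failed replicas before a workunit is retired (per the post)
active_hosts       = 2000    # assumed hosts still requesting long tasks
tasks_per_host_day = 1       # reduced daily quota while the errors keep coming

failed_results_needed = bad_workunits * failures_to_clear              # 7000 failed results
days_to_flush = failed_results_needed / (active_hosts * tasks_per_host_day)
print(f"roughly {days_to_flush:.1f} days to drain the batch")          # -> roughly 3.5 days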

Erich56
Joined: 1 Jan 15
Posts: 1132
Credit: 10,210,882,676
RAC: 29,470,062
Message 46735 - Posted: 22 Mar 2017 | 4:18:50 UTC

The situation here is still unchanged.
One of my 4 hosts luckily got a "good" WU some time last night and is crunching it.
On all other hosts BOINC still tells me

22/03/2017 05:14:41 | GPUGRID | This computer has finished a daily quota of 1 tasks

What I don't understand is why all these broken WUs cannot be removed from the queue, and why GPUGRID cannot somehow reset this daily quota junk.

By now, my frustration has reached quite a level :-(

Betting Slip
Joined: 5 Jan 09
Posts: 670
Credit: 2,498,095,550
RAC: 0
Message 46736 - Posted: 22 Mar 2017 | 10:16:56 UTC

Relax everyone, we are where we are. I'm sure the admins are as frustrated as we are and are working to correct the situation.

On the bright side, short WUs just got a boost in computation.

Erich56
Joined: 1 Jan 15
Posts: 1132
Credit: 10,210,882,676
RAC: 29,470,062
Message 46737 - Posted: 22 Mar 2017 | 10:27:01 UTC - in response to Message 46736.

short WUs just got a boost in computation.

How does that help if they cannot be downloaded?

Loohi
Joined: 27 Aug 16
Posts: 16
Credit: 43,745,875
RAC: 0
Message 46738 - Posted: 22 Mar 2017 | 10:50:47 UTC - in response to Message 46737.


How does that help if they cannot be downloaded?


They can be downloaded, actually.

Erich56
Joined: 1 Jan 15
Posts: 1132
Credit: 10,210,882,676
RAC: 29,470,062
Message 46739 - Posted: 22 Mar 2017 | 11:21:11 UTC - in response to Message 46738.
Last modified: 22 Mar 2017 | 11:22:26 UTC

They can be downloaded, actually.

NOT on my machines. The same notice about the "daily quota of 1 task" keeps coming up ... :-(

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1620
Credit: 8,832,016,430
RAC: 19,852,073
Message 46740 - Posted: 22 Mar 2017 | 11:28:33 UTC - in response to Message 46739.
Last modified: 22 Mar 2017 | 11:30:15 UTC

They can be downloaded, actually.

NOT on my machines. The same notice about the "daily quota of 1 task" keeps coming up ... :-(

The quota is applied per task type. You are likely to be suffering from a quota of one long task per day: if you allow short tasks in your preferences, it is possible (but rare) to get short tasks allocated - I have two machines running them at the moment, because of that.

Here are the log entries from one of the affected machines:

22/03/2017 09:51:04 | GPUGRID | This computer has finished a daily quota of 1 tasks
22/03/2017 10:13:27 | GPUGRID | Scheduler request completed: got 2 new tasks
22/03/2017 10:13:27 | GPUGRID | No tasks are available for the applications you have selected
22/03/2017 10:13:27 | GPUGRID | No tasks are available for Long runs (8-12 hours on fastest card)
22/03/2017 10:13:27 | GPUGRID | Your preferences allow tasks from applications other than those selected

Betting Slip
Joined: 5 Jan 09
Posts: 670
Credit: 2,498,095,550
RAC: 0
Message 46741 - Posted: 22 Mar 2017 | 11:35:05 UTC - in response to Message 46739.
Last modified: 22 Mar 2017 | 11:37:36 UTC

They can be downloaded, actually.

NOT on my machines. The same notice about the "daily quota of 1 task" keeps coming up ... :-(


In addition to Richard's response: you have long WUs running on three out of four of your machines. What more exactly do you want?

Erich56
Joined: 1 Jan 15
Posts: 1132
Credit: 10,210,882,676
RAC: 29,470,062
Message 46742 - Posted: 22 Mar 2017 | 11:44:07 UTC - in response to Message 46741.

What is shown in my log is unfortunately wrong.

There is a total of 2 tasks running now. One on the slow GTX750Ti which was obviously not affacted the same way as the faster machines.
And one, to my surprise, on the GTX970.

The log erroneously shows 2 tasks on the PC with the two GTX980Ti; however, no tasks are being crunched there.

Then there is another PC with a GTX750Ti, which still shows the "quota of 1 task per day" notice.

It would of course be great if I could finally run tasks on the two GTX980Tis.

Erich56
Joined: 1 Jan 15
Posts: 1132
Credit: 10,210,882,676
RAC: 29,470,062
Message 46743 - Posted: 22 Mar 2017 | 11:47:29 UTC

Further, the log shows that a

PABLO_contact_goal_KIX_CMYB-0-4-RND2705_5

was downloaded at 10:11 hrs UTC this morning, and also errored out after a few seconds.
Can these faulty WUs indeed not be eliminated from the queue?

Erich56
Joined: 1 Jan 15
Posts: 1132
Credit: 10,210,882,676
RAC: 29,470,062
Message 46744 - Posted: 22 Mar 2017 | 12:15:53 UTC - in response to Message 46740.

You are likely to be suffering from a quota of one long task per day: if you allow short tasks in your preferences, it is possible (but rare) to get short tasks allocated

That's what BOINC is showing me:

22/03/2017 13:12:42 | GPUGRID | No tasks are available for Short runs (2-3 hours on fastest card)
22/03/2017 13:12:42 | GPUGRID | No tasks are available for Long runs (8-12 hours on fastest card)
22/03/2017 13:12:42 | GPUGRID | This computer has finished a daily quota of 1 tasks

So I doubt that I could get short runs.
(Your assumption is correct: I should be suffering from a long-runs quota only, since no short runs were selected when the "accident" happened.)

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,206,655,749
RAC: 261,147
Message 46745 - Posted: 22 Mar 2017 | 12:22:34 UTC

To my surprise, the faulty / working ratio is much better than I expected.
I did a test with my dummy host again, and only 18 of 48 workunits were faulty.
I've received some of the new (working) workunits on my live hosts too, so the daily quota will be recovered in a couple of days.

Erich56
Joined: 1 Jan 15
Posts: 1132
Credit: 10,210,882,676
RAC: 29,470,062
Message 46746 - Posted: 22 Mar 2017 | 12:24:56 UTC - in response to Message 46745.

... so the daily quota will be recovered in a couple of days.

Still, it's a shame that there is no other mechanism in place for cases like the present one :-(

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,206,655,749
RAC: 261,147
Message 46747 - Posted: 22 Mar 2017 | 12:26:17 UTC - in response to Message 46744.

You are likely to be suffering from a quota of one long task per day: if you allow short tasks in your preferences, it is possible (but rare) to get short tasks allocated

That's what BOINC is showing me:

22/03/2017 13:12:42 | GPUGRID | No tasks are available for Short runs (2-3 hours on fastest card)
22/03/2017 13:12:42 | GPUGRID | No tasks are available for Long runs (8-12 hours on fastest card)
22/03/2017 13:12:42 | GPUGRID | This computer has finished a daily quota of 1 tasks

So I doubt that I could get short runs.
(Your assumption is correct: I should be suffering from a long-runs quota only, since no short runs were selected when the "accident" happened.)
The short queue is empty, and the scheduler won't send you anything from the long queue because of the host's decreased daily quota. You should wait for a couple of hours.

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,206,655,749
RAC: 261,147
Message 46748 - Posted: 22 Mar 2017 | 12:50:53 UTC - in response to Message 46746.

... so the daily quota will be recovered in a couple of days.

Still, it's a shame that there is no other mechanism in place for cases like the present one :-(
You can't prepare a system for every abnormal situation. BTW you'll still receive some workunits while your daily quota is lower than its maximum. The only important factor is that a host should not receive many faulty workunits in a row, because that will "blacklist" the host for a day. This is a pretty good automatic defense to minimize the effects of a faulty host, as such a host would exhaust the queues in a very short time if there were nothing to limit the work assigned to it. Too bad that this generic error, combined with this self-defense, got all of our hosts blacklisted, and there's no defense against this self-defense. I've realized that we volunteers are the "device" that could keep this project running in such regrettable situations.

WPrion
Joined: 30 Apr 13
Posts: 96
Credit: 2,646,134,111
RAC: 20,072,923
Message 46749 - Posted: 22 Mar 2017 | 12:56:02 UTC

When this is all over there should be a publication badge for participation in faulty Pablo WUs ;-)

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,206,655,749
RAC: 261,147
Message 46750 - Posted: 22 Mar 2017 | 13:12:20 UTC - in response to Message 46749.

When this is all over there should be a publication badge for participation in faulty Pablo WUs ;-)
Indeed. This should be a special one, with a special design. I'm thinking of a crashed bug. :)

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1620
Credit: 8,832,016,430
RAC: 19,852,073
Message 46751 - Posted: 22 Mar 2017 | 13:35:28 UTC - in response to Message 46747.

The short queue is empty, and the scheduler won't send you anything from the long queue because of the host's decreased daily quota. You should wait for a couple of hours.

Sometimes you get a working long task, sometimes you get a faulty long task, sometimes you get a short task - it's very much the luck of the draw at the moment. I've had all three outcomes within the last hour.

Erich56
Joined: 1 Jan 15
Posts: 1132
Credit: 10,210,882,676
RAC: 29,470,062
Message 46752 - Posted: 22 Mar 2017 | 13:53:18 UTC - in response to Message 46751.

... sometimes you get a faulty long task

This leads me to repeat my question: why were/are the faulty ones not eliminated from the queue?

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1620
Credit: 8,832,016,430
RAC: 19,852,073
Message 46753 - Posted: 22 Mar 2017 | 14:13:09 UTC - in response to Message 46752.

why were/are the faulty ones not eliminated from the queue?

My guess - and it is only a guess - is that the currently-available staff are all biochemical researchers, rather than specialist database repairers. BOINC server code provides tools for researchers to submit jobs directly, but identifying faulty (and only faulty) workunits for cancellation is a tricky business. We've had cases in the past when batches of tasks have been cancelled en bloc, including tasks in the middle of an apparently viable run. That caused even more vociferous complaints (of wasted electricity) than the current forced diversion of BOINC resources to other backup projects.

Amateur meddling in technical matters (anything outside your personal professional skill) can cause more problems than it's worth. Stefan has owned up to making a mistake in preparing the workunit parameters: he has corrected that error, but he seems to have decided - wisely, in my opinion - not to risk dabbling in areas where he doesn't feel comfortable about his own level of expertise.

Erich56
Joined: 1 Jan 15
Posts: 1132
Credit: 10,210,882,676
RAC: 29,470,062
Message 46754 - Posted: 22 Mar 2017 | 14:29:14 UTC

@Richard: What you are saying sounds logical.

Ken Florian
Joined: 4 May 12
Posts: 56
Credit: 1,832,989,878
RAC: 0
Message 46755 - Posted: 22 Mar 2017 | 20:24:35 UTC

Though I once posted some good numbers to the project, I've been away for a while and lost track of how BOINC ought to work.

I still do not have new tasks after my own set of failed tasks.

Is there anything I need to do to "clear my name" so that I get tasks?

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1620
Credit: 8,832,016,430
RAC: 19,852,073
Message 46756 - Posted: 22 Mar 2017 | 22:50:18 UTC

I've just picked up a 4th replication from workunit e34s5_e17s62p0f449-PABLO_p53_mut_7_DIS-0-1-RND8386. From the PABLO_p53 and the _4 at the end of the task name, I assumed the worst - but it's running just fine. Don't assume that every failure - even multiple failures - comes from a faulty workunit.

As to what to do about it - just allow/encourage your computer to request work once each day. Perhaps you will be lucky and get a good one at the next attempt, or you may end up with several more days' wait. It'll work out in the end.
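As an aside, the "_4" suffix follows the usual BOINC convention of appending the replication index to the workunit name, so the workunit and the attempt number can be read straight off the task name; the small Python helper below is purely illustrative.

# Illustrative helper: a BOINC task name is the workunit name plus "_<replication index>".
def split_task_name(task_name):
    workunit_name, _, index = task_name.rpartition("_")
    return workunit_name, int(index)

wu_name, index = split_task_name("e34s5_e17s62p0f449-PABLO_p53_mut_7_DIS-0-1-RND8386_4")
print(wu_name)   # e34s5_e17s62p0f449-PABLO_p53_mut_7_DIS-0-1-RND8386
print(index)     # 4 -> replication index 4, i.e. four earlier copies were issued before this one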

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,206,655,749
RAC: 261,147
Message 46757 - Posted: 23 Mar 2017 | 8:12:20 UTC - in response to Message 46756.

I've just picked up a 4th replication from workunit e34s5_e17s62p0f449-PABLO_p53_mut_7_DIS-0-1-RND8386. From the PABLO_p53 and the _4 at the end of the task name, I assumed the worst - but it's running just fine. Don't assume that every failure - even multiple failures - comes from a faulty workunit.
If there's the
ERROR: file mdioload.cpp line 81: Unable to read bincoordfile
message in many of the previous tasks' stderr.txt output files, then it's a faulty task.
The one you've received has failed 4 times, for 3 different reasons (but none of them is the one above):

1st & 3rd:
<message> process exited with code 201 (0xc9, -55) </message> <stderr_txt> # Unable to initialise. Check permissions on /dev/nvidia* (err=100) </stderr_txt>

2nd (that's the most mysterious):
<message> process exited with code 212 (0xd4, -44) </message> <stderr_txt> </stderr_txt>

4th:
<message> (unknown error) - exit code -80 (0xffffffb0) </message> <stderr_txt> ... # Access violation : progress made, try to restart called boinc_finish </stderr_txt>

BTW things are now back to normal (almost); some faulty workunits are still floating around.

PappaLitto
Joined: 21 Mar 16
Posts: 511
Credit: 4,672,242,755
RAC: 0
Message 46760 - Posted: 23 Mar 2017 | 18:07:10 UTC

Has the problem been fixed?

Profile Retvari Zoltan
Joined: 20 Jan 09
Posts: 2343
Credit: 16,206,655,749
RAC: 261,147
Message 46761 - Posted: 23 Mar 2017 | 18:54:14 UTC - in response to Message 46760.

Has the problem been fixed?
Yes.
There could still be some faulty workunits in the long queue, but those are not threatening the daily quota.

Bedrich Hajek
Joined: 28 Mar 09
Posts: 485
Credit: 11,084,353,479
RAC: 15,568,836
Message 46799 - Posted: 31 Mar 2017 | 10:54:38 UTC

These error units are starting to disappear from the tasks pages. Soon they will all be gone, nothing more than a memory.


Good bye!!!!


Richard Haselgrove
Joined: 11 Jul 09
Posts: 1620
Credit: 8,832,016,430
RAC: 19,852,073
Message 46803 - Posted: 31 Mar 2017 | 17:32:53 UTC

Trouble is, I'm starting to see a new bad batch, like

e1s2_ubiquitin_50ns_1-ADRIA_FOLDGREED10_crystal_ss_contacts_50_ubiquitin_1-0-1-RND7532

I've seen failures for each of contacts_20, contacts_50, and contacts_100.

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 46804 - Posted: 31 Mar 2017 | 17:54:16 UTC - in response to Message 46803.

I just got one an hour ago that failed after two seconds.
e1s2_ubiquitin_20ns_1-ADRIA_FOLDGREED10_crystal_ss_contacts_20_ubiquitin_6-0-1-RND9359

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1620
Credit: 8,832,016,430
RAC: 19,852,073
Message 46805 - Posted: 31 Mar 2017 | 21:30:23 UTC

e1s9_ubiquitin_100ns_8-ADRIA_FOLDGREED10_crystal_ss_contacts_100_ubiquitin_4-0-2-RND2702 is running OK, so they're not all bad.

Loohi
Joined: 27 Aug 16
Posts: 16
Credit: 43,745,875
RAC: 0
Message 46806 - Posted: 1 Apr 2017 | 3:58:45 UTC

Same here: 6 broken ADRIA WUs out of 8 in 12 hours so far, failing immediately.

Killersocke
Joined: 18 Oct 13
Posts: 53
Credit: 406,647,419
RAC: 0
Message 46811 - Posted: 1 Apr 2017 | 23:59:56 UTC
Last modified: 2 Apr 2017 | 0:24:44 UTC

... and here too:
02.04.2017 01:58:39 | GPUGRID | Started download of e1s17_ubiquitin_100ns_16-ADRIA_FOLDGREED90_crystal_ss_contacts_100_ubiquitin_8-0-psf_file

SWAN : FATAL Unable to load module .mshake_kernel.cu. (999)

Next WU:
e1s17_ubiquitin_100ns_16-ADRIA_FOLDGREED90_crystal_ss_contacts_100_ubiquitin_8-0-2-RND0956_3

Betting Slip
Joined: 5 Jan 09
Posts: 670
Credit: 2,498,095,550
RAC: 0
Message 46812 - Posted: 2 Apr 2017 | 2:17:59 UTC - in response to Message 46811.


SWAN : FATAL Unable to load module .mshake_kernel.cu. (999)



That's more likely to have been your host's fault than the WU's, maybe due to a power failure or hard shutdown.

The other one says "(simulation unstable)", maybe due to overclocking.

Erich56
Joined: 1 Jan 15
Posts: 1132
Credit: 10,210,882,676
RAC: 29,470,062
Message 46839 - Posted: 8 Apr 2017 | 8:03:25 UTC

Over the past few days, there have still been WUs which fail after a few seconds.
Latest example:

e1s17_ubiquitin_100ns_16-ADRIA_FOLDGREED90_crystal_ss_contacts_100_ubiquitin_1-0-2-RND7346_8

ending after 4.38 seconds.

What I notice is that these are particularly the "...ubiquitin..." tasks.
Any explanation why this happens?

Loohi
Joined: 27 Aug 16
Posts: 16
Credit: 43,745,875
RAC: 0
Message 46865 - Posted: 14 Apr 2017 | 16:35:23 UTC - in response to Message 46839.

Over the past few days, there have still been WUs which fail after a few seconds.
Latest example:

e1s17_ubiquitin_100ns_16-ADRIA_FOLDGREED90_crystal_ss_contacts_100_ubiquitin_1-0-2-RND7346_8

ending after 4.38 seconds.

What I notice is that these are particularly the "...ubiquitin..." tasks.
Any explanation why this happens?


This is still happening; I just had 2 in a row.

Erich56
Joined: 1 Jan 15
Posts: 1132
Credit: 10,210,882,676
RAC: 29,470,062
Message 46866 - Posted: 14 Apr 2017 | 16:39:13 UTC - in response to Message 46865.

This is still happening; I just had 2 in a row.

Yes, this new problem is being discussed here:
http://www.gpugrid.net/forum_thread.php?id=4545
