Message boards : Number crunching : Scheduling configurations updated

ignasi
Joined: 10 Apr 08
Posts: 254
Credit: 16,836,000
RAC: 0
Level
Pro
Message 7135 - Posted: 3 Mar 2009 | 16:32:35 UTC

We have modified some scheduling parameters to send retries for failed WUs to "reliable hosts".
Let us know if you see any problem with the scheduler.

thanks,
ignasi

Profile Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Level
Val
Message 7161 - Posted: 4 Mar 2009 | 10:23:57 UTC - in response to Message 7135.

We have modified some scheduling parameters to send retries for failed WUs to "reliable hosts".
Let us know if you see any problem with the scheduler.


Well, it still won't let me queue up 87 days' worth of work ... :)

Profile GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 14 Mar 07
Posts: 1957
Credit: 629,356
RAC: 0
Level
Gly
Message 7213 - Posted: 5 Mar 2009 | 9:28:22 UTC - in response to Message 7161.

So far it seems to work.

For everyone's information: a host with an error rate below 5% that returns WUs within 24 hours on average is classified as RELIABLE. This is approximately 25% of all hosts.

The advantage for these hosts is that they will receive all available WUs, so they are unlikely to be left without work, while other hosts receive only the subset of WUs for which a failure is not a problem.

Rationale: when we send out a batch of WUs, it is very important for us to get all of them back. If a single one is missing, we have to wait before performing the analysis.

gdf
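
A minimal Python sketch of the classification rule described in the post above, for illustration only. The function name, the parameter names, and the idea of passing in precomputed averages are assumptions; only the 5% and 24-hour thresholds come from the post, and this is not the actual scheduler code.

```python
# Thresholds quoted in the post; everything else here is hypothetical.
MAX_ERROR_RATE = 0.05           # error rate must be below 5%
MAX_AVG_TURNAROUND_HOURS = 24   # WUs must come back within 24 hours on average


def is_reliable(error_rate: float, avg_turnaround_hours: float) -> bool:
    """Return True if a host meets both reliability criteria."""
    return (error_rate < MAX_ERROR_RATE
            and avg_turnaround_hours <= MAX_AVG_TURNAROUND_HOURS)


# A host with a 3% error rate that returns WUs in 18 hours on average
print(is_reliable(0.03, 18.0))   # True
# Same error rate, but too slow on average
print(is_reliable(0.03, 30.0))   # False
```

Both conditions have to hold at once, which is why a slow but error-free host would not qualify under this reading.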

Profile Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Level
Val
Message 7223 - Posted: 5 Mar 2009 | 21:49:02 UTC

Can this be flagged in the computer data page? It should be one of the public data items ... not only will it let us know how our machines are doing, it can help with debugging them ...

This should be rolled back into the UCB baseline also if you get it working ... all the world wants to know ... :)

uBronan
Joined: 1 Feb 09
Posts: 139
Credit: 575,023
RAC: 0
Level
Gly
Message 7230 - Posted: 6 Mar 2009 | 9:24:59 UTC

Does this have to be within 24 hours? In my opinion, slower machines can be more reliable than faster ones.
On my machine I have never had a computation error, other than mistakes caused by the BOINC client being detached without my wanting that.

The Uncle B's
Joined: 16 Jan 09
Posts: 1
Credit: 4,893,709
RAC: 0
Level
Ala
Message 8021 - Posted: 31 Mar 2009 | 20:26:33 UTC - in response to Message 7230.
Last modified: 31 Mar 2009 | 20:26:58 UTC

So, is this error message associated with this? It's for a brand new machine, with no history!

3/31/2009 12:23:29 PM|GPUGRID|Message from server: No work sent
3/31/2009 12:23:29 PM|GPUGRID|Message from server: (reached daily quota of 8 results)
3/31/2009 12:23:29 PM|GPUGRID|Message from server: (Project has no jobs available)

Profile Stefan Ledwina
Joined: 16 Jul 07
Posts: 464
Credit: 237,700,010
RAC: 4,775,275
Level
Leu
Message 8023 - Posted: 31 Mar 2009 | 20:54:41 UTC - in response to Message 8021.

So, is this error message associated with this? It's for a brand new machine, with no history!

3/31/2009 12:23:29 PM|GPUGRID|Message from server: No work sent
3/31/2009 12:23:29 PM|GPUGRID|Message from server: (reached daily quota of 8 results)
3/31/2009 12:23:29 PM|GPUGRID|Message from server: (Project has no jobs available)


Do you have a link to the host? Is it this one? http://www.gpugrid.net/results.php?hostid=31246? If it is that host - well it only has computation errors...
____________

pixelicious.at - my little photoblog

Snow Crash
Joined: 4 Apr 09
Posts: 450
Credit: 539,316,349
RAC: 0
Level
Lys
Message 8478 - Posted: 15 Apr 2009 | 22:48:40 UTC
Last modified: 15 Apr 2009 | 22:48:53 UTC

"error rate less than 5%"
Over what period of time, or over what number of results? I had (self-induced) problems when I first started, and it would be a shame if my PC (i7 + GTX295), which crunches 24/7 and normally keeps its queue at 0.75 days, were not considered "reliable".

Steve

Profile GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 14 Mar 07
Posts: 1957
Credit: 629,356
RAC: 0
Level
Gly
Message 8493 - Posted: 16 Apr 2009 | 12:23:04 UTC - in response to Message 8478.

This is the standard BOINC averaging of results.
I am not sure over how long a period the averages are computed.

gdf

Snow Crash
Joined: 4 Apr 09
Posts: 450
Credit: 539,316,349
RAC: 0
Level
Lys
Message 8510 - Posted: 16 Apr 2009 | 22:00:06 UTC - in response to Message 8493.

I will try to look around before I post next time :-)

AdaptiveReplication

BOINC maintains an estimate E(H) of host H's recent error rate. This is maintained as follows:

It is initialized to 0.1
It is multiplied by 0.95 when H reports a correct (replicated) result.
It is incremented by 0.1 when H reports an incorrect (replicated) result.
Thus, it takes a long time to earn a good reputation and a short time to lose it.
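
The update rule quoted above can be sketched in a few lines of Python. This is only an illustration of the three rules as stated, not BOINC's actual implementation; the function names and the list-of-booleans input format are assumptions made for the example.

```python
INITIAL_ERROR_RATE = 0.1   # E(H) starts here
SUCCESS_FACTOR = 0.95      # multiply by this on a correct result
ERROR_INCREMENT = 0.1      # add this on an incorrect result


def update_error_rate(rate: float, correct: bool) -> float:
    """Apply one result to the current error-rate estimate."""
    return rate * SUCCESS_FACTOR if correct else rate + ERROR_INCREMENT


def replay(results) -> float:
    """Replay a host's history (True = correct, False = incorrect), oldest first."""
    rate = INITIAL_ERROR_RATE
    for correct in results:
        rate = update_error_rate(rate, correct)
    return rate


# Ten good results, one bad one, then ten more good results
history = [True] * 10 + [False] + [True] * 10
print(f"{replay(history):.4f}")
```

Because a success only shaves 5% off the current estimate while an error adds a flat 0.1, a single failure undoes the effect of many successes.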

Snow Crash
Joined: 4 Apr 09
Posts: 450
Credit: 539,316,349
RAC: 0
Level
Lys
Message 8511 - Posted: 16 Apr 2009 | 22:16:41 UTC - in response to Message 8510.
Last modified: 16 Apr 2009 | 22:17:27 UTC

If anyone is interested, I just built a quick spreadsheet into which I copied all of my completed tasks and added the *reliability* formula ...


1. Paste the tasks table header starting at A1.
2. Clean these up so you can start pasting the actual results in A2.
3. In K1, enter 0.1.
4. Enter the following formula in K2 (a client error adds 0.1, per the BOINC rule):
=IF(F2="Success",K1*0.95,IF(F2="Client error",K1+0.1,K1))
5. Copy K2 down through all of the rows that have data.

I just make it at 4.95% :-)

Steve

Profile GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 14 Mar 07
Posts: 1957
Credit: 629,356
RAC: 0
Level
Gly
Message 8514 - Posted: 17 Apr 2009 | 8:13:20 UTC - in response to Message 8511.

We will increase the threshold to just below 0.1, probably 0.09.
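
To get a feel for what this change means for a brand new host, here is a small Python sketch. It assumes the BOINC rule quoted earlier in the thread (start at 0.1, multiply by 0.95 per correct result) and that a host qualifies as soon as its estimate drops strictly below the threshold; the function name is hypothetical.

```python
def successes_needed(start: float, threshold: float) -> int:
    """Count consecutive correct results until the rate drops below threshold."""
    rate, count = start, 0
    while rate >= threshold:
        rate *= 0.95   # one more correct result
        count += 1
    return count


# New host (rate 0.1) with the original 5% threshold
print(successes_needed(0.1, 0.05))   # 14
# New host with the proposed 0.09 threshold
print(successes_needed(0.1, 0.09))   # 3
```

Under these assumptions, the higher threshold lets a new host qualify after a handful of good results instead of more than a dozen.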

Profile Zydor
Joined: 8 Feb 09
Posts: 252
Credit: 1,309,451
RAC: 0
Level
Ala
Message 8516 - Posted: 17 Apr 2009 | 8:39:15 UTC - in response to Message 8514.
Last modified: 17 Apr 2009 | 8:43:25 UTC

A minor issue in the great scheme of things: what will happen if there is a repeat of the bad batch of WUs from 21 Mar? On that day a bad batch was let loose; it was swiftly detected and recalled, and held WUs from it were marked "cancelled by server".

However, those that got totalled prior to the recall were marked "Client Error, Compute Error". It happened in seconds, and those with a queue of around 3, 4, 5 or more would have had all of them so marked in seconds flat as the GPU worked through the held ones. In my case it was only two before the recall. Under the new rules that would mean I would have to submit 28 or so "good" ones to be "reliable" once more. In my case that is no real biggie, as it would only take around 9 days or so; not exactly "the end of life on Planet Earth as we know it".

However, for those with a bigger queue it could take much longer, especially if they refilled with another 5 or 6 before the recall and those got totalled as well. Life on Planet Earth could get a little shaky :)

If it's not already factored into the code for this, the issue should be reviewed, possibly with some kind of automatic labelling of a bad batch on recall, so that any calculation of the "reliability" factor ignores WUs from the batch that went bad prior to the recall.

It's a rare event for sure, but it has potential for grief if it occurred with mega crunchers and larger queues.

Regards
Zy
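
The estimate of roughly 28 good results can be reproduced from the BOINC rule quoted earlier in the thread (add 0.1 per error, multiply by 0.95 per success). The Python sketch below uses the closed form of that rule. It assumes the host's estimate was close to 0 before the bad batch and that the 5% threshold applies; the function name and its parameters are hypothetical.

```python
import math


def successes_to_recover(errors: int, threshold: float = 0.05,
                         start: float = 0.0) -> int:
    """Correct results needed to get back below threshold after `errors` failures."""
    rate = start + 0.1 * errors          # each error adds 0.1
    if rate < threshold:
        return 0
    # Smallest n with rate * 0.95**n < threshold
    return math.floor(math.log(threshold / rate) / math.log(0.95)) + 1


print(successes_to_recover(2))   # 28
print(successes_to_recover(5))   # a larger queue caught by the same bad batch
```

Passing `threshold=0.09` shows how much a higher threshold would shorten the recovery under the same assumptions.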

Profile Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Level
Val
Message 8530 - Posted: 17 Apr 2009 | 15:55:27 UTC

Zydor,

The "good" news is that the queues are small regardless: I have 4 in flight and 4 in queue, and that is the most I can have. The bad news, and you are right, is that if there is a poisoned batch I could get a slew of them in short order and, with 4 GPUs munching on them, go through my daily quota in a very short time. (My daily norm is 15-16 a day on that system.)

In the past, if the staff were watching and the participant noticed, they did occasionally reset people's error rates so that they could get work again that day ... but that is hit or miss...

TomaszPawel
Joined: 18 Aug 08
Posts: 121
Credit: 59,836,411
RAC: 0
Level
Thr
Message 8802 - Posted: 23 Apr 2009 | 20:56:00 UTC - in response to Message 7135.
Last modified: 23 Apr 2009 | 21:04:40 UTC

Hi!

I wonder whether GPUGRID scheduling is somehow connected with other projects, e.g. Rosetta@home.

See this post about my problems:

http://boinc.bakerlab.org/rosetta/forum_thread.php?id=4841

Profile Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Level
Val
Message 8805 - Posted: 23 Apr 2009 | 21:23:55 UTC - in response to Message 8802.

Hi!

I wonder whether GPUGRID scheduling is somehow connected with other projects, e.g. Rosetta@home.

See this post about my problems:

http://boinc.bakerlab.org/rosetta/forum_thread.php?id=4841

No it isn't ... :)

Hi ...

I have been reporting your troubles and we are working on it. I am going to post a note over there ...
