Message boards : Number crunching : Scheduling configurations updated

Author Message
ignasi
Joined: 10 Apr 08
Posts: 254
Credit: 16,836,000
RAC: 0
Message 7135 - Posted: 3 Mar 2009 | 16:32:35 UTC

We have modified some scheduling parameters to send retries for failed WUs to "reliable hosts".
Let us know if you see any problem with the scheduler.

thanks,
ignasi

Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Message 7161 - Posted: 4 Mar 2009 | 10:23:57 UTC - in response to Message 7135.

We have modified some scheduling parameters to send retries for failed WUs to "reliable hosts".
Let us know if you see any problem with the scheduler.


Well, it still won't let me queue up 87 days worth of work ... :)

GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 14 Mar 07
Posts: 1947
Credit: 629,356
RAC: 0
Message 7213 - Posted: 5 Mar 2009 | 9:28:22 UTC - in response to Message 7161.

So far it seems to work.

To let people know: a host with an error rate below 5% that returns WUs within 24 hours on average is classified as RELIABLE. This is approximately 25% of all hosts.

The advantage for these hosts is that they are eligible for all available WUs, so they are unlikely to be left without work, while other hosts receive only the subset of WUs for which a failure is not a problem.

Rationale: when we send out a batch of WUs, it is very important for us to get all of them back. If even a single one is missing, we have to wait before we can perform the analysis.

gdf
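The two criteria GDF describes can be sketched as a small predicate. This is a minimal illustration in Python, assuming the thresholds quoted above (error rate under 5%, average turnaround under 24 hours); the `Host` fields and function name are hypothetical, not the actual BOINC scheduler code:

```python
# Sketch of the "reliable host" test described above. Field and function
# names are hypothetical; the real BOINC scheduler reads comparable
# per-host statistics from its database.

from dataclasses import dataclass

@dataclass
class Host:
    error_rate: float      # recent fraction of failed results, 0.0-1.0
    avg_turnaround: float  # average hours from send to report

def is_reliable(host: Host,
                max_error_rate: float = 0.05,
                max_avg_turnaround_hours: float = 24.0) -> bool:
    """A host qualifies for retries of failed WUs only if it both
    fails rarely and returns work quickly."""
    return (host.error_rate < max_error_rate
            and host.avg_turnaround < max_avg_turnaround_hours)

print(is_reliable(Host(error_rate=0.02, avg_turnaround=12.0)))  # True
print(is_reliable(Host(error_rate=0.02, avg_turnaround=36.0)))  # False
```

Note that both conditions must hold: a host that never errors but takes days to report still does not qualify, which is what uBronan asks about below.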

Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Message 7223 - Posted: 5 Mar 2009 | 21:49:02 UTC

Can this be flagged in the computer data page? It should be one of the public data items ... not only will it let us know how our machines are doing, it can help with debugging them ...

This should be rolled back into the UCB baseline also if you get it working ... all the world wants to know ... :)

uBronan
Joined: 1 Feb 09
Posts: 139
Credit: 575,023
RAC: 0
Message 7230 - Posted: 6 Mar 2009 | 9:24:59 UTC

Does this have to be within 24 hours? In my opinion, slower machines can be more reliable than faster ones.
On my machine I have never had a computation error, other than mistakes caused by unintentionally detaching the BOINC client.

The Uncle B's
Joined: 16 Jan 09
Posts: 1
Credit: 4,893,709
RAC: 0
Message 8021 - Posted: 31 Mar 2009 | 20:26:33 UTC - in response to Message 7230.
Last modified: 31 Mar 2009 | 20:26:58 UTC

So, is this error message associated with this? It's for a brand new machine, with no history:

3/31/2009 12:23:29 PM|GPUGRID|Message from server: No work sent
3/31/2009 12:23:29 PM|GPUGRID|Message from server: (reached daily quota of 8 results)
3/31/2009 12:23:29 PM|GPUGRID|Message from server: (Project has no jobs available)

Stefan Ledwina
Joined: 16 Jul 07
Posts: 464
Credit: 51,279,371
RAC: 0
Message 8023 - Posted: 31 Mar 2009 | 20:54:41 UTC - in response to Message 8021.

So, is this error message associated with this? It's for a brand new machine, with no history:

3/31/2009 12:23:29 PM|GPUGRID|Message from server: No work sent
3/31/2009 12:23:29 PM|GPUGRID|Message from server: (reached daily quota of 8 results)
3/31/2009 12:23:29 PM|GPUGRID|Message from server: (Project has no jobs available)


Do you have a link to the host? Is it this one: http://www.gpugrid.net/results.php?hostid=31246 ? If it is that host - well, it only has computation errors...
____________

pixelicious.at - my little photoblog

Snow Crash
Joined: 4 Apr 09
Posts: 450
Credit: 539,316,349
RAC: 0
Message 8478 - Posted: 15 Apr 2009 | 22:48:40 UTC
Last modified: 15 Apr 2009 | 22:48:53 UTC

"error rate less than 5%"
Over what period of time, or over what number of results? I had problems (self-induced) when I first started, and it would be a shame for my PC (i7 + GTX295), which crunches 24/7 with a queue I normally keep at 0.75 days, not to be considered "reliable".

Steve

GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 14 Mar 07
Posts: 1947
Credit: 629,356
RAC: 0
Message 8493 - Posted: 16 Apr 2009 | 12:23:04 UTC - in response to Message 8478.

This is standard BOINC averaging of results.
I am not sure over how long the average is computed.

gdf

Snow Crash
Joined: 4 Apr 09
Posts: 450
Credit: 539,316,349
RAC: 0
Message 8510 - Posted: 16 Apr 2009 | 22:00:06 UTC - in response to Message 8493.

I will try to look around before I post next time :-)

AdaptiveReplication

BOINC maintains an estimate E(H) of host H's recent error rate. This is maintained as follows:

It is initialized to 0.1
It is multiplied by 0.95 when H reports a correct (replicated) result.
It is incremented by 0.1 when H reports an incorrect (replicated) result.
Thus, it takes a long time to earn a good reputation and a short time to lose it.
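The quoted update rule is easy to simulate. The sketch below starts from the initial estimate of 0.1 and shows both halves of the "long to earn, short to lose" point: how many consecutive correct results it takes to get under the 5% threshold, and how a single error undoes them:

```python
# Simulation of the adaptive-replication estimate E(H) quoted above:
# start at 0.1, multiply by 0.95 on a correct result, add 0.1 on an
# incorrect one.

def update_error_rate(e: float, success: bool) -> float:
    return e * 0.95 if success else e + 0.1

e = 0.1
n = 0
# Count consecutive correct results needed to drop below the 5% threshold.
while e >= 0.05:
    e = update_error_rate(e, success=True)
    n += 1
print(n, round(e, 4))  # 14 successes: 0.1 * 0.95**14 ≈ 0.0488

# A single error adds 0.1, costing far more than one success repairs.
e = update_error_rate(e, success=False)
print(round(e, 4))     # back up to ≈ 0.1488
```

So with the thresholds discussed above, one bad result puts a host back roughly where it started, and it takes two more weeks of daily successes to recover.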

Snow Crash
Joined: 4 Apr 09
Posts: 450
Credit: 539,316,349
RAC: 0
Message 8511 - Posted: 16 Apr 2009 | 22:16:41 UTC - in response to Message 8510.
Last modified: 16 Apr 2009 | 22:17:27 UTC

If anyone is interested, I just built a quick spreadsheet that I copied all of my completed tasks into and added the *reliability* formula ...


1. Paste the tasks table header starting at A1.
2. Clean it up so you can start pasting the actual results in A2.
3. In K1 enter 0.1.
4. Enter the following formula in K2 (per the BOINC rule above, an error adds 0.1):
=IF(F2="Success",K1*0.95,IF(F2="Client error",K1+0.1,K1))
5. Copy K2 down through all of the rows that have data.

I just barely make it, at 4.95% :-)

Steve
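For anyone who prefers a script to a spreadsheet, the same running estimate can be replayed over an exported task list. This is a rough Python equivalent of the calculation above, following the BOINC rule quoted earlier (multiply by 0.95 on success, add 0.1 on a client error); the status strings are assumptions, so match them to whatever your exported task table actually contains:

```python
# Replays the adaptive-replication estimate over a task history,
# oldest result first. Status strings are assumptions based on the
# website's wording, not guaranteed field values.

def replay(statuses, start=0.1):
    e = start
    for status in statuses:
        if status == "Success":
            e *= 0.95
        elif status == "Client error":
            e += 0.1
        # other states (in progress, cancelled, ...) leave e unchanged
    return e

history = ["Success"] * 10 + ["Client error"] + ["Success"] * 20
print(f"{replay(history):.4f}")  # prints 0.0573
```

Note the order matters: paste (or sort) the tasks oldest-first, since the estimate is a running value, not a simple average.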

GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 14 Mar 07
Posts: 1947
Credit: 629,356
RAC: 0
Message 8514 - Posted: 17 Apr 2009 | 8:13:20 UTC - in response to Message 8511.

We will increase it to just below 0.1, probably 0.09.

Zydor
Joined: 8 Feb 09
Posts: 252
Credit: 1,309,451
RAC: 0
Message 8516 - Posted: 17 Apr 2009 | 8:39:15 UTC - in response to Message 8514.
Last modified: 17 Apr 2009 | 8:43:25 UTC

A minor issue in the great scheme of things: what will happen if there is a repeat of the bad batch of WUs issued on 21 Mar? On that day a bad batch was let loose; it was swiftly detected and recalled, and held WUs from it were marked "cancelled by server".

However, those that got totalled before the recall were marked "Client Error, Compute Error". It happened in seconds, and those with a queue of circa 3/4/5 or more would have had all of them so marked in seconds flat as the GPU worked through the held ones. In my case it was only two before the recall. Under the new rules that would mean I would have to submit 28 or so "good" ones to be "reliable" once more. In my case, no real biggie, as that would only take circa 9 days or so - not exactly "the end of life on Planet Earth as we know it".

However, for those with a bigger queue it could be much longer, especially if they did a refill of another 5 or 6 before the recall and those got totalled as well - life on Planet Earth could get a little shaky :)

If it's not already factored into the coding, the issue should be reviewed, possibly with some kind of automatic labelling of a bad batch on recall, such that any calculation of the "reliability" factor ignores WUs from the bad batch issued prior to the recall.

It's a rare event for sure, but it has potential for grief if it occurred with mega crunchers and larger queues.

Regards
Zy
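Zydor's suggestion - tag a recalled batch and skip its results when computing reliability - can be sketched in a few lines. Everything here (batch IDs, status strings, function name) is hypothetical, not actual BOINC server code; it only illustrates the proposed exclusion:

```python
# Sketch of the proposed bad-batch exclusion: results belonging to a
# recalled batch are skipped when replaying the error-rate estimate,
# so a server-side mistake does not cost hosts their "reliable" status.

def error_rate(results, recalled_batches, start=0.1):
    """results: iterable of (batch_id, status), oldest first."""
    e = start
    for batch_id, status in results:
        if batch_id in recalled_batches:
            continue  # errors from a known-bad batch don't count
        if status == "Success":
            e *= 0.95
        elif status == "Client error":
            e += 0.1
    return e

history = ([(1, "Success")] * 5
           + [(2, "Client error")] * 3   # the recalled batch
           + [(3, "Success")] * 5)
print(round(error_rate(history, recalled_batches={2}), 4))    # stays low
print(round(error_rate(history, recalled_batches=set()), 4))  # spikes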

Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Message 8530 - Posted: 17 Apr 2009 | 15:55:27 UTC

Zydor,

The "good" news is that the queues are small regardless: I have 4 in flight and 4 in queue, and that is the most I can have. The bad news, and you are right, is that if there is a poisoned batch I could get a slew of them in short order, and with 4 GPUs munching on them I would go through my daily quota in a very short time. (My daily norm is 15-16 on that system.)

In the past, if the staff were watching and the participant noticed, they did do occasional resets of people's error rates so that they could get work again that day ... but that is hit or miss ...

TomaszPawel
Joined: 18 Aug 08
Posts: 121
Credit: 59,836,411
RAC: 0
Message 8802 - Posted: 23 Apr 2009 | 20:56:00 UTC - in response to Message 7135.
Last modified: 23 Apr 2009 | 21:04:40 UTC

Hi!

I wonder if GPUGRID scheduling is somehow connected with other projects, e.g. Rosetta@home.

See this post about my problems:

http://boinc.bakerlab.org/rosetta/forum_thread.php?id=4841

Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Message 8805 - Posted: 23 Apr 2009 | 21:23:55 UTC - in response to Message 8802.

Hi!

I wonder if GPUGRID scheduling is somehow connected with other projects, e.g. Rosetta@home.

See this post about my problems:

http://boinc.bakerlab.org/rosetta/forum_thread.php?id=4841

No it isn't ... :)

Hi ...

I have been reporting your troubles and we are working on it. I am going to post a note over there ...
