Message boards : Number crunching : Phantom WUs

Zydor
Message 12259 - Posted: 1 Sep 2009 | 20:15:03 UTC

I've ended up with two phantom WUs, and with one task running I don't want to detach just yet. I am the sole cruncher for both, so it would be good if they could be reissued to prevent any delays. They are:

http://www.gpugrid.net/workunit.php?wuid=738974

and

http://www.gpugrid.net/workunit.php?wuid=740737

(currently crunching 742779)

Regards
Zy

ExtraTerrestrial Apes
Message 12281 - Posted: 2 Sep 2009 | 21:06:17 UTC - in response to Message 12259.

What do you mean by Phantom WUs? They're in the database and apparently they're on your system, as you're crunching them.

MrS
____________
Scanning for our furry friends since Jan 2002

Zydor
Message 12309 - Posted: 3 Sep 2009 | 13:43:46 UTC - in response to Message 12281.
Last modified: 3 Sep 2009 | 14:29:12 UTC

Phantom WUs are a common occurrence in BOINC due to its design. I get a few from time to time; most people do. Generally they get ignored and the WU times out to be reallocated by the server, which is fine, but wastes a lot of time - e.g. CPDN with timeouts of 12 months+ :)

This recently cropped up at Aqua, where the admins were not aware of the nature of phantoms. Richard kindly helps out there, as they are going through a steep learning curve now that their apps are getting complex. He posted the extract below to explain it to them, and I have copied it here, as it's a good, succinct summary of the issue and what to do about it.

============= Richard's Aqua Post - Phantoms ==========

Time for another little tutorial, I think.

This problem of "ghost", "phantom" or just plain "missing" tasks happens on all projects from time to time. The assumption is that a user's computer requests work: the server hears that request, selects a task, marks it as 'in progress' in the database, and sends a message back to the user with all the details. But for some reason the user's computer misses that message - there's a glitch on the internet, the computer is switched off or rebooted at the critical moment, or something like that. Unlike most BOINC communications, these messages don't seem to require an active acknowledgement, so if they get missed, they stay missed.

Straight out of the box (which as we know is the way boinc_admin and neo like to run things round here), that's the end of the story. So there will be an unknown number of ghosts in the database, and unless somebody reports them here so they can be exorcised manually, they'll hang around haunting the place until they reach their deadline (early October, for recent tasks). Then they'll get re-issued: not good for draining this batch of tasks out of the system. [insert by Zy: Aqua are currently running down the queue ready for a new design of WU]

Fortunately, as with most things, BOINC provides a solution in the standard code: all it requires is configuration. [Now where have I heard that before?].

The magic incantation is <resend_lost_results>1</resend_lost_results> in the server's config.xml file. As noted in the Project configuration documentation, this causes extra work for the database server, but I can't imagine the server is overloaded at the moment: it would be a good moment to learn about this setting (and whether the AQUA server can cope with the extra load), so that the ghosts can be exorcised automatically by the machine, and not take up valuable staff time cancelling each one manually.

===============

In my case, at the time of download I was having issues with a nasty piece of spyware that locked my whole system. Hence I fell into the classic scenario Richard described - you think I got them, when in fact I didn't: they show in my task list, but do not appear in BAM at my end. Similar can happen if a user has a disk crash and loses held WUs on a reformat - they are marked as held in the DB at your end, but gone "poof" at the user end.

Most times, however, phantoms occur due to the incomplete way the BOINC server deals with the RPC commands, as Richard describes.

A classic way to treat phantoms is to run down a project's queue in BAM until it's dry for a PC, then detach and reattach the PC. This triggers BOINC at your end to reallocate any uncrunched tasks, which of course will include the phantom entries. Right now I have some GPUGRID ones running, so I don't want to detach/reattach, as I would end up killing off a running WU.

It's this detach/reattach solution, used by lots of BOINCers, that deals with the issue - but it can mask that the issue exists in the first place. Richard's explanation shows how a project can proactively deal with it, rather than just leaving it to the cruncher - as many of the latter will not bother, and will just leave them to time out.

Regards
Zy

Zydor
Message 12317 - Posted: 3 Sep 2009 | 16:14:14 UTC - in response to Message 12309.
Last modified: 3 Sep 2009 | 16:15:08 UTC

Sorry, I forgot the last bit .....

So, at the end of the day, what's in it for GPUGRID? Those phantom WUs - and I guarantee they will be there in unknown numbers if this has not been addressed before - will lurk until timeout. You think they have been issued - they have not. Here that means they stay in [whoever's] task list until timeout - currently 5 days.

You want WUs back, ideally, within 24 hours, but there are plenty in there that will never come back until they time out 5 days later. Fix this, and you will increase the rate at which WUs are returned, because you remove the five-day delay of zero crunching and get more work pieces back to the scientists in a shorter space of time.

Bottom line - you'll get more done in less time.

Regards
Zy

ExtraTerrestrial Apes
Message 12333 - Posted: 3 Sep 2009 | 20:41:50 UTC

Thanks for the explanation! The underlying reason really sounds a bit stupid, though: no "hey server, I heard you, checksum's xxx" but on the other hand the 6.6 clients keep asking all projects for CPU and CUDA work, regardless of what the project offers. Almost schizophrenic.. :p

MrS
____________
Scanning for our furry friends since Jan 2002

Richard Haselgrove
Message 12338 - Posted: 3 Sep 2009 | 21:37:47 UTC

The devil is in the detail - that "extra work for the database server".

Every time a BOINC client contacts a BOINC server, it sends - almost as an afterthought - a complete list of every 'task in progress' on the client. The process invoked by resend_lost_results involves checking that comprehensive list against the comprehensive list in the database of all tasks issued to the host.

For AQUA, that's ~3x3: for GPUGrid, ~6x6. Neither should cause much angst.

But for projects like SETI, with short CUDA tasks and generous deadlines, that can be several thousand x several thousand. The SETI servers are under stress at the best of times: resend_lost_results has been tried, and it breaks them.
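
To picture the check, here is a rough Python sketch with made-up task names - not the actual scheduler code, just the shape of the comparison, and why the cost grows with the size of both lists:

```python
# Rough sketch (made-up names, not the real BOINC scheduler code):
# compare what the database thinks the host holds with what the host reports.

def find_lost_tasks(db_in_progress: set[str], client_reported: set[str]) -> set[str]:
    """Tasks the DB says are in progress on this host, but which it never received."""
    return db_in_progress - client_reported

# The DB has six tasks marked 'in progress' for the host; the client only
# reports four of them. The other two are ghosts and get resent.
db_side = {"t1", "t2", "t3", "t4", "t5", "t6"}
client_side = {"t3", "t4", "t5", "t6"}

for task in sorted(find_lost_tasks(db_side, client_side)):
    print("resend lost result:", task)
```

The comparison itself is trivial; the expense is fetching and checking both comprehensive lists on every single scheduler contact when each runs to thousands of entries.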

So it's not unreasonable that the resend_lost_results code exists, but is disabled by default: server admins are at liberty to enable it as and when they are confident that their server hardware is up to the task.

It all goes to prove that RTFM applies to admins as much as punters!

ExtraTerrestrial Apes
Message 12350 - Posted: 3 Sep 2009 | 22:26:38 UTC - in response to Message 12338.

Well, this checking does sound impractical. But what about something like this:

- client: sends work request
- server: chooses and sends work, but does not yet write anything back to the database
- client: acknowledges that it's going to receive the files
- server: may write data back to the database (WU issued)
- client: acknowledges that it got the files and the checksums are OK
- server: sends one last handshake; now both know that nothing was lost
- server: may now write data back to the database, if not done before or if another update is needed

In case communication is lost, the server would keep the pending assignment in memory for some time (not sure if it could or should actively try to contact the client after a timeout... probably not) and the client would keep asking for some time ("I just asked for work, didn't you want to reply?"). That way the load on the database could probably be kept similar and phantoms could be avoided (the server knows a WU got lost if it doesn't hear back from the client), at the cost of some extra network traffic. If many file transfers fail, one would have to be careful not to overload server main memory with too many requests/files kept in flight; the oldest ones could be flushed to disk/database at some point.
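
Just to make the idea concrete, a rough sketch in Python (purely hypothetical names - this is not a patch for the real server, just the hand-off I have in mind):

```python
import time

# Pending assignments live only in memory until the client confirms receipt;
# the database is written exactly once, at commit time. All names are made up.
PENDING_TTL = 600          # seconds to keep an unacknowledged assignment around
pending = {}               # request_id -> (hostid, task_name, timestamp)

def assign_work(request_id, hostid, task_name):
    """Steps 1-2: choose work and remember it in memory only."""
    pending[request_id] = (hostid, task_name, time.time())
    return task_name       # details sent to the client

def client_ack(request_id):
    """Steps 3-6: client confirms files and checksums; now commit to the DB."""
    hostid, task_name, _ = pending.pop(request_id)
    db_mark_in_progress(hostid, task_name)   # the single DB write

def expire_unacknowledged():
    """No ack arrived: the task was never really issued, so release it - no ghost."""
    now = time.time()
    for rid, (_, task_name, t0) in list(pending.items()):
        if now - t0 > PENDING_TTL:
            del pending[rid]
            release_task(task_name)

def db_mark_in_progress(hostid, task_name):
    print(f"DB: {task_name} in progress on host {hostid}")

def release_task(task_name):
    print(f"released {task_name} for reissue")

# Example flow
assign_work("req-1", hostid=42, task_name="wu_hypothetical_0")
client_ack("req-1")        # commit happens here
expire_unacknowledged()    # nothing left pending, nothing to release
```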

Am I assuming things to be too simple here? Or would the benefit outweigh the cost (both development effort and server resources)?

MrS
____________
Scanning for our furry friends since Jan 2002

Richard Haselgrove
Message 12351 - Posted: 3 Sep 2009 | 23:06:16 UTC - in response to Message 12350.

You'll need to run that one past David A or Rom W.

I was involved in a similar discussion at SETI about three years ago, and (partly) as a result Rom wrote a much-quoted blog post, The evils of 'Returning Results Immediately': the point being the amount of time it takes the scheduler DB process to open the database and make the necessary transactions on each client contact, with the emphasis on the 'opening'.

Your process involves three separate server database interactions: because of the asynchronous timing, that possibly involves three separate 'open' phases, which would fall foul of Rom's concern about overheads.

Since then, I've heard much less about database loading as the primary BOINC constraint: it's mostly been disk loading, network throughput and (most recently) log volume latency. Maybe modern hardware or improvements in MySQL have supplanted the DB overhead, and your scheme could be implemented now. But you're right: it's anomalous that this crucial transaction is one of the few that isn't protected by a robust transactional handshake.

Zydor
Message 12355 - Posted: 4 Sep 2009 | 11:48:19 UTC - in response to Message 12351.
Last modified: 4 Sep 2009 | 11:54:55 UTC

Guys - just had a thought [ happens from time to time rofl:) ]

I well appreciate the transactional process you have described above, and possible solutions - the latter being predicated on making sure the transaction goes ahead correctly in the first place, a natural and proper thought process.

How about attacking it from another angle - one that could maybe be implemented at project level, outside the BOINC server code. It's not a "solution", more a workaround, but I acknowledge the problem of getting the server code to be all things to all men when so many different needs from disparate projects are being served by the same infrastructure.

In the classic transaction solutions being discussed, the critical issue always comes back to processing the information on the server database and the load it causes. Flip the principle: put the DB load on the distributed client. Pick a quiet(ish) time - all projects have a time of day when there is usually a downturn in activity - send off a list of [allegedly] held WUs to the clients, then get BAM to check and respond whether it agrees. Whilst there will be an element of DB activity at the server end compiling the queries, it's far, far less than running what is in effect a many-to-many join SQL query [the thousands x thousands scenario at SETI, for example]; it's "just" a straightforward "Select From .... For [user]", far less intensive on the server, even acknowledging the huge user numbers at projects like SETI.

That way the bottom-line problem of validation and DB search load is handed to the client, with far less load on the server? As you are aware, I don't have the lower-level background you guys do .... so there is probably a "no no" lurking in there somewhere :) - however, I can't get the thought out of my skull, so I thought I would let it go and see if it flies.
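
To sketch the kind of client-side comparison I'm picturing (made-up names only - I'm not claiming this is how BAM or the client actually behave): the server sends its list of allegedly held WUs, the client does the comparing, and only the mismatches come back.

```python
# Rough sketch of the idea, with hypothetical names: the heavy comparison
# runs on the client; the server only has to compile its own list per host.

def check_held_wus(server_list, local_tasks):
    """Client-side comparison: report back only the mismatches."""
    server_side, local_side = set(server_list), set(local_tasks)
    return {
        "missing_here": sorted(server_side - local_side),        # the phantoms
        "unknown_to_server": sorted(local_side - server_side),
    }

report = check_held_wus(
    server_list=["wu_a", "wu_b", "wu_c"],   # what the project DB says we hold
    local_tasks=["wu_c"],                   # what is actually on this PC
)
print(report)   # {'missing_here': ['wu_a', 'wu_b'], 'unknown_to_server': []}
```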

Regards
Zy

ExtraTerrestrial Apes
Message 12363 - Posted: 4 Sep 2009 | 21:02:13 UTC

Zydor, it's always a good idea to try to think about things in reverse, or from the opposite direction. But I think I've got something to spoil your suggestion ;)

The server can't decide to send anything to the clients on its own. It can't know their current IPs, or which of them are connected to the net at all (think of a poor modem guy). What you can do is ask the clients which WUs they currently have, when they contact you. Which is just what they're already doing, as described in Richard's first post.

Richard, thanks for the further explanations. I think I'll leave it at that - as a participant I do not really have to worry about the phantoms, as they harm the server and the project, but not me. And I'd rather have the devs put their focus onto the co-processor, debt and scheduler issues.

MrS
____________
Scanning for our furry friends since Jan 2002

Richard Haselgrove
Message 12367 - Posted: 4 Sep 2009 | 21:29:38 UTC - in response to Message 12363.

The servers know the client IP address at last contact (it's shown on your host details page, though not visible if you're not logged in on the associated user account) - and very useful that is too, for tracing stolen laptops.

But it's not enough. I have seven - belay that, six (I lost a fan on my faithful Celeron 400MHz MMX, and retired it after 10 years) - hosts behind a single DSL router, so all with the same IP address. I'm not opening the firewall on that router for anybody, not even BOINC. So servers "calling out" is a complete no-no.

But all is not lost. This project already uses the "call me again in a few hours, to confirm you're still alive" protocol. Normally, that just results in a NOP, or occasionally "BTW, I've finished a task - you may as well have the report while we're chatting". That would be a good opportunity for Zy's extension: "Things are quiet round here at the moment. Let's compare notes about what we each think you're up to at the moment."

Come to think of it, that's exactly what resend_lost_results does at the moment: the trouble is, it's either done on every single contact, crashing the server, or NEVER. Maybe if, when RLR was enabled, it did a full check every (randomised) ~nth scheduler contact (n configurable from 100% to 0.01%, to suit project needs and capabilities), there would be a periodic 'catchup' without massive overheads?
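
Roughly, in Python (the knob name is invented - it isn't an existing server setting - and the real decision would live in the scheduler, not a script):

```python
import random

# Hypothetical knob: the fraction of scheduler contacts on which the full
# resend_lost_results comparison is run (1.0 = every contact, 0.0001 = 0.01%).
RLR_CHECK_FRACTION = 0.05

def handle_scheduler_contact(hostid):
    # ... normal scheduler reply handling would go here ...
    # Only a randomised ~nth contact pays the full list-comparison cost.
    if random.random() < RLR_CHECK_FRACTION:
        run_resend_lost_results_check(hostid)

def run_resend_lost_results_check(hostid):
    print(f"full lost-result catch-up for host {hostid}")

# Over many contacts, roughly 1 in 20 triggers the catch-up.
for _ in range(100):
    handle_scheduler_contact(hostid=42)
```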

ExtraTerrestrial Apes
Message 12379 - Posted: 5 Sep 2009 | 9:05:46 UTC - in response to Message 12367.

Ah, occasionally I've seen cases where scheduler requests were marked as "requested by server" - that must be these. This sounds like a very good idea. One could also try to couple it automatically with server and DB load... if such performance indicators are accessible without much overhead. In case of low load the check can be done more often; if the server is already screaming for a break, just leave it alone.
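
Something like this, perhaps (made-up names and thresholds, and assuming a cheap 0-to-1 load indicator is available at all):

```python
def adaptive_check_fraction(db_load, base_fraction=0.05, max_load=0.8):
    """Scale the randomised catch-up frequency down as DB load rises.

    db_load is a hypothetical cheap load indicator in the range 0..1;
    above max_load the catch-up check is skipped entirely.
    """
    if db_load >= max_load:
        return 0.0              # server screaming for a break: leave it alone
    return base_fraction * (1.0 - db_load / max_load)

# Quiet server -> check often; busy server -> back off.
for load in (0.1, 0.5, 0.9):
    print(load, round(adaptive_check_fraction(load), 4))
```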

MrS
____________
Scanning for our furry friends since Jan 2002

Paul D. Buck
Message 12473 - Posted: 12 Sep 2009 | 22:09:50 UTC

One of the fundamental problems is that, though the design intent is for small projects with low budgets, much of the implementation revolves around SETI@Home's issues and constraints.

This is just one more example of how the need to shave corners for SaH causes issues for other projects.

The defaults should be robust protocols, with options for turning features down or off if and as they might be needed for performance reasons.

Though I will note that many of the performance problems of the BOINC database design are self-inflicted, and that all suggestions to rectify those issues are routinely ignored.
