
Message boards : Graphics cards (GPUs) : can't get more than a single WU?

Dj Ninja
Message 9749 - Posted: 14 May 2009 | 12:06:58 UTC

Hi there!

I have set up a small machine to test GPU computing. It's based on a slow Sempron processor and an old power-saving mainboard, and has nothing more to do than keep the G92 running.

Unfortunately this Sempron has only one core, so the machine can't hold more than one WU. This sucks, because while a 28 MB result file is uploading, the G92 sleeps for some 30 minutes. Buffering work for the times the project is down for maintenance is also impossible.

Why is there a one-WU-per-CPU limit at all, when CPUs have nothing to do with GPUs? Even with two or more GPUs in one machine, a single-core processor should be strong enough to support at least two of them... so a per-CPU limit of 1 makes no sense to me. In my opinion such a limit should be GPU-based (not CPU-based) and set to at least one more than the number of GPUs in the system.

Greetings from Berlin, Germany.

ExtraTerrestrial Apes
Message 9781 - Posted: 14 May 2009 | 20:42:42 UTC - in response to Message 9749.

a per-CPU limit of 1 makes no sense to me


The project agrees but so far has not been able to tell the BOINC server software about this.

MrS
____________
Scanning for our furry friends since Jan 2002

Paul D. Buck
Message 9793 - Posted: 15 May 2009 | 9:13:53 UTC - in response to Message 9781.

a per-CPU limit of 1 makes no sense to me


The project agrees but so far has not been able to tell the BOINC server software about this.

MrS

Do you mean that standing in front of the server and waggling your finger did not work?

What a shock!

ExtraTerrestrial Apes
Message 9810 - Posted: 15 May 2009 | 20:23:50 UTC - in response to Message 9793.

As far as I'm informed they also tried excessive shouting, to no avail. Scandalous!

MrS
____________
Scanning for our furry friends since Jan 2002

Dj Ninja
Message 9822 - Posted: 16 May 2009 | 2:58:33 UTC

Can't they set a per-CPU limit of at least two WUs?

That won't fill the needs of SLI-powered machines, but it should help a little. I'm planning to build a crunching machine using two G92 cards, but I don't want to spend money on a quad-core processor when the CPU load is nearly zero. Such a machine will need three or four workunits to run uninterrupted, even while uploading these huge result files (I once had a file of over 50 MB; if files that big are normal, uploads take quite a while).

My suggestion would be to try a per-CPU limit of two, combined with a longer delay between scheduler requests when the client is not getting new work (because the limit has been reached). At the moment this delay is very short, resulting in a large number of unnecessary scheduler requests and needless server/bandwidth load.

ExtraTerrestrial Apes
Message 9831 - Posted: 16 May 2009 | 9:40:09 UTC - in response to Message 9822.

This exponential timeout is a standard BOINC setting, so it would likely have to be changed by Berkeley. I don't think it quite makes sense as it is: if you don't get WUs, and after a second request one minute later you still don't get work, it's very likely that you won't get work anytime soon. On the other hand, the timeouts shouldn't reach obscene values of more than 24 h (which happens), especially when the machine is dry. Good timeout values could be: 1 min, 5 min, 10 min, 30 min, and from there a constant 1 h. Of course it would be even better if the BOINC client understood why the server is not issuing new work: if the server says the 24 h limit has been reached, the client should know when it makes sense to ask again; if the maximum number of concurrent WUs has been reached, the client should know that it has to report a finished WU to get new ones. That would save us quite some traffic, headaches and downtime.
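
In rough C++, that schedule would be something like this (just a sketch; the function and names are mine, not actual BOINC client code):

    #include <cstddef>

    // Proposed back-off: escalate quickly after failed work requests,
    // then settle at a constant one-hour retry instead of growing
    // unbounded past 24 hours.
    int request_delay_seconds(std::size_t consecutive_failures) {
        static const int schedule[] = {60, 300, 600, 1800}; // 1, 5, 10, 30 min
        const std::size_t n = sizeof(schedule) / sizeof(schedule[0]);
        if (consecutive_failures < n) return schedule[consecutive_failures];
        return 3600; // constant 1 h from here on
    }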

Paul, would you like to suggest this to the alpha mailing list? :)

Regarding your other (very valid) question: historically, the BOINC client was not smart enough to know how much GPU work it was going to do in a certain amount of time, i.e. how much work it needed to keep the cache filled but not overloaded. This was especially bad for users with a quad core and slower cards: they'd always get 4 WUs and had to abort 3 or 4 of them manually, if they did that kind of micromanagement. So the limit was set to 1 WU per CPU, as most heavy-duty crunchers have quads anyway, to reduce the hassle created by overloaded machines.

Recent BOINC versions have a better understanding of the fact that CPUs are not the same as GPUs, but it's far from perfect. So I'm not sure... would it work now? Could we tell a client >= 6.6.28 to "keep a 1 day cache" and see this obeyed, regardless of the resource share (which mixes CPU and GPU), GPU speed and possible GPU mixes?

I'm a little sceptical. The 6.6.x clients still have other issues, some of which are quite uncomfortable. And I know older clients would badly overload slower GPUs if the resource share wasn't set properly. That may very well be a cure worse than the problem.

MrS
____________
Scanning for our furry friends since Jan 2002

Paul D. Buck
Message 9838 - Posted: 16 May 2009 | 11:17:06 UTC - in response to Message 9831.

Paul, would you like to suggest this to the alpha mailing list? :)


Um, well, yes and no ... :)

Firstly, none of my suggestions seems to be taken too kindly... but more importantly, we are mixing metaphors and mechanisms here. I am a little scatter-minded at the moment, so bear with me and ask questions if what I say does not make sense.

Firstly, the radical changes made in the 6.6.x series are not even close to being debugged. The good news is that, at least for some long-standing problems, we are finding out what they are and pinpointing some causes. For example, just recently I found a long-standing issue which had been reactivated by a new project, DD, which builds a huge directory structure in the slot while running a task; at the end it has to delete 4,000 files, which can cause a long pause for the BOINC client, which in turn causes the tasks it is running to fold up and die.

So, progress of sorts. But it is slow ...

As to work fetch and scheduling, I will admit that we have focused on the resource scheduling issues, as they seem more critical. There are still problems, like the one above: the tasks die silently and are then brought back to life silently ... the only time you know of trouble is when you see a lot of them dying with "no heartbeat" messages or too many restarts ...

And the long-running task issue. Another sore point, and I don't know that it has been solved for sure. We think we have it ... but that is not at all clear ... I think I have seen it again ... in 6.6.25 or 28 ... Richard has seen some other CUDA scheduling issues that I have not yet tried to dig into ...

OK, all that is prelude to the question you asked about the exponential back-off ... and work fetch ... well, it is on the list ... there are a number of issues with work fetch, not least the poor mechanisms on projects that limit the number of tasks you can get, and the question of what to do when those limits are hit or the work is not steady. You can see more bad examples of what happens with this on MW.

I suggested we think about another mechanism than resource share, to little effect ... only time will tell if I can make progress ...

However, the attitude at UCB is likely to be: if you don't get work, exponential back-off is a good way to go, and so what if you run dry ...

I guess what I am trying to say is that this is kinda on my list, guys ... but Sisyphus had it easy ... all he had to do was push some rocks ... he never had to deal with UCB ...

ExtraTerrestrial Apes
Message 9851 - Posted: 16 May 2009 | 12:53:25 UTC - in response to Message 9838.

You're right, they have more pressing issues at hand, so any suggestion like "just change the back-off formula" will likely be ignored.

I think they're wasting their time trying to debug 6.6.x. The system is fundamentally flawed; they need a clean differentiation between the different co-processors. Otherwise all they can do is try to catch strange errors which are "impossible". But who am I telling this to ;)

I suggested we think about another mechanism than resource share


I think it may be easy: set the resource share per computing resource. If a project supports multiple resources, let users set the share for each one individually. And drop the concept of debts, or at least give them such a low priority that they don't cause the scheduler to make strange moves like running dry. Really, debts shouldn't be that important!
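
As a rough sketch of what I mean (hypothetical names, not the actual client structures):

    #include <map>
    #include <string>
    #include <vector>

    // Hypothetical per-resource shares: each project declares a share
    // for every resource it supports, and the client balances each
    // resource independently -- no mixed CPU/GPU debts.
    struct ProjectShares {
        std::string name;
        std::map<std::string, double> share; // e.g. {"cpu", 100}, {"cuda", 200}
    };

    // Fraction of resource 'res' that project 'p' should receive.
    double resource_fraction(const std::string& res, const ProjectShares& p,
                             const std::vector<ProjectShares>& all) {
        double total = 0;
        for (const auto& q : all) {
            auto it = q.share.find(res);
            if (it != q.share.end()) total += it->second;
        }
        auto mine = p.share.find(res);
        return (mine == p.share.end() || total == 0) ? 0 : mine->second / total;
    }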

MrS
____________
Scanning for our furry friends since Jan 2002

Dj Ninja
Message 9852 - Posted: 16 May 2009 | 12:56:41 UTC
Last modified: 16 May 2009 | 13:00:25 UTC

Maybe someone has a Phenom or an Intel quad for me... it's not really nice to have such options "set by hardware", especially non-GPU-related hardware. What if someone has a GX2 graphics card and a single-core processor? I don't know if your server can do it, but many projects use per-host limits. Can't you set a per-host limit of approx. 4-6 WUs? That should be enough (and not too much) for every machine except "over-the-top" ones running something like four GX2 cards...

Rectilinear Crossing Number uses a per-host limit, for example.

Unfortunately one of your WUs crashed last night on my G92 test machine while it wasn't connected to the internet. The GPU then slept until morning, like all normal GPU-kind, because this machine can't hold more than one WU... very *beep* if you want to keep the thing crunching... it would never have had that sleep if a second workunit had been waiting... :-)

I don't have any further suggestions at this time. I'm writing my own BOINC server for a project, and even now it can handle timeout-after-limit settings. When it's done, maybe it can handle a small set of work-limit settings for every single host that needs them (I have to test how much server load such features eat up). But it's (like almost every project) CPU-based, which makes work scheduling easier (until the BOINC framework is GPU-ready).
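
The per-host check itself is simple; a minimal sketch with a deferral hint in the scheduler reply (all names hypothetical, not actual BOINC scheduler code; I believe newer BOINC server versions also have a max_wus_in_progress option in config.xml for much the same purpose, though I'm not sure how it counts):

    #include <algorithm>

    // Hypothetical per-host work limit on the scheduler side: cap
    // results in progress per host (not per CPU), and tell the client
    // how long to wait before asking again once the cap is hit.
    struct HostState {
        int wus_in_progress; // assigned but not yet reported
    };

    const int MAX_WUS_PER_HOST = 6;     // enough for most multi-GPU boxes
    const int RETRY_DELAY_SECONDS = 3600;

    // How many new results we may still send, and what delay to
    // request from the client if the answer is zero.
    int sendable(const HostState& h, int* delay_hint) {
        int remaining = std::max(0, MAX_WUS_PER_HOST - h.wus_in_progress);
        *delay_hint = (remaining == 0) ? RETRY_DELAY_SECONDS : 0;
        return remaining;
    }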

Paul D. Buck
Message 9857 - Posted: 16 May 2009 | 13:44:54 UTC - in response to Message 9851.

I think they're wasting their time trying to debug 6.6.x. The system is fundamentally flawed; they need a clean differentiation between the different co-processors. Otherwise all they can do is try to catch strange errors which are "impossible". But who am I telling this to ;)

Sadly, it is not quite that simple either. For one thing, what about mixed-mode tasks?

The Lattice Project is planning tasks that will use a high amount of CPU while also using a GPU at the same time... SaH is moving in that direction too, though right now they are working on a fall-back option so that tasks which don't run well on the GPU can be run on the CPU instead... For example, many people are killing VLAR tasks rather than running them on CUDA, because they take about 4-6 times longer than other "normal" tasks.

But that bollixes up the idea of separate shares per resource per project.

ExtraTerrestrial Apes
Message 9859 - Posted: 16 May 2009 | 14:15:15 UTC - in response to Message 9857.

Well, factor the "1.0 CUDA, 0.03 CPU" in, and ignore the 0.x CPU as far as resource share is concerned. If a project mixes tasks with high and low CPU usage, it will be difficult to maintain the correct cache size... but the current system has no way of dealing with such a situation either.
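
In code, the idea boils down to something like this (hypothetical names; a sketch of the accounting, not the real client):

    #include <cmath>

    // A task declares how much of each resource it uses, e.g. one CUDA
    // device plus 0.03 of a CPU core. For resource share and cache
    // sizing only the dominant resource counts; the fractional CPU is
    // merely reserved so the GPU task is never starved.
    struct TaskUsage {
        double ncpus; // e.g. 0.03
        double ncuda; // e.g. 1.0
    };

    // Which resource this task is charged against for share purposes.
    const char* dominant_resource(const TaskUsage& t) {
        return t.ncuda > 0 ? "cuda" : "cpu";
    }

    // CPU cores to reserve for n such GPU tasks running at once.
    int reserved_cpu_cores(const TaskUsage& t, int n) {
        return (int)std::ceil(t.ncpus * n);
    }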

I understand the entire scheduling / resource allocation / work fetch problem is much more complex. Still, I'm convinced it's better to build a complex structure on a solid foundation than to try to build it on something you already know is not adequate. It may seem like a pain to set up, but in the end it may save you even more in debugging time.

MrS
____________
Scanning for our furry friends since Jan 2002

Paul D. Buck
Message 9862 - Posted: 16 May 2009 | 14:55:42 UTC - in response to Message 9859.

Well, factor the "1.0 CUDA, 0.03 CPU" in, and ignore the 0.x CPU as far as resource share is concerned. If a project mixes tasks with high and low CPU usage, it will be difficult to maintain the correct cache size... but the current system has no way of dealing with such a situation either.

I understand the entire scheduling / resource allocation / work fetch problem is much more complex. Still, I'm convinced it's better to build a complex structure on a solid foundation than to try to build it on something you already know is not adequate. It may seem like a pain to set up, but in the end it may save you even more in debugging time.

Well, I started a discussion of this, but the reception wasn't even warm enough to call it tepid.


