
Message boards : Graphics cards (GPUs) : Boinc 6.6.24 imminent

jrobbio
Joined: 13 Mar 09
Posts: 59
Credit: 324,366
RAC: 0
Message 8810 - Posted: 24 Apr 2009 | 0:03:18 UTC

I was having a nosey at the BOINC Trac and spotted that they have just bumped the version to 6.6.24 (it looks like it will be compiled in Unicode and is GPLv3).

I thought these fixes might be of interest to those who are getting errors, so I've edited the changelog down to the GPU "highlights":

17869 and 17856: client: fix crash bug in CUDA init

17868 and 17855: client: new approach to handling multiple GPUs.

Old: find the fastest GPU, and pretend that the others are the same.
Problem: other GPUs might be less capable, and not able to handle jobs sent by the server.

New: find the most "capable" GPU, use others that are equivalent, and don't use those that are not. "Capable" is defined by:

* compute capability (i.e., hardware version)
* driver version
* memory size
* FLOPs

in that priority order.

17865 and 17847: client: improve CPU sched debug messages (say what kind of job and why we're scheduling it); client: log messages describing GPUs, one line per GPU; fixes #879
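
Just to illustrate what that priority ordering means in practice, here is a quick C++ sketch (my own illustration only, not the actual BOINC client code) of comparing GPUs by compute capability, then driver version, then memory, then estimated FLOPS, and marking as usable only the devices equivalent to the most capable one:

// Sketch only -- not the BOINC source. Orders GPUs by the changelog's
// priority list (compute capability, driver version, memory, FLOPS) and
// marks as usable only the ones equivalent to the most capable device.
#include <algorithm>
#include <cstddef>
#include <vector>

struct GpuInfo {
    int cc_major = 0, cc_minor = 0;   // compute capability
    int driver_version = 0;
    std::size_t mem_bytes = 0;
    double est_flops = 0.0;
    bool usable = false;
};

// true if a is strictly less capable than b, in the stated priority order
bool less_capable(const GpuInfo& a, const GpuInfo& b) {
    if (a.cc_major != b.cc_major) return a.cc_major < b.cc_major;
    if (a.cc_minor != b.cc_minor) return a.cc_minor < b.cc_minor;
    if (a.driver_version != b.driver_version) return a.driver_version < b.driver_version;
    if (a.mem_bytes != b.mem_bytes) return a.mem_bytes < b.mem_bytes;
    return a.est_flops < b.est_flops;
}

bool equivalent(const GpuInfo& a, const GpuInfo& b) {
    return !less_capable(a, b) && !less_capable(b, a);
}

void mark_usable(std::vector<GpuInfo>& gpus) {
    if (gpus.empty()) return;
    const GpuInfo best = *std::max_element(gpus.begin(), gpus.end(), less_capable);
    for (GpuInfo& g : gpus) {
        // "use others that are equivalent, don't use those that are not"
        g.usable = equivalent(g, best);
    }
}

With an ordering this strict, any mismatch in driver version, memory size or FLOPS is enough to drop a card, which is presumably why the mixed systems reported below fare so badly.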

Rob

Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Message 8826 - Posted: 24 Apr 2009 | 11:29:32 UTC
Last modified: 24 Apr 2009 | 11:30:31 UTC

For those with more than one GPU it seems to disqualify the second GPU ... not sure why yet ...

6.6.24 *MAY* also fix some of the "thrashing" where tasks are preempted far more often than they should be ... I cannot tell yet, of course, as I am not going to give up one GPU just for amusement's sake ...

{edit}

New message looks like:
4/23/2009 9:00:17 PM CUDA device: GeForce GTX 295 (driver version 18250, compute capability 1.3, 896MB, est. 106GFLOPS)

Kokomiko
Joined: 18 Jul 08
Posts: 190
Credit: 24,093,690
RAC: 0
Message 8876 - Posted: 24 Apr 2009 | 23:38:19 UTC

I have a GTX280 and an 8800GT in one system and can confirm this behavior. The 8800GT will not work under BOINC 6.6.24, so I had to downgrade to 6.6.23 to crunch with both cards at the same time.

Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Message 8880 - Posted: 25 Apr 2009 | 3:35:14 UTC - in response to Message 8876.

I have a GTX280 and an 8800GT in one system and can confirm this behavior. The 8800GT will not work under BOINC 6.6.24, so I had to downgrade to 6.6.23 to crunch with both cards at the same time.

In your case it is because the new "improved" rules mean that BOINC will only allow the GTX280 to process work, because it is the "best" card. You would need to get another card with the same characteristics before it will allow more than one. In effect, the stricter rules mean no mixed systems.

We are arguing on the alpha list for looser rules, though this will mean that the GPU scheduler will need to get "smarter"... at the moment it is hard to tell if we are getting through to SkyNet Systems ...

I understand the intent... but the cure is worse than the disease ...

Worse than that, the code is buggy, so the second GPU, of whatever stripe, always seems to be flagged as not equal to the others ... I cannot see the reason from the code, though I did catch one other minor error ... which is now fixed.

Dieter Matuschek
Joined: 28 Dec 08
Posts: 58
Credit: 231,884,297
RAC: 0
Message 8883 - Posted: 25 Apr 2009 | 6:03:09 UTC

After 6.6.24 was installed, GPUGRID WUs went to the state "waiting to run" on both GPUs of the GTX 295, although GPUGRID is the only GPU project running!
So I rolled back to 6.6.20 immediately.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 8912 - Posted: 25 Apr 2009 | 13:47:04 UTC - in response to Message 8810.

New: find the most "capable" GPU, use others that are equivalent, and don't use those that are not. "Capable" is defined by ...

Jeeez.. what are they thinking?! Didn't anyone tell them how popular it is to mix GPUs, for the single purpose of using all of them?

Why is it so difficult to

- make a smart decision on which coprocessors to use (e.g. default: all, but not when PC is in use)
- let the user specify in the options if he wants coprocessors to be excluded (display a list of which ones are detected)

Sure, that's not the entire deal yet as there are many problems left. But still.. it seems so much better than what they suggest and it doesn't seem much more complicated.

Sure, they hate giving options to the user (to avoid micromanagement) and would rather have BOINC take care of everything automatically. But at some point they also have to admit that right now they cannot even get the basics of co-processor handling straight. How are they ever planning on handling a system where an ATI is meant to crunch only Milkyway, a CUDA 1.3 card is meant to run GPU-Grid and SETI, a CUDA 1.0 card is meant to run only SETI, and a Larrabee does Einstein? An extreme example, but right now this would be totally hopeless.
[not that I'm saying Einstein would be developing a Larrabee client, I just wanted to give an example]
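
To make that concrete, here is a rough C++ sketch of the kind of per-device, per-project assignment I mean. It is purely hypothetical -- the names, structure and example URLs are made up for illustration and have nothing to do with BOINC's actual data model:

// Hypothetical illustration only: map each detected device to the set of
// project URLs it is allowed to crunch for (empty set = no restriction).
#include <map>
#include <set>
#include <string>

struct DevicePolicy {
    std::set<std::string> allowed_projects;   // empty means "any project"
};

// device index -> policy, roughly as a user might configure the mixed box above
std::map<int, DevicePolicy> device_policy = {
    {0, {{"http://milkyway.cs.rpi.edu/milkyway/"}}},                        // ATI card
    {1, {{"http://www.gpugrid.net/", "http://setiathome.berkeley.edu/"}}},  // CUDA 1.3 card
    {2, {{"http://setiathome.berkeley.edu/"}}},                             // CUDA 1.0 card
};

// would be consulted by the scheduler before handing a GPU job to a device
bool device_allowed(int device, const std::string& project_url) {
    auto it = device_policy.find(device);
    if (it == device_policy.end() || it->second.allowed_projects.empty()) return true;
    return it->second.allowed_projects.count(project_url) > 0;
}

The particular data structure doesn't matter; the point is that the client would have to track an allowed-project set per device instead of one yes/no answer for the whole machine.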

MrS
____________
Scanning for our furry friends since Jan 2002

Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Message 8933 - Posted: 25 Apr 2009 | 17:53:28 UTC - in response to Message 8912.

New: find the most "capable" GPU, use others that are equivalent, and don't use those that are not. "Capable" is defined by ...

Jeeez.. what are they thinking?! Didn't anyone tell them how popular it is to mix GPUs, for the single purpose of using all of them?

Yes we have ...

*NOW* you can see why I get so frustrated. I know some think I hate the BOINC developers because of the comments I make. But, that is not the reality. I hate the poor decisions they make. I'm Autistic, I don't do people things ... so I am not into liking or disliking people ... just what they do ... :)

If you browse the alpha mailing list, I pointed out very real configurations *I* have chosen in the past, all of which are not legal under the new 6.6.24 rules. Not to mention that the code seems to always reject device 1 (the second in the list) regardless of its capability. I have stared at the code and cannot figure that one out. Though I just submitted a suggestion, so we might get a handle on this one if they add my test and output print.

Why is it so difficult to

- make a smart decision on which coprocessors to use (e.g. default: all, but not when PC is in use)
- let the user specify in the options if he wants coprocessors to be excluded (display a list of which ones are detected)

I won't go into all the issues we have laid bare. But there are more fundamental problems with the BOINC resource scheduler (and work fetch, for that matter), and I have talked about some of them over at Rosetta@Home in the thread Resource Share Obsolete?, where I make a proposal to drop Resource Share and start using a simpler Priority system to control the work flows.

The problem, as simplistic as it might sound, is partly NIH combined with inertia, and I am suggesting that Dr. Anderson may have made a bad design decision in the conceptual model of BOINC. The last is usually the deal breaker, and there is no appeal if Dr. A does not like your idea. But the truth is that he has this glorious vision that everyone is going to use BOINC the way he thinks it is going to be used. The problem is that if you look at the statistics, well, they don't match.

His vision says that people will run out and join multiple projects. Truth? Most people join one project... if they join two or three the others are usually "safety" projects that they only run if there is a longer than normal outage of their core project.

I have done work for some 50 projects (56? I forget the exact number, but it is up there), some of which are not alive any longer. My normal attachment is to about 48 projects that have work either steadily or intermittently (most are intermittent). Thus, I am one of only 3,000-and-change people who have attached to and done work for this many projects.

Anyway, you can read the discussion of why RS does not map well to these user preferences and how Priority does better ... not that it is a complete design yet ...
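
As a very rough illustration of that Priority idea (my reading of it, not a real design and certainly not BOINC code): run the highest-priority project that has work, and let the lower-priority "safety" projects fill in only when everything above them is dry.

// Illustration only of a priority-based pick, as opposed to Resource Share debts.
#include <string>
#include <vector>

struct Project {
    std::string name;
    int priority;    // higher = preferred; "safety" projects get low values
    bool has_work;
};

// return the runnable project with the highest priority, or nullptr if none
const Project* pick_project(const std::vector<Project>& projects) {
    const Project* best = nullptr;
    for (const Project& p : projects) {
        if (!p.has_work) continue;
        if (!best || p.priority > best->priority) best = &p;
    }
    return best;
}

Resource Share, by contrast, tries to split time across all attached projects in proportion, which is exactly what the one-project-plus-safety-net crowd doesn't want.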

Sure, that's not the entire deal yet as there are many problems left. But still.. it seems so much better than what they suggest and it doesn't seem much more complicated.

Sure, they hate giving options to the user (to avoid micromanagement) and would rather have BOINC take care of everything automatically. But at some point they also have to admit that right now they cannot even get the basics of co-processor handling straight. How are they ever planning on handling a system where an ATI is meant to crunch only Milkyway, a CUDA 1.3 card is meant to run GPU-Grid and SETI, a CUDA 1.0 card is meant to run only SETI, and a Larrabee does Einstein? An extreme example, but right now this would be totally hopeless.
[not that I'm saying Einstein would be developing a Larrabee client, I just wanted to give an example]

The most fundamental and simplest problem: say I have something like a GTX 295, a GTX 260 and an 8600GT (or slower, not sure I have a good example here) ... if I am attached only to GPU Grid, it may be simple to detect the slowest card and reject its use entirely. If I am attached to GPU Grid and SaH, I can allocate the slower card to SaH and move on with life. The first problem is that Resource Shares may not be properly served by this arrangement ... another is what do I do in cases of deadline peril ...

Anyway, it is complicated ... but that is what we should be gaming out a little bit ... it *IS* going to be a sore spot if they opt for this draconian measure.

I would suggest that you use your influence with the project to have the project lead drop a line to Dr. Anderson ... he is far more responsive to missives from project leads than from peons like me ...

By this I mean GDF telling him that GPU Grid has 20% mixed systems and that this change will cost ... blah ...blah ... blah ...

JockMacMad TSBT
Joined: 26 Jan 09
Posts: 31
Credit: 3,877,912
RAC: 0
Message 9002 - Posted: 27 Apr 2009 | 18:25:45 UTC
Last modified: 27 Apr 2009 | 18:27:04 UTC

Okay, after updating to 6.6.24, BAD things happen.

On my solo GTX-260 machine I can no longer run ABC, Rosetta and GPUGrid.

GPUGrid is showing 0.43 CPU, 1 CUDA. With that, the CPU tasks run but the GPU does not. If I suspend ABC and Rosetta, GPUGrid runs fine (if a little slow; I'm thinking I need to time it). If I turn ABC back on, which has only 1 WU downloaded, then both GPUGrid and ABC run. If I suspend ABC and resume Rosetta, which has 10+ WUs, so that 4 CPU tasks are scheduled on the Q6600 plus GPUGrid, then all WUs run, with GPUGrid going for about 5 or 6 updates (tick-type updates) before it just stops. It says it is still running, but after 15 minutes it is stuck at the same % complete.

Is this the BOINC scheduler being a PITA?

rebirther
Joined: 7 Jul 07
Posts: 53
Credit: 3,048,781
RAC: 0
Message 9004 - Posted: 27 Apr 2009 | 18:38:27 UTC - in response to Message 9002.

Okay, after updating to 6.6.24, BAD things happen.

On my solo GTX-260 machine I can no longer run ABC, Rosetta and GPUGrid.

GPUGrid is showing 0.43 CPU, 1 CUDA. With that, the CPU tasks run but the GPU does not. If I suspend ABC and Rosetta, GPUGrid runs fine (if a little slow; I'm thinking I need to time it). If I turn ABC back on, which has only 1 WU downloaded, then both GPUGrid and ABC run. If I suspend ABC and resume Rosetta, which has 10+ WUs, so that 4 CPU tasks are scheduled on the Q6600 plus GPUGrid, then all WUs run, with GPUGrid going for about 5 or 6 updates (tick-type updates) before it just stops. It says it is still running, but after 15 minutes it is stuck at the same % complete.

Is this the BOINC scheduler being a PITA?


Only issues with these projects? I am running RCN, FreeHAL, WCG + GPUGrid without any problems on win32.

Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Message 9008 - Posted: 27 Apr 2009 | 19:26:16 UTC

Something in 6.6.x broke the scheduler and work fetch. I don't know exactly where ... but I have seen reports from 6.6.20 on ... 6.6.20 also has the added fun of sometimes running GPU tasks VERY slowly ... 6.6.23 and later fixed this but made work fetch worse.

The only cure for work fetch is to go back to 6.5.0 or to get used to resetting debts as needed. On my i7 that is about once every day or so.

Sadly, "they" (yes the omnipresent "They") are wearing me down because they say they want to fix things as long as they don't have to change any of the causes of the problems ...

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 9175 - Posted: 1 May 2009 | 14:00:50 UTC - in response to Message 8933.

I would suggest that you use your influence with the project to have the project lead drop a line to Dr. Anderson ... he is far more responsive to missives from project leads than from peons like me ...


I haven't had time for that yet. Considering the changes in 6.6.25, would it still be desirable to take any action? Did Berkeley already get the message that their new policy is ... not the best idea in the world?

MrS
____________
Scanning for our furry friends since Jan 2002

Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Message 9180 - Posted: 1 May 2009 | 20:12:50 UTC - in response to Message 9175.

I would suggest that you use your influence with the project to have the project lead drop a line to Dr. Anderson ... he is far more responsive to missives from project leads than from peons like me ...


I haven't had time for that yet. Considering the changes in 6.6.25, would it still be desirable to take any action? Did Berkeley already get the message that their new policy is ... not the best idea in the world?

NO, I don't think they got the message(s).

They can take the workaround right back out in the next build, and they have done that sort of thing before.

The point is that they are not considering, in any shape or form that I can see, the management of the CUDA cards in the machine. It is entirely possible that I could have 2 cards capable of GPU Grid and one that is so marginal I don't want to waste time with it, but which might be perfectly suitable for some other project. Yet the system is still structured around "all or nothing": if I include a GPU, it had better be capable for all the attached projects.

And, we already know that is not the case...


