Message boards : Graphics cards (GPUs) : Redundant results

Talknuser
Joined: 7 Apr 09
Posts: 4
Credit: 1,121,005
RAC: 0
Message 8678 - Posted: 21 Apr 2009 | 7:13:20 UTC

How many people are receiving the same workunit and exactly what does the time-limit mean?

I've had two units cancelled now after working on them for like 10 hours: 552067 and 542428. Both units would have finished well within the time limit!

Is the cancellation an error or project policy?

In other words, am I wasting my time here?
____________

Zydor
Joined: 8 Feb 09
Posts: 252
Credit: 1,309,451
RAC: 0
Message 8680 - Posted: 21 Apr 2009 | 9:47:54 UTC - in response to Message 8678.
Last modified: 21 Apr 2009 | 9:51:00 UTC

I have had one cancelled in the last twenty, but that one was not running. I also note another post on this two days ago that remains unanswered. WUs should not be cancelled pre-emptively if they are running.

I would also be grateful for a response on this. Cancelling models that are already running is an abuse of donated free time and resources and should not happen. It's a good idea to cancel redundant WUs that have yet to run. I can understand how they can become redundant, and I applaud the existence of such a facility; it's win/win for all concerned.

However, it is not acceptable to cancel those already running without pro rata credit for effort already expended, or at the very least to quietly kill them off on upload. There is a high level of Trust involved in crunching: we allow it to run automatically on our machines as a free donated resource, with free personal time and effort, and we trust it to be reliable, safe and of value. Pre-emptive action, in the way this appears to be implemented, is an abuse of that Trust and a matter of important Principle.

Regards
Zy

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 8688 - Posted: 21 Apr 2009 | 21:32:41 UTC

Mhh, I was not aware that WUs are also canceled if they already started. This is good for the project, but it's unacceptable for the participant if 0 credits are awarded for x hours of GPU time.

Talknuser, could you provide a link to the WUs in question or (temporarily) unhide your computers?

MrS
____________
Scanning for our furry friends since Jan 2002

Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Message 8689 - Posted: 21 Apr 2009 | 21:34:33 UTC

I am *NOT* a member of the project, but, there are all kinds of technical reasons for the project to issue work and then cancel tasks that have not been started.

The standard cancellation tool in the BOINC system will not cancel tasks that you have already started to process, so you should get credit for those.

The point is that, whatever the reason, the project is in fact looking out for all of us by canceling work that is not needed.

BOINC has some automated mechanisms, but they don't always give perfect control, so we sometimes see tasks downloaded that don't need running.

In truth, they HAVE canceled streams of tasks before they were issued too ... so ...

If you can't stand it then the alternative is to leave (sadly) because this is the nature of the beast here ... sometimes we get tasks that get "recalled" ... heck, I probably get more of them than just about anyone ... look in my account for computer w2 ...

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 8695 - Posted: 21 Apr 2009 | 22:00:02 UTC - in response to Message 8689.

Paul, the problem here is that he says these canceled WUs had already started and quite some computation time went into them.
For project speed it's still better to cancel them if they become redundant. But it's not fair to the user, so this should not happen, unless credits are paid via trickles or some similar system.

MrS
____________
Scanning for our furry friends since Jan 2002

Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Message 8697 - Posted: 21 Apr 2009 | 22:03:30 UTC - in response to Message 8695.

Paul, the problem here is that he says these canceled WUs had already started and quite some computation time went into them.
For project speed it's still better to cancel them if they become redundant. But it's not fair to the user, so this should not happen, unless credits are paid via trickles or some similar system.

MrS

That shows you how badly I am doing today ... I guess I should quit while I am behind ...

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 8702 - Posted: 21 Apr 2009 | 22:09:36 UTC - in response to Message 8697.

Well, I also just made a stupid mistake (in another thread) and refuse to recognize that it has been time for me to go to bed for half an hour. See you tomorrow ;)

MrS
____________
Scanning for our furry friends since Jan 2002

Talknuser
Joined: 7 Apr 09
Posts: 4
Credit: 1,121,005
RAC: 0
Message 8709 - Posted: 22 Apr 2009 | 6:18:34 UTC - in response to Message 8702.

Unhidden - only have 2 machines here so far :)

The Work unit ID shows that no work is done, but that is not the case!

Although I was not there to actually watch the cancellation, this smallish rig was well underway with both results at the time I left it...

Then, suddenly they were cancelled/redundant. Not the end of the world, but certainly a waste of time, and definitely a problem for smaller machines if this is a general issue...
____________

Snow Crash
Joined: 4 Apr 09
Posts: 450
Credit: 539,316,349
RAC: 0
Message 8721 - Posted: 22 Apr 2009 | 14:20:20 UTC - in response to Message 8709.

I don't think this is a general issue. I keep pretty close track of my WUs, as I am sure many other people do. I hesitate to say this, but ... is it possible you made a mistake when you looked at the tasks? You did have one error and one complete. Can your card actually crunch two WUs at the same time? The task list will say "In Process" as soon as a task gets sent to you; it does not mean they are all actively being crunched at the same time. Currently you have two tasks that say "In Process", but if you check BOINC Manager I think you will see one that is processing and another that is waiting to run.

Steve

Talknuser
Joined: 7 Apr 09
Posts: 4
Credit: 1,121,005
RAC: 0
Message 8734 - Posted: 22 Apr 2009 | 18:32:11 UTC - in response to Message 8721.

@Snow Crash

Like you I like to keep tabs on my units - at least when I start a project. When I'm sure things work I don't care ;)

And there's no chance this could be a mistake. The first one was an error for some reason - probably because I was still setting up the linux box at the time.
The second one got cancelled by the server with 10 hours or more completed. Same thing happened to #3.

Number 4 was actually allowed to finish and upload in time :)

Let's see what happens to #5 and #6 ;)

Anyway, this is really not worth wasting time on as no one from the project seems to bother. I only reported it because cycles were being wasted, and because I was not the first one to have this problem...

Have fun out there :)
____________

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 8745 - Posted: 22 Apr 2009 | 21:04:13 UTC - in response to Message 8734.

I assume the host you're talking about has to be this one. Let's try to dissect what's happening:
(I'm assuming your machine runs 24/7 and that linux BOINC doesn't suspend running GPU-Grid tasks.)


- 1st WU received should have started 1st. It supposedly ran for 22.5h, until it was canceled at 9:48 on the 19th.

- 2nd WU sent to you ran supposedly for 10h after the 1st one was canceled. It stopped with an error and at that time lists 9h of CPU time.

Does GPU-Grid occupy an entire core of your linux machine? If it does, the above looks probable. If not, say it's "only" 30 - 50% of one core, the situation looks different: in this case the 2nd WU could not have accumulated that much cpu time within 10h and thus must have been running before the 1st WU was canceled. Which would likely mean that the 1st WU did not yet start when it was canceled.


- the next 2 WUs were sent at the same time, so we can't say which one started first. Let's call them S for success and F for fail.

- S may have run from 8:54 on the 20th until 11:16 on the 22nd. That's 50:22h of wall clock time. It registers a run time of 47:05h. So under perfect conditions (i.e. WU runtime = wall clock time) there was a maximum possible runtime interval of 3:17h for "F".

- F could in principle have run before or after S. However, after S is impossible because S finished on the 22nd, whereas F was canceled on the 21st. If we assume F ran before S there's another problem: it would have started on the 20th and was aborted on the 21st. So it would have run for far more than the maximum of 3:17h that the minimum runtime of S leaves room for.

-> if the linux BOINC doesn't suspend running GPU-Grid tasks it is clear that F could not have been started when it was aborted.
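
The arithmetic for S and F can be checked with a few lines of Python. This is only a sketch: the timestamps and the 47:05h run time are the figures quoted in the post, the year 2009 is taken from the post dates, and the variable and function names are made up for illustration.

```python
from datetime import datetime, timedelta

# Figures quoted in the post (April 2009).
s_start = datetime(2009, 4, 20, 8, 54)      # earliest possible start of "S"
s_end = datetime(2009, 4, 22, 11, 16)       # "S" reported as finished
s_runtime = timedelta(hours=47, minutes=5)  # run time registered for "S"

wall_clock = s_end - s_start
# Whatever wall-clock time "S" did not use is the most "F" could have run.
max_f_runtime = wall_clock - s_runtime


def as_hhmm(delta: timedelta) -> str:
    """Format a timedelta as H:MM."""
    total_minutes = int(delta.total_seconds()) // 60
    return f"{total_minutes // 60}:{total_minutes % 60:02d}"


print("wall clock for S:", as_hhmm(wall_clock))
print("max runtime left for F:", as_hhmm(max_f_runtime))
```

This reproduces the 50:22h and 3:17h figures, under the same "perfect conditions" assumption that WU runtime equals wall clock time.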


Anyway, this is really not worth wasting time on as no one from the project seems to bother. I only reported it because cycles were being wasted, and because I was not the first one to have this problem...


Your report is greatly appreciated! The project can't fix what they don't know. And the project staff is quite busy, so they usually only reply if they have something worthwhile to say. No reply doesn't mean they're not watching :)

MrS
____________
Scanning for our furry friends since Jan 2002

Talknuser
Joined: 7 Apr 09
Posts: 4
Credit: 1,121,005
RAC: 0
Message 8769 - Posted: 23 Apr 2009 | 8:59:54 UTC - in response to Message 8745.

@ ETA

Thanks for your in-depth breakdown, which made me think :)

A couple of comments/observations:

* The unit with the error actually ran first, as #1 got stuck in the download queue.

* I've been monitoring the box closely today, and BOINC, for some reason, seems to allocate only 0.05 CPU to GPUGRID, meaning that this particular box (running 24/7) in practice runs GPUGRID for only 40% of the time as opposed to the expected 100% :(

With the above in mind, unit #2 (the erroneous one that ran first) actually would not have stopped running until after about 22.5 hours (as opposed to 9 hours), which is AFTER the two units in question were cancelled. Meaning that neither of the cancelled units would have had time to start!
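
As a rough check of that estimate: if the 9 hours are read as time actually spent running, and the task only ran about 40% of the time, the wall-clock time it occupied is 9 / 0.4. A minimal sketch, assuming the 40% figure held for the whole period (the function name is just for illustration):

```python
def wall_clock_hours(run_hours: float, duty_fraction: float) -> float:
    """Wall-clock time needed to accumulate run_hours at a given duty fraction."""
    if not 0 < duty_fraction <= 1:
        raise ValueError("duty_fraction must be in (0, 1]")
    return run_hours / duty_fraction


# 9 h of logged time at roughly 40% duty, as described in the post.
print(wall_clock_hours(9, 0.4))
```

That gives about 22.5 hours, matching the figure in the post.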

So, provided the above observations/assumptions are correct for the whole period, nothing was in fact wasted - except your time and mine ;)

Sorry I missed that this box was not running full tilt, guys :)

Next step is to find out why, but I won't bother you with that :D
____________

Zydor
Joined: 8 Feb 09
Posts: 252
Credit: 1,309,451
RAC: 0
Message 8771 - Posted: 23 Apr 2009 | 9:27:17 UTC - in response to Message 8769.
Last modified: 23 Apr 2009 | 9:34:17 UTC

The level of cpu use at 0.05% is a good thing. It indicates a low level of cpu involvement in the gpu application. In gpu crunching the cpu is there to load up the gpu with the initial data set (hence the pause when a gpu wu first starts: the data is being loaded by the cpu into the gpu), and it also passes "what to do next" instructions to the gpu. The gpu - in crude terms - is inherently stupid compared to the cpu, as it does not have integral instruction sets; it's a pure number cruncher, and relies on the cpu to drive it and tell it what to do next.

The lower the cpu number the better, as it indicates a more efficient gpu app. That in turn frees the cpu to get on with other things, such as more time to crunch a cpu based application - or lets you get on with the latest powerpoint presentation, etc etc, with minimal to zero lag/disruption.

Many BOINC gpu projects have much higher cpu assist percentages; the low number is a pat on the back to the gpu app devs, not a figure of concern or worry.

In SETI for example, their CUDA wu (non optimised) runs at 0.15% cpu assist, using an optimised SETI app it will run at 0.04%. The slightly higher figure of 0.05% in GPUGRID is an indicator of the complexity of the model being run compared to SETI's.

Regards
Zy

Michael Goetz
Joined: 2 Mar 09
Posts: 124
Credit: 46,573,744
RAC: 837,894
Message 8772 - Posted: 23 Apr 2009 | 9:43:03 UTC - in response to Message 8771.
Last modified: 23 Apr 2009 | 9:43:24 UTC

The lower the cpu number the better, as it indicates a more efficient gpu app.


I never really thought about it much, but I suspect that the amount of CPU used will have a lot to do with the speed of the CPU vs. the speed of the GPU, not just the efficiency of the software.

Put a monster video card in a computer and you'll see a much higher CPU usage than you would with a mediocre video card. The CPU has to work that much harder to keep the GPU running. Same effect with using a slow CPU vs. a fast CPU.

Put an 8000 class video card in an i7 machine and I'm sure you'll see *very* low CPU utilization!
____________
Want to find one of the largest known primes? Try PrimeGrid. Or help cure disease at WCG.

Zydor
Joined: 8 Feb 09
Posts: 252
Credit: 1,309,451
RAC: 0
Message 8774 - Posted: 23 Apr 2009 | 9:59:22 UTC - in response to Message 8772.
Last modified: 23 Apr 2009 | 10:00:55 UTC

Valid point, and it does, to a degree, have just that effect. Inevitably there is a "floor" below which the cpu assist number will not go; it will never be zero, as the gpu has no inherent Instruction Set "Intelligence". As gpu applications become more refined, they will perform faster, as the gpu app is tweaked both to perform the maths in a more efficient way and to ask for less help from the cpu.

GPU crunching is still in its infancy. There is a lot of "sledgehammer to crack a nut" going on inside the beast, and there is a huge latent power lurking in there yet to be fully tapped. The MW WU explosion was due to a model written especially for the gpu, not just an "adapted" cpu model. Other factors were clearly involved, double precision/single precision yaddie yadda, that led to short term fanboyism re ATI/NVidia cards. In truth, in the long term it will even out in performance/card vendor terms as gpu apps become more refined and specially written for a gpu.

The low cpu involvement is why the lower power cpu based machines can still produce cracking results with a gpu app, the gpu is doing all the work. In those cases there are hardware issues such as can the "older" cpu run the card on its motherboard in terms of data throughput (x16 x8 channel PCI etc etc).

Regards
Zy

Michael Goetz
Joined: 2 Mar 09
Posts: 124
Credit: 46,573,744
RAC: 837,894
Message 8776 - Posted: 23 Apr 2009 | 11:24:17 UTC - in response to Message 8774.

... as the gpu has no inherent Instruction Set "Intelligence".


Are you SURE about that? I'm pretty sure the GPUs are actually full blown computer cores. (Probably not x86-ish type CPUs; I think they're custom RISC processors.)

Granted, it's been about six months since I read through the documentation that comes with the CUDA SDK, but my impression was that the multi-processors on the Nvidia cards are complete CPU's in and of themselves.

Yes, there are vector processors (aka "shaders") on the cards. But each group of 8 shaders is attached to one of these multiprocessors which have full instruction sets. A GTX280 or 285, for example, has 30 multiprocessors -- essentially it's a (somewhat slow) 30-core CPU. What makes it so powerful is that the arithmetic unit on each of those cores is a vector processing unit that can do 8 calculations in parallel. Not to mention that there are 30 of those cores, which, in aggregate, have a total of 240 shaders.

It's possible that I'm misremembering what I read, or perhaps I misunderstood it, but my impression was that the CUDA processors could handle arbitrarily complicated programs all by themselves. The only shortcomings would be if you needed more memory than was available on the video card, or if I/O was required. Then you needed some coordination with the CPU.

The tricky part (and this applies to any vector processing system, not just CUDA), is writing the code in such a way that the parallelism is exploited to its fullest. That is a quite complex topic. (For example, if you have a branch instruction (an IF statement in a high level language) on a vector processor, what happens if the 8 different shaders/vector-processors don't yield the same branch result?)
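
One common answer to that question, sketched here in plain Python rather than real GPU code: when lanes disagree on a branch, both sides are run one after the other, and a per-lane mask decides which lanes keep each result. The 8-lane width follows the post; the function and variable names are invented for this illustration, and this is a toy model, not a description of how any particular chip or driver is implemented.

```python
def simd_branch(values, cond, then_fn, else_fn):
    """Toy model of a divergent branch across SIMD lanes.

    Every lane evaluates the condition. If lanes disagree, both sides of the
    branch are executed in turn, and a mask selects which lanes commit each
    side's result. Returns (results, passes_executed).
    """
    mask = [cond(v) for v in values]
    results = list(values)
    passes = 0

    if any(mask):  # at least one lane takes the "then" side
        passes += 1
        for lane, v in enumerate(values):
            if mask[lane]:
                results[lane] = then_fn(v)

    if not all(mask):  # at least one lane takes the "else" side
        passes += 1
        for lane, v in enumerate(values):
            if not mask[lane]:
                results[lane] = else_fn(v)

    return results, passes


lanes = [1, 2, 3, 4, 5, 6, 7, 8]  # one value per lane, 8 lanes as in the post

# Divergent: even and odd lanes take different sides, so both sides run.
print(simd_branch(lanes, lambda v: v % 2 == 0, lambda v: v * 10, lambda v: -v))

# Uniform: every lane agrees, so only one side runs.
print(simd_branch(lanes, lambda v: v > 0, lambda v: v * 10, lambda v: -v))
```

In this model the divergent case costs two passes over the lanes and the uniform case costs one, which is the reason branch-heavy code tends to lose part of the benefit of the vector width.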

Mike

uBronan
Joined: 1 Feb 09
Posts: 139
Credit: 575,023
RAC: 0
Message 8778 - Posted: 23 Apr 2009 | 11:57:50 UTC
Last modified: 23 Apr 2009 | 12:09:10 UTC

I don't agree with you that GPUs are full-blown processors. They are made to do some tasks, but to do them as fast as possible, and they are not nearly as complex as a CPU.
Maybe in time we will see this change, because they can be made intelligent to some degree, but for now they are basically raw data monsters.
They calculate some instructions and, because they are split up into 240 smaller units, do it lightning fast.
Still, a CPU tells the GPU what to do and feeds it a packet to work on, then goes back to other work until it gets a signal from the GPU that the job is done.
So in every way the GPU is just a simple co-processor which can calculate fast.

PS Look at the MythBusters example about GPUs: the CPU robot moves in a circle and shoots some paint pellets, then moves to the middle to make the eyes and mouth, whereas the GPU simply shoots many colors at once, making it look like it did more.
But you can't compare them at all, because the CPU has to make very complex moves and extras to come to a result, while the GPU cannon just had to shoot all the pellets at once.
So in itself the GPU did more work, yes, but with very, very simple instructions.

Michael Goetz
Joined: 2 Mar 09
Posts: 124
Credit: 46,573,744
RAC: 837,894
Message 8781 - Posted: 23 Apr 2009 | 12:33:29 UTC - in response to Message 8778.
Last modified: 23 Apr 2009 | 12:54:47 UTC

EDIT: post greatly shortened; I'm not going to argue about this. Read for yourself:

Here's the CUDA SDK documentation: http://www.nvidia.com/object/cuda_develop.html. In particular, you might want to take a look at this document.

Mike

[AF>France] Thierry Corne...
Joined: 19 Apr 09
Posts: 1
Credit: 1,053,798
RAC: 0
Message 8861 - Posted: 24 Apr 2009 | 19:30:06 UTC

Task ID | Work unit ID | Sent | Time reported | Server state | Outcome | Client state | CPU time (sec) | Claimed credit | Granted credit
569200 | 404560 | 23 Apr 2009 7:23:33 UTC | 24 Apr 2009 8:24:51 UTC | Over | Redundant result | Cancelled by server | 0.00 | --- | ---
569139 | 404534 | 23 Apr 2009 7:24:14 UTC | 24 Apr 2009 18:25:03 UTC | Over | Redundant result | Cancelled by server | 0.00 | --- | ---
554332 | 397382 | 20 Apr 2009 18:53:34 UTC | 21 Apr 2009 19:21:46 UTC | Over | Client error | Aborted by user | 0.00 | 0.00 | ---
549590 | 395293 | 19 Apr 2009 20:44:48 UTC | 21 Apr 2009 18:50:53 UTC | Over | Redundant result | Cancelled by server | 0.00 | --- | ---
549584 | 395292 | 19 Apr 2009 20:44:48 UTC | 21 Apr 2009 18:49:42 UTC | Over | Redundant result | Cancelled by server | 0.00 | --- | ---
549466 | 395240 | 19 Apr 2009 20:45:23 UTC | 20 Apr 2009 17:12:28 UTC | Over | Redundant result | Cancelled by server | 0.00 | --- | ---

A very strange BOINC project. Working, working, working without any credit.
I give computer time not only for the credits (some projects I crunch for don't give much per hour), but this project really holds the world record!!! 0 credits per hour.
Is it at least a useful project???

I was happy that my GPU could help BOINC projects, but I'm going to leave this project without any regret if there's no way for me to get at least one credit...
Is it normal that when you finish your WU, but not first (before the max time, of course), you don't get even one credit??? Please, at least one, so that I get more than 0 credits after hours of hard work ;-)
I'm new on this project.
Maybe it's a temporary bug. Any help?



____________

Paul D. Buck
Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Message 8864 - Posted: 24 Apr 2009 | 19:51:38 UTC

Note the zero compute time ... you lost nothing.

ignasi
Joined: 10 Apr 08
Posts: 254
Credit: 16,836,000
RAC: 0
Message 8866 - Posted: 24 Apr 2009 | 19:57:21 UTC - in response to Message 8861.

You started none of these workunits.
You lost no time.

ignasi

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 8896 - Posted: 25 Apr 2009 | 12:52:49 UTC

@Talknuser

Your post sounds as if you take the 0.05 CPU (5%) from the BOINC manager. This is just a number whose meaning I can not figure out (i.e. how it's generated, in earlier versions it was set by the project, but now it seems to be different on a per-host basis).

To get the actual cpu usage you'd have to take a look at your task manager. Under linux I'd open a console and type "top" when looking for a task with relatively high cpu usage (which should be the case here). Now there should be a list of running tasks, and I think the cpu usage is displayed per cpu core, i.e. with a quad core you can have 4 tasks at 100% each. Look for the GPU-Grid task (aecmd-something I think). I suppose you'll see between 30 and 50% usage of one cpu core.

If you can't find the task, but you know a part of its name, you could use grep to search the output.. forgot the syntax, though. Oh, and it could be that modern linux distributions also have some kind of task manager, which could be more convenient.
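
A hedged sketch of that kind of search, in Python instead of grep. It assumes a Linux `ps` that accepts `-eo pid,pcpu,comm` (the procps version does), and the name fragment is a placeholder, since the exact task name is not given here. Note that `ps` reports CPU usage averaged over the lifetime of the process, which is not the same as the momentary figure "top" shows. The function names and the sample process list are made up for the example.

```python
import subprocess


def filter_processes(ps_output: str, name_fragment: str):
    """Return (pid, cpu_percent, command) for rows whose command contains name_fragment."""
    matches = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)          # pid, %cpu, command
        if len(parts) != 3:
            continue
        pid, cpu, command = parts
        if name_fragment.lower() in command.lower():
            matches.append((int(pid), float(cpu), command.strip()))
    return matches


def list_matching(name_fragment: str):
    """Run ps on the local machine and keep only the matching processes."""
    out = subprocess.run(
        ["ps", "-eo", "pid,pcpu,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    return filter_processes(out, name_fragment)


# Demonstration on a made-up process list, so it runs anywhere.
sample = """  PID %CPU COMMAND
    1  0.0 init
 4242 37.5 example_gpu_task
 4300 99.0 other_cpu_task
"""
print(filter_processes(sample, "gpu_task"))
```

On a real machine, calling `list_matching` with part of the task's name would give the same kind of list for the live process table.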

MrS
____________
Scanning for our furry friends since Jan 2002

Clownius
Joined: 19 Feb 09
Posts: 37
Credit: 30,657,566
RAC: 0
Message 8900 - Posted: 25 Apr 2009 | 13:04:55 UTC

If it's a KDE-based distro, try ksysguard. It's fairly good; I use it on Kubuntu.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 8901 - Posted: 25 Apr 2009 | 13:06:59 UTC

@Zydor & Michael

There is no such thing as "instruction set intelligence".. that would devalue the miracle that is our brain a little too much ;)

However, there is such a thing as instruction set complexity. And the ability to execute complex instructions. And the ability to execute complex flow control instructions. The latter is what the CPU is made for: dealing with all those branches and conditions (if, while etc.) quickly. Current GPUs also support such instructions (to what extent doesn't matter), but they are much slower at executing these than at executing "regular" code.

If one wanted to make them more efficient for such code one would end up with an i7 with a wider vector unit attached. Or at least a Pentium 1 with a wide vector unit. Uh, sting me a Larrabee if we ever actually get a chip like that..

MrS
____________
Scanning for our furry friends since Jan 2002

X1900AIW
Joined: 12 Sep 08
Posts: 74
Credit: 23,566,124
RAC: 0
Message 9000 - Posted: 27 Apr 2009 | 17:22:00 UTC - in response to Message 8678.

589184
Redundant after crunching on it for hours? OK, I had a good run with the settings for a few days, but I'll switch this host to folding@home now.

CPU time 2374.787
...
Outcome Redundant result
Client state Cancelled by server
Exit status -221 (0xffffffffffffff23)
...
- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x77E6000C

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 9029 - Posted: 27 Apr 2009 | 21:44:09 UTC

Someone from the project team cancelled too many WUs while trying to fix the download problems (which were due to outdated WUs).

MrS
____________
Scanning for our furry friends since Jan 2002

Jurgen
Joined: 10 Jan 09
Posts: 3
Credit: 114,473,253
RAC: 0
Message 9960 - Posted: 19 May 2009 | 3:13:34 UTC

Here are my observations regarding the Redundant result issue:

It seems the "IBUCH" workunits are sent out twice, and whoever finishes the WU first gets credit while the other participant gets a "cancelled by server"/"Redundant result" error and no credit, even if that participant reports his/her result minutes later and well before the deadline.

Example: http://www.gpugrid.net/workunit.php?wuid=466213

Note the "initial replication" parameter = 2. All the other WUs have an initial replication of 1.

This is not fair to participants with older cards, who will always lose out against the GTX 295s and will never get any credit for this type of WU. I have had several of these cases happen, so now I manually abort this type of WU when I happen to notice one in my queue.

It's OK to send the same WU to several participants (the SETI@home project does that by default), but everybody who completes the WU in time should get the credit he/she deserves.

Zydor
Joined: 8 Feb 09
Posts: 252
Credit: 1,309,451
RAC: 0
Message 9962 - Posted: 19 May 2009 | 9:32:23 UTC - in response to Message 9960.
Last modified: 19 May 2009 | 9:39:36 UTC

This was cleared up a while back. There was a suspicion that this was happening, but it was shown at the time that the WU in question had not started on the machine in question. If it has returned as an issue I suspect they will jump on it, as that is not "as designed".

The server cancel facility is only designed to act on WUs that have not started on a machine. Those that have started still get the credit if successfully completed. If they get cancelled in mid-crunch then a bug has surfaced; "complete if started" was the intended principle when the facility was first implemented.
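
The intended policy can be written down as a small decision rule. This is a sketch of the behaviour as described in this post, not the BOINC server's actual code; the class, field, and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    host: str
    started: bool  # has the host begun crunching this copy?


def on_first_valid_result(tasks, finished_host):
    """Decide what happens to each copy of a WU once one host has returned it.

    Copies that have not started are cancelled as redundant; copies that are
    already running are left alone so they can finish and earn credit.
    """
    decisions = {}
    for task in tasks:
        if task.host == finished_host:
            decisions[task.host] = "credit"
        elif task.started:
            decisions[task.host] = "keep running"
        else:
            decisions[task.host] = "cancel (redundant)"
    return decisions


copies = [
    Task("fast-host", started=True),
    Task("slow-host-running", started=True),
    Task("slow-host-queued", started=False),
]
print(on_first_valid_result(copies, "fast-host"))
```

Under this rule a cancelled copy should always show zero compute time, which is the sign to look for on a result page before concluding that work was lost.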

Regards
Zy

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 9972 - Posted: 19 May 2009 | 19:40:34 UTC

100% agreed with Zydor.

Jurgen, how do you know that "that participant reports his/her result minutes later and well before the deadline"? Sure, the logged completion time is shortly after the first result is returned. But nowhere does it say that work had already started. Note the exactly 0s of cpu time.. even if WUs error out instantaneously they mostly register 1 - 3s of cpu time. So it looks like this result was aborted before it had started.

MrS
____________
Scanning for our furry friends since Jan 2002

Jurgen
Joined: 10 Jan 09
Posts: 3
Credit: 114,473,253
RAC: 0
Message 9984 - Posted: 20 May 2009 | 1:23:07 UTC - in response to Message 9972.

I've seen this happen: a WU was 90+% complete, but when I checked an hour later, somebody else had reported results 10 minutes before my WU completed and I got the old "Redundant Result" stuff. I just received another of these WUs, # 475454.

http://www.gpugrid.net/workunit.php?wuid=475454

So for the record: processing has started; I'm at 1%... I made a screenshot, not sure how to upload pictures. The WU was also sent to another participant with a GTX 295 - I'll be creamed for sure. ;-)

Will babysit to see what happens and post a follow up.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 10018 - Posted: 21 May 2009 | 9:46:57 UTC - in response to Message 9984.

I'll be creamed for sure. ;-)


Doesn't look that spectacular for now ;)

MrS
____________
Scanning for our furry friends since Jan 2002

Maurice Goulois
Joined: 22 Feb 09
Posts: 10
Credit: 103,904,673
RAC: 0
Message 10060 - Posted: 22 May 2009 | 2:57:11 UTC

Hi there,

I'm experiencing this kind of suspect behaviour myself. I have had one machine attached to GPUGRID for months, and I recently removed the SETI project because of the sluggishness it caused on my system. In that respect GPUGRID is much better at not disturbing the other activities I have on this PC.

For about 10 days this PC has been dedicated to GPUGRID, and since then more than half of my WUs have been cancelled as "redundant results" with no credit. The problem is that this machine runs GPUGRID 24/7 (with an 8800GT, which takes about a day to complete most WUs).

As a test of the cancellation of started WUs, I've just suspended the currently running WU to force the second one to start, and then reverted so that it continues with the first one, which has the earlier deadline. I'll see how these two behave after upload and possible cancellation.

I'll let you know how it goes.

Regards
____________

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 10069 - Posted: 22 May 2009 | 21:04:09 UTC - in response to Message 10060.

Could you make sure that your 2nd WU checkpointed at least once? I.e. when you shut down the BOINC client and restart it, the WU should not start at 0.000% again.

Is it this WU? You can watch your wingman: after he returns his WU your WU would be finished the next time you contact the scheduler - if it has not started yet.

MrS
____________
Scanning for our furry friends since Jan 2002

Jurgen
Joined: 10 Jan 09
Posts: 3
Credit: 114,473,253
RAC: 0
Message 10078 - Posted: 23 May 2009 | 1:04:58 UTC - in response to Message 9984.

I've seen this happen: a WU was 90+% complete, but when I checked an hour later, somebody else had reported results 10 minutes before my WU completed and I got the old "Redundant Result" stuff. I just received another of these WUs, # 475454.

http://www.gpugrid.net/workunit.php?wuid=475454

So for the record: processing has started; I'm at 1%... I made a screenshot, not sure how to upload pictures. The WU was also sent to another participant with a GTX 295 - I'll be creamed for sure. ;-)

Will babysit to see what happens and post a follow up.



Update: the test was inconclusive, as I was the first user to successfully finish crunching the WU. The other participant still shows as "In Progress"... only if the status changes to "success" and credits are also awarded can we conclude that there aren't any issues. I did notice that for some WUs that were distributed to more than one user, credit got awarded to more than one user, so at this time I concur with Zydor that all works fine.

Maurice Goulois
Joined: 22 Feb 09
Posts: 10
Credit: 103,904,673
RAC: 0
Message 10080 - Posted: 23 May 2009 | 2:54:04 UTC

Hi there again,

My problem is related to the task 25-KASHIF_HIVPR_n1_for_ba3-9-100-RND6818 (http://www.gpugrid.net/workunit.php?wuid=467482), which was stuck at 24.820% and preventing the other tasks from starting. I've cancelled it and I'll keep an eye on things over the next few days.


____________

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 10089 - Posted: 23 May 2009 | 14:45:18 UTC - in response to Message 10078.
Last modified: 23 May 2009 | 14:46:45 UTC

... only if the status changes to "success" and credits are also awarded we can conclude that there aren't any issues. I did notice that for some WUs that were distributed to more than one user, credit got awarded to more than one user, so at this time I now concur with Zydor that all works fine.


Well.. no. We already know that the system works as expected most of the time. For example take a look at my results.. there are a few redundant results, but the WU return times are so regular that I don't think the machine wasted any time on them. And there are quite a few successful returns from 2 hosts where both got credit.

The point was that people reported "I've been watching it and I know something went wrong". So we need to confirm an error, otherwise we still know it's fine :)
(in one case I analyzed the tasks in detail and we found out that actually everything had been alright.. but now there were 1 or 2 new reports)

Edit: Maurice, did you try restarting BOINC? If not you may want to try that first if another task hangs. Sometimes that's enough to get it going until it finishes.

MrS
____________
Scanning for our furry friends since Jan 2002

Maurice Goulois
Joined: 22 Feb 09
Posts: 10
Credit: 103,904,673
RAC: 0
Message 10183 - Posted: 26 May 2009 | 10:42:39 UTC - in response to Message 10080.

Hi,

I confirm that my problem was related to the blocked WU, everything ok since its cancellation.


____________
