Message boards : Number crunching : Remaining (Estimated) time is unusually high; duration correction factor unusually large
Problem: Remaining (Estimated) time is unusually high; duration correction factor unusually large

ID: 32263
It could easily be caused by a wrong estimate in some WU. But since no one else replied, it doesn't seem to be a persistent issue.

ID: 32405
There was at least 1 other user that reported it also, in the beta news thread.

ID: 32409
I think the run time estimates are app-based, so when a new batch of different WUs runs, it takes time for their estimated run time to auto-correct (on everyone's systems).

ID: 32421
"There was at least 1 other user that reported it also, in the beta news thread."

You could go into the client_state.xml file and edit the DCF for GPUGrid tasks to whatever suits you. Problem solved.
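For anyone tempted to try that, here's a minimal sketch of the edit (with BOINC fully shut down first; the file path assumes a default Windows install, and the element names are the ones quoted from client_state.xml later in this thread):

```python
# Sketch only: reset GPUGrid's duration_correction_factor to 1.0.
# Shut BOINC down completely before running this, and keep the backup --
# the client rewrites client_state.xml continuously while it is running.
import shutil
import xml.etree.ElementTree as ET

STATE = r"C:\ProgramData\BOINC\client_state.xml"   # adjust for your install

shutil.copy(STATE, STATE + ".bak")                 # safety net
tree = ET.parse(STATE)
for project in tree.getroot().iter("project"):
    if "gpugrid" in project.findtext("master_url", "").lower():
        dcf = project.find("duration_correction_factor")
        if dcf is not None:
            dcf.text = "1.000000"
tree.write(STATE)
```

(As discussed further down the thread, this only changes the displayed estimate; it doesn't protect against EXIT_TIME_LIMIT_EXCEEDED.)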
ID: 32432
Until you get more tasks that throw it out of balance.

ID: 32433
I see this too - the remaining time is not correct. I first saw it after we got a lot of MJH beta tests to optimize the app, and then "normal" WUs again.

ID: 32436
"Until you get more tasks that throw it out of balance."

Agreed - it's not an option to let everyone manually edit config files. Since yesterday you've convinced me it's not an isolated issue.

MrS
____________
Scanning for our furry friends since Jan 2002

ID: 32491
... And now my Duration Correction Factor is back up to 12!

ID: 32586
Set aside the WUs which are failing with 'max time exceeded' errors, which are a symptom of another as yet unfixed problem...

ID: 32591
I'm fairly certain DCF is very relevant.

ID: 32592
Well, if this DCF thing is keeping a running average of WU runtimes, then I expect it is simply losing its marbles when it encounters one of the WUs that make no progress.

ID: 32593
I'm not certain why it is "losing its marbles" lol, but I don't think it's related to a "no progress" task.

ID: 32594
I agree with what Jacob says here.

ID: 32599
My GTX 670 host 132158 has been running through a variety of Beta tasks:

```
<rsc_fpops_est>5000000000000000.000000</rsc_fpops_est>
<rsc_fpops_bound>250000000000000000.000000</rsc_fpops_bound>
```

Giving identical estimates to such a wide range of tasks is going, I'm afraid, to end in tears. BOINC Manager has translated those fpops_est into hours and minutes for me. For the cuda55 plan class, the estimate is roughly right for the NOELIA full task I've just completed - 8 hours 41 minutes. But to get that figure, BOINC has had to apply a DCF of 18.4 - it should be somewhere round 1.0.

On the other hand, one of the test units was allocated under cuda42, and was showing an estimate of 24 minutes. Same size task, different estimates - that can only happen one way, and here it is:

```
<app_name>acemdbeta</app_name>
<version_num>804</version_num>
<platform>windows_intelx86</platform>
<avg_ncpus>0.350000</avg_ncpus>
<max_ncpus>0.571351</max_ncpus>
<flops>52770755214365.750000</flops>
<plan_class>cuda42</plan_class>

<app_name>acemdbeta</app_name>
<version_num>804</version_num>
<platform>windows_intelx86</platform>
<avg_ncpus>0.350000</avg_ncpus>
<max_ncpus>0.666596</max_ncpus>
<flops>2945205633626.370100</flops>
<plan_class>cuda55</plan_class>
```

Pulling those out into a more legible format, my speeds are supposed to be:

cuda42: 52,770,755,214,365
cuda55: 2,945,205,633,626

That's 3 TeraFlops for cuda55, and over 50 TeraFlops for cuda42 - no, I don't think so. The marketing people at NVidia, as interpreted by BOINC, do give these cards a rating of 2915 GFLOPS peak, but I think we all know that not every flop is usable in the real world - pesky things like PCIe bus transfers, and memory reads/writes, get in the way.

The APR (Average Processing Rate) figures for that host are shown on the application details page. For mainstream processing under cuda42, I'm showing 147 (units: GigaFlops) for both short and long runs. About one-twentieth of theoretical peak feels about right, and matches the new DCF. For the new Beta runs I'm showing crazy speeds here as well - up to (and above) 80 TeraFlops.

Guys, I'm sorry, but that's what happens when you send out 1-minute tasks without changing <rsc_fpops_est> from the value you used for 10-hour tasks.
ID: 32603
Yes, a million times over!

ID: 32605
Disclaimer first: I speak as a long-term (interested) volunteer on a number of BOINC projects, but I have no first-hand experience of administering a BOINC server. What follows is my personal opinion only, but an opinion informed by seeing similar crises at a number of projects.

ID: 32607
:) I hope we don't have to travel to France for a solution.

ID: 32608
Many projects have separate apps (and queues) for different task types, as it makes things more manageable:

- ACEMD short runs (2-3 hours on fastest card) for CUDA 3.1 - deprecated?
- ACEMD beta
- ACEMD long runs (8-12 hours on fastest GPU) for CUDA 4.2
- ACEMD beta version
- Long runs (8-12 hours on fastest card)

ID: 32609
Each task is sent with an estimate. You can even view that estimate in the task properties. It can be different amongst tasks within the same app.

ID: 32610
The opposite happens too. I now have a NOELIA_KLEBEbeta-2-3... that has done 9% in 1 hour, but says it will be finished in 5 minutes...?

ID: 32611
"Each task is sent with an estimate. You can even view that estimate in the task properties."

There are a number of components which go towards calculating that estimate. After playing around with those Beta tasks yesterday, I've now been given a re-sent NATHAN_KIDKIXc22 from the long queue (WU 4743036), so we can see what will happen when all this is over.

From <workunit> in client_state:

```
<name>I6R6-NATHAN_KIDKIXc22_6-12-50-RND1527</name>
<app_name>acemdlong</app_name>
<version_num>803</version_num>
<rsc_fpops_est>5000000000000000.000000</rsc_fpops_est>
<rsc_fpops_bound>250000000000000000.000000</rsc_fpops_bound>
```

From <app_version> in client_state:

```
<app_name>acemdlong</app_name>
<version_num>803</version_num>
<platform>windows_intelx86</platform>
<avg_ncpus>0.666596</avg_ncpus>
<max_ncpus>0.666596</max_ncpus>
<flops>142541780304.165830</flops>
<plan_class>cuda55</plan_class>
```

From <project> in client_state:

```
<duration_correction_factor>19.676844</duration_correction_factor>
```

It's the local BOINC client on your machine that puts all those figures into the calculator:

Size: 5,000,000,000,000,000 (5 PetaFpops, 5 quadrillion calculations)
Speed: 142,541,780,304 (142.5 GigaFlops)
DCF: 19.67

Put those together, and my calculator gets 690,213 seconds - 192 hours, or 8 days. 28% of the way through the task (in 2.5 hours), BOINC is still estimating 174 hours - over a week - to go. BOINC is very slow to switch from 'estimate' to 'experience' as a task is running. We're going to get a lot of panic (and possibly even aborted tasks) from inexperienced users before all this unwinds.
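The arithmetic is simple enough to check for yourself - a quick sketch using the three figures above:

```python
# Reproduce the client's runtime estimate from the client_state.xml values above.
rsc_fpops_est = 5_000_000_000_000_000   # <rsc_fpops_est>: 5 PetaFpops
flops         = 142_541_780_304         # <flops>: 142.5 GigaFlops
dcf           = 19.676844               # <duration_correction_factor>

seconds = rsc_fpops_est / flops * dcf
print(f"{seconds:,.0f} s = {seconds / 3600:,.0f} hours = {seconds / 86400:.1f} days")
# -> about 690,000 s, i.e. ~192 hours, or 8 days
```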
ID: 32618
The BOINC manager suspends a WU which has a normal estimated run time when it receives a fresh WU with an overestimated run time (my personal high score is 2878(!) hours), which makes my batch programs think that they are stuck (actually they are; it's intentional, to give priority to the task with the overestimated run time).

ID: 32626
"The BOINC manager suspends a WU which has a normal estimated run time when it receives a fresh WU with an overestimated run time (my personal high score is 2878(!) hours)..."

It depends which version of the BOINC client you run. I'm testing new BOINC versions as they come out, too - that rig is currently one step behind, on v7.2.10. The behaviour of 'stopping the current task, and starting a later one' when in High Priority was acknowledged to have been a bug, and has been corrected now. BOINC is hoping to promote v7.2.xx to 'recommended' status soon - that should cure your annoyance.

ID: 32627
"This is really annoying."

I agree, it is annoying. I reported it as soon as I spotted it, over a week ago. Hopefully the admins take it a bit more seriously.

ID: 32630
"This is really annoying."

The problem Retvari Zoltan is annoyed about - BOINC suspending one task and running a different one when high priority is needed - isn't something the project admins can solve (except by getting the estimates right so EDF isn't needed, obviously). Decisions about which task from the cache to run next are taken locally by the BOINC core client. v6.10.60 is getting quite old now - and yes, this bug has been around that long. It was fixed in v7.0.14:

"EDF policy says we should run the ones with earliest deadlines."
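For anyone unfamiliar with the term, EDF (Earliest Deadline First) means exactly what it says - a toy illustration (the task names and deadlines here are made up):

```python
# Earliest Deadline First: when BOINC enters panic mode, it runs the
# runnable task whose report deadline comes soonest, regardless of queue order.
from datetime import date

tasks = [
    ("I6R6-NATHAN_KIDKIXc22 (long)", date(2013, 9, 10)),   # hypothetical deadlines
    ("NOELIA_KLEBEbeta (beta)",      date(2013, 9, 8)),
]
name, deadline = min(tasks, key=lambda t: t[1])
print(f"run next: {name} (due {deadline})")
```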
ID: 32634
"The problem Retvari Zoltan is annoyed about - BOINC suspending one task and running a different one when high priority is needed - isn't something the project admins can solve (except by getting the estimates right so EDF isn't needed, obviously)."

That's why I've posted about my annoyance here. This overestimation misleads the BOINC manager in another way: it won't ask for new work, since it thinks that there is enough work in its queue.

"Decisions about which task from the cache to run next are taken locally by the BOINC core client. v6.10.60 is getting quite old now - and yes, this bug has been around that long. It was fixed in v7.0.14."

There is another annoying bug, and an annoying GUI change, which keep me from upgrading from v6.10.60. The bug is in the calculation of the required CPU percentage for GPU tasks: it can change from below 0.5 to over 0.5, and on a dual-GPU system that change results in a 1-CPU-thread fluctuation. v6.10.60 underestimates the required CPU percentage for Kepler-based cards (0.04%), so the number of available CPUs won't fluctuate - this bug comes in handy. The annoying GUI change is the omitted "messages" tab (actually it's relocated to a submenu).

ID: 32636
I've just picked up a new Beta task - 7238202

ID: 32639
You could configure the CPU percentage yourself with an app_config (which the later BOINCs support). I could send you mine, if you're interested. And the Messages tab... well, it's annoying, but only really needed when there are problems - which, fortunately, isn't all that often for me.
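For reference, something along these lines would do it (a sketch, not my actual file - the app name and the 0.5 CPU figure are placeholders, and the paths assume a default Windows install; check the <app_name> entries in your own client_state.xml):

```python
# Sketch: write an app_config.xml that pins the CPU reservation per GPU task,
# so it can't fluctuate across client versions.  Afterwards, use
# 'Options > Read config files' in BOINC Manager to apply it.
from pathlib import Path

APP_CONFIG = """\
<app_config>
  <app>
    <name>acemdlong</name>          <!-- placeholder: use your real app name -->
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>    <!-- one task per GPU -->
      <cpu_usage>0.5</cpu_usage>    <!-- fixed CPU fraction, your choice -->
    </gpu_versions>
  </app>
</app_config>
"""

path = Path(r"C:\ProgramData\BOINC\projects\www.gpugrid.net") / "app_config.xml"
path.write_text(APP_CONFIG)
print(f"wrote {path}")
```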
ID: 32640
"I've just picked up a new Beta task - 7238202"

Completed in under 2 hours, and awarded 150,000 credits. This is getting silly.

ID: 32644
Just a heads-up: estimated times for the current v8.10 (cuda55) Beta are unusually low - it is likely that many runs (especially of the full-length production NOELIA_KLEBEbeta tasks being processed through the Beta queue) will fail with EXIT_TIME_LIMIT_EXCEEDED - after about an hour, on my GTX 670.

ID: 32658
"Just a heads-up: estimated times for the current v8.10 (cuda55) Beta are unusually low - it is likely that many runs (especially of the full-length production NOELIA_KLEBEbeta tasks being processed through the Beta queue) will fail with EXIT_TIME_LIMIT_EXCEEDED - after about an hour, on my GTX 670."

Yup, I just had some tasks fail because of that poor server estimation.

ID: 32659
This problem should be confined to the beta queue and is a side-effect of having issued a series of short-running WUs with the same fpops estimate as normal longer-running ones.

ID: 32670
All the WUs actually running on this host have an estimate of +/- 130 hrs! (normal time to crunch a WU = 8-9 hrs)

ID: 32671
"Please let me know if you start to see this problem on the important acemdshort and acemdlong queues. There's no reason why it should be happening there (any more than usual), but the client is full of surprises."

I have tried 3 betas in the long queue, and two have failed at almost exactly the same running times. One was a Noelia, and the other was a Harvey. (On a GTX 650 Ti under Win7 64-bit and BOINC 7.2.11 x64)

```
8.10 ACEMD beta version (cuda55) 66-MJHARVEY_TEST10-42-50-RND0504_0 01:58:02 Reported: Computation error (197,)
```

ID: 32672
I'm not sure what you mean by that. Those are WUs from the acemdbeta queue, run by the beta application. The acemdlong queue isn't involved.

MJH

ID: 32673
Sorry. I thought you were asking about the betas. No problems with the longs thus far. All seven that I have received under CUDA 5.5 have been completed successfully.

```
8.03 Long runs (cuda55) 041px89-NOELIA_FRAG041p-3-4-RND5262_0    17:41:48
8.03 Long runs (cuda55) 041px89-NOELIA_FRAG041p-2-4-RND5262_0    17:39:49
8.03 Long runs (cuda55) 063ppx290-NOELIA_FRAG063pp-1-4-RND3152_0 20:59:32
8.03 Long runs (cuda55) I35R7-NATHAN_KIDKIXc22_6-8-50-RND8566_0
8.02 Long runs (cuda55) I50R6-NATHAN_KIDKIXc22_6-3-50-RND0333_0  17:48:16
8.00 Long runs (cuda55) I81R8-NATHAN_KIDKIXc22_6-4-50-RND0944_0  17:44:35
```

ID: 32674
"All the WUs actually running on this host have an estimate of +/- 130 hrs! (normal time to crunch a WU = 8-9 hrs)"

That's DCF in action. It will work itself down eventually, but may take 20-30 tasks with proper <rsc_fpops_est> to get there. Unless DCF has already reached over 90 - the normalisation process is slower in those extreme cases.
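As a rough illustration of why it takes that many tasks - assuming, purely for the sake of argument, that each well-estimated result pulls DCF down by about 10% (the client's real damping is more involved):

```python
# How many well-estimated tasks to drag DCF from ~20 back to ~1,
# if each one knocks roughly 10% off?  (Illustrative assumption only.)
dcf, tasks = 19.7, 0
while dcf > 1.1:
    dcf *= 0.9
    tasks += 1
print(tasks)   # ~28 -- consistent with the '20-30 tasks' above
```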
ID: 32676
MJH,

"This problem should be confined to the beta queue and is a side-effect of having issued a series of short-running WUs with the same fpops estimate as normal longer-running ones."

ID: 32677
I think he means long tasks, like those NOELIA_KLEBE jobs, being processed through the Beta queue currently alongside your quick test pieces.

The problem is that if you run a succession of short test units with full-size <rsc_fpops_est> values, the BOINC server thinks your host is insanely fast - it thinks my GTX 670 can complete ACEMD beta version 8.11 tasks (bottom of linked list) at 79.2 TeraFlops. When BOINC attempts any reasonably long job (a bit over an hour, in my case), the client thinks something has gone wrong, and aborts it for taking too long.

There's nothing the user can do to overcome that problem, except 'inoculate' each individual task as received with a big (100x or 1000x) increase to <rsc_fpops_bound>.
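For anyone who wants to try that inoculation, a sketch of the edit (same caveats as before: BOINC shut down and a backup kept; filtering on 'beta' in the name is my assumption about which tasks need it):

```python
# Sketch: multiply <rsc_fpops_bound> by 100 for beta workunits still in
# client_state.xml, so the client won't abort them with EXIT_TIME_LIMIT_EXCEEDED.
import shutil
import xml.etree.ElementTree as ET

STATE = r"C:\ProgramData\BOINC\client_state.xml"

shutil.copy(STATE, STATE + ".bak")
tree = ET.parse(STATE)
for wu in tree.getroot().iter("workunit"):
    if "beta" in wu.findtext("name", ""):          # crude filter, adjust to taste
        bound = wu.find("rsc_fpops_bound")
        if bound is not None:
            bound.text = f"{float(bound.text) * 100:.6f}"
tree.write(STATE)
```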
ID: 32679
The server doesn't have any part in it - it's the client making that decision. Anyway, the last batch of short WUs has been submitted, with no more to follow. Hopefully the client will be as quick to correct itself back as before.

MJH

ID: 32682
I beg to differ. Please have a look at http://boinc.berkeley.edu/trac/wiki/RuntimeEstimation - the problem is the host_app_version table described under 'The New System'. You have that here: the server calculates the effective speed of the host, based on an average of previously completed tasks. You can see - but not alter - those 'effective' speeds as 'Average Processing Rate' on the application details page for each host. That's what I linked for mine: the units are gigaflops.

The server passes that effective flops rating to the client with each work allocation, and yes: it's the client which makes the final decision to abort work with EXIT_TIME_LIMIT_EXCEEDED - but it does so on the basis of data maintained and supplied by the server.
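You can see the mechanism in the numbers: the client's abort threshold is (roughly) <rsc_fpops_bound> divided by that server-supplied speed. A quick sketch with the figures quoted above:

```python
# Why an inflated APR kills long tasks: the client's time limit is
# (roughly) rsc_fpops_bound / flops, using the server-supplied speed.
rsc_fpops_bound = 250_000_000_000_000_000   # 2.5e17, from the workunit
apr_flops       = 79.2e12                   # the bogus 79.2 TeraFlops APR

limit = rsc_fpops_bound / apr_flops
print(f"{limit:,.0f} s = {limit / 60:.0f} minutes")
# -> ~3,157 s, about 53 minutes -- so anything much over an hour gets aborted
```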
ID: 32683
In the last month, my two machines seem to have stabilized for errors. The last errors were the server-cancelled runs on Aug 24, and before that Jul 30 - Aug 11 with the NOELIA runs.

ID: 32684
Thanks all for the explanations; the fix explained by Jacob apparently corrects the times.

ID: 32685
"Thanks all for the explanations; the fix explained by Jacob apparently corrects the times."

Having consulted the usual oracle on that, we think it would be wise to wait a little longer. Although changing DCF will change the displayed estimates for runtime, we don't think it affects the underlying calculations for EXIT_TIME_LIMIT_EXCEEDED. And whatever you change, DCF is re-calculated every time a task exits. If you happen to draw another NOELIA_KLEBEbeta, DCF will go right back up through the roof in one jump when it finishes.

The only solution is a new Beta app installation, with a new set of APRs - and that has to be done on the server.

ID: 32687
"Thanks all for the explanations; the fix explained by Jacob apparently corrects the times."

Richard is right, as usual. I'm going to take my computer off the beta queue until a new app and new APRs are in place; too much computing power is being wasted.

ID: 32688
"Having consulted the usual oracle on that, we think it would be wise to wait a little longer."

<Waiting>1</Waiting> - hope it's not for a long time, and thanks for the help.

+1
____________

ID: 32689
Jacob - the beta testing is over now. 8.11 is the final revision and is out now on beta and short. Now, I know these wrong estimates have been a cause of frustration for you, but in fact the WUs haven't been going to waste - they've been doing enough work to help me fix the bugs I was looking at. MJH

ID: 32692
MJH:

ID: 32693
From a BOINC credit perspective, yes. But note that the completed WUs were receiving a generous award, which ought to have been some compensation. Importantly though, from a development perspective they weren't wasted. The failures I was interested in happened very quickly after start-up. If the WU ran long enough for MTE, it had run long enough to accomplish its purpose.

Yes, of course! We've not tried this method of live debugging using short WUs before and weren't expecting this unfortunate side-effect. Next time the fpops estimate will be dialled down appropriately.

MJH

ID: 32694
Thank you. That is what I/we needed to hear. We understand it's a bit of a learning experience, since you were trying a new way to weed out errors and move forward. I'm glad you know more about this issue:

- APRs and app versions
- How they affect fpops estimates
- How fpops bound ends up affecting Maximum Time Exceeded
- How the client keeps track of estimation using a project-wide variable [Duration Correction Factor (DCF)] to show estimated times in the UI

Next time, I'm sure it'll go much more smoothly :) Thanks for your responses.

ID: 32695
I was just about to suggest that we wait until this urgent fine-tuning of the app was complete, but I see we've reached that point already if you're happy with v8.11

ID: 32698
"Will disconnecting and re-attaching to the project force a reset?"

Yes - 42 and 55 DLLs are both delivered irrespective of the app version. It's a side-effect of our deployment mechanism. Will probably fix it later.

MJH

ID: 32702
Yes, I believe it will, but the user loses all the local stats for the project, plus any files that had been downloaded. For resetting DCF, I prefer to close BOINC, carefully edit the client_state.xml file, then reopen BOINC.

ID: 32703
I just got the following task:

ID: 32704
For DCF - detach/re-attach will fix it, as will a simple 'Reset project' from BOINC Manager. Both routes will kill any tasks in progress, and will force a re-download of applications, DLLs and new tasks. People would probably wish to wait for a pause between jobs to do this: set 'No new tasks'; complete, upload and report all current work; and only then reset the project.

For APR - 'Reset project' will do nothing, except kill tasks in progress and force the download of new ones. Detach/re-attach might help, but the BOINC server code in general tries to re-assign the previous HostID to a re-attaching host (if it recognises the IP address, Domain Name, and hardware configuration). If you get the same HostID, you get the APR values and other application details back, too. There are ways of forcing a new HostID, but they involve deliberately invoking BOINC's anti-cheating mechanism by falsifying the RPC sequence number.

ID: 32705
Richard,

ID: 32706
The last KLEBE beta WUs are being deleted now.

ID: 32708
"Richard, ..."

Yes, that's the only way I know from a user perspective. There is supposed to be an Application Reset tool on the server operations web-admin page, but I don't know if it can be applied selectively: the code is here http://boinc.berkeley.edu/trac/browser/boinc-v2/html/ops/app_reset.php but there's no mention of it on the associated Wiki page http://boinc.berkeley.edu/trac/wiki/HtmlOps

I'd advise consulting another BOINC server administrator before touching it: Oliver Bock (shown as the most recent contributor to that code page) can be contacted via Einstein@home or the BOINC email lists, and is normally very helpful.

ID: 32710
"The last KLEBE beta WUs are being deleted now."

Will that affect tasks in progress? :P I've just started one, with suitably modified <rsc_fpops_bound> (of course).

ID: 32711
While I expect you jest, for the rest of those who might be reading this thread: recently some work was aborted by mistake - tasks in progress as well as not-yet-started work. However, it was mentioned that a mechanism to avoid this will be used in the future.

ID: 32712
Huh, I thought all the TEST18s were done already.

ID: 32713
Created 4 Sep 2013 | 20:18:12 UTC

ID: 32714
I did a project reset, but the estimated time for a new WU afterwards is still wrong. One that runs in 1.5 minutes was estimated at 1h45m, and a SANTI SR at 7h40m52s. That one will be faster - it has already done 3% in 5 minutes.

ID: 32715
I thought the problem had been solved, but now I don't know.

```
063px55-NOELIA_KLEBEbeta-2-3-RND9896_0  4752598  4 Sep 2013 17:41:12 UTC  6 Sep 2013  4:58:59 UTC  Completed and validated  106,238.51  11,702.89  119,000.00  ACEMD beta version v8.11 (cuda55)
124-MJHARVEY_CRASH1-0-25-RND3516_1      4754388  5 Sep 2013 16:39:18 UTC  6 Sep 2013  4:58:21 UTC  Completed and validated   17,615.38   7,935.19   18,750.00  ACEMD beta version v8.13 (cuda55)
139-MJHARVEY_CRASH2-1-25-RND6442_0      4756103  6 Sep 2013  4:49:11 UTC  6 Sep 2013  7:51:44 UTC  Completed and validated   10,781.12  10,669.78   18,750.00  ACEMD beta version v8.13 (cuda42)
196-MJHARVEY_CRASH2-1-25-RND1142_1      4756328  6 Sep 2013  7:51:00 UTC  6 Sep 2013 10:49:48 UTC  Completed and validated   10,559.78  10,479.89   18,750.00  ACEMD beta version v8.13 (cuda55)
149-MJHARVEY_CRASH2-1-25-RND2885_0      4756187  6 Sep 2013  4:53:45 UTC  6 Sep 2013 11:00:47 UTC  Completed and validated   21,826.75   5,600.89   18,750.00  ACEMD beta version v8.13 (cuda55)
```

ID: 32779
All of the CRASH tasks are exact copies of SANTI-MAR4222s

ID: 32782
Can you answer my question about the apps?

ID: 32783
rsc_fpops_est is the same as when they went out on acemdshort.

ID: 32784
Oh. I only know a little, but I'll share what I know. Richard knows a ton about it.

ID: 32785
"All of the CRASH tasks are exact copies of SANTI-MAR4222s"

That sounds promising. My 660 had big problems with SANTI's SR and LR, but so far all the CRASH betas I got (4) finished with a good result.
____________
Greetings from TJ

ID: 32789
Thanks Jacob,

ID: 32792
"Oh. I only know a little, but I'll share what I know. Richard knows a ton about it."

That's pretty much it. Some comments:

Host names, as in your "view the details of RacerX", are only visible to the machine owner when logged in to their account. All other users - including the project staff - can only see the 'HostID' number, so it's better to quote (or even link) that.

<rsc_fpops_est> is a property of the workunit, and hence of all tasks (including resends) generated from it. Workunits exist as entities in their own right - there's no such thing as a 'v8.11 workunit' - although the copy that got sent to your machine might well have appeared as a 'v8.11 task'. But another user might have got it as v8.06 or v8.13 - depends which was active at the time the task was allocated to the host in question.

If the test tasks (the current 'CRASH' series) are copies of SANTI-MAR4222, they will be long enough - as I think I've already said somewhere - not to cause any timing problems. A bit of distortion, sure - DCF should rise to maybe 4, but still in single figures, which will clear by itself. The problem with hugely-distorted runtime estimates arose from the doctored 'TEST' workunits, some of which only ran for one minute while still carrying a <rsc_fpops_est> more appropriate for 10 hours. So long as any of those remain in the system, we could get recurrences - whichever version of the Beta app is deployed at the time a task for the WU is issued to a volunteer.

On my Beta host, it looks as if estimates for v8.11 are thoroughly borked: I suspect they will be for all active participants. If anyone still has any tasks issued with that version, they may have problems running them - and if they get aborted, and re-generated by BOINC (i.e., if there are any problems with the WU or task cancellation on the server), then the later Beta versions may get 'poisoned' too. But for the time being, v8.12 and v8.13 look clean for me.

@ Matt - don't feel bad about not understanding APR. *Nobody* understands APR and everything that lies behind it. Except possibly David Anderson (who wrote it), and we're not even sure about him. Grown men (and women) have wept when they tried to walk the code...

Best to wait and watch, I think, and see if the issues clear themselves up as the Beta queue tasks pass through the system and into oblivion.

ID: 32794
"Thanks Jacob,"

Right, I get that. But when that happens, it "ruins" the APR for the given app-version. So I guess what I was getting at is: which beta app-version is the first one that couldn't possibly have been ruined? It's okay if you don't have an answer.

Edit: It sounds as if Richard is saying that the task could get reissued into an app-version and poison it, so I guess my question is a bit invalid. Anyway, after processing that 8.11 and seeing DCF/estimates jacked, I (again) closed BOINC, edited my client_state.xml file to reset the DCF, and restarted BOINC. Hopefully I don't get any more tasks that ruin app-versions.

Thanks,
Jacob

ID: 32795
If I understand what you are saying, it must be 8.12, since that was the first to do only CRASH WUs.

MJH

ID: 32796
Thank you Matt and Richard.

ID: 32797
Not very sure if you still want this info. Maybe you could be more precise:

ID: 33944