Message boards : Multicore CPUs : "This computer has finished a daily quota of 32 tasks"
My i7-8700 is left with nothing to do.
ID: 50601
My Ryzen 1700 is still busy with plenty of QC tasks… and there are many more in the queue. How can it be that your 8700 doesn't get any? This system is also a Linux-based one, is it not?
ID: 50602
Yes, that is the point: there are plenty of tasks available. It seems that they just place a limit on them. I think it is meant to guard against machines that produce a lot of errors, but mine doesn't. I think the limit should be increased.
ID: 50603
If somebody has an idea of where the daily quota limit is set, I'd like to hear it.
ID: 50604
As you probably know, there was some discussion of it earlier, though it does not tell you much.
ID: 50605
Jim,
ID: 50606
Your last work unit reported at 07:13 with an error. I don't see any others after that. Is it possible that the server put your machine in "time out" until you report a new work unit that validates? That could be it, but I don't know. If so, they need to increase the limit, or machines will be idle too often. I don't know of any other project that shuts down the supply of work after only one error (which could happen for a variety of reasons).

EDIT: I keep a 0.1 + 0.5 day buffer on all my machines, which is the default. It seems to be the reverse of yours, but it should not matter much.

Second EDIT: There are a couple of errors. They say:

CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://repo.anaconda.com/pkgs/pro/linux-64/repodata.json.bz2>

I think this must be due to the intermittent connections and timeouts I get with GPUGrid. There may be no cure for that, but at least they could increase whatever error limits they have.
ID: 50607
I increased the daily quota because the new QC jobs are short. Failures and successes will cause the quota to go up and down for your host, as per the BOINC heuristics.
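Concretely, the stock BOINC server heuristic works roughly like this (a minimal sketch of the documented behavior, not our exact server code; the cap of 32 matches the thread title):

```python
# Sketch of BOINC's per-host daily-quota heuristic (as described in the
# BOINC scheduler documentation; not GPUGRID's actual server code).

MAX_DAILY_QUOTA = 32  # project-configured cap per host

def update_quota(quota, result_valid):
    """A valid result doubles the quota (capped at the project maximum);
    an error or invalid result decrements it (floor of 1)."""
    if result_valid:
        return min(quota * 2, MAX_DAILY_QUOTA)
    return max(quota - 1, 1)

# Example: a run of conda download failures drags the quota down fast,
# which is how a host can end up idle at a "daily quota of 4".
quota = MAX_DAILY_QUOTA
for _ in range(28):
    quota = update_quota(quota, False)
print(quota)  # -> 4
```

So one bad batch can throttle a host for a day, but a few validated results restore the quota quickly.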
ID: 50609
OK, I will try it again later and see how it goes.
ID: 50610
Found my R7 1700 system idling after hitting a daily quota of 4. Why would so many WUs fail?
ID: 50611
Found my R7 1700 system idling after hitting a daily quota of 4. Why would so many WUs fail?

CondaHTTPError: HTTP 503 SERVICE UNAVAILABLE: BACK-END SERVER IS AT CAPACITY for url

I've been seeing that in the last few errors I've had. Not sure what it means.
ID: 50612
Found my R7 1700 system idling after hitting a daily quota of 4. Why would so many WUs fail?

Good question. But it makes it difficult to devote an entire PC to this project. You need to be running something else in case your quota is hit. I hope they can fix it.
ID: 50613
@Zalster: That means conda was getting too many download requests from users, so it refused to download the packages on your machine at that moment. It should work next time, I assume.
ID: 50614
My system gets random CondaHTTPErrors as well. From a layman's perspective this seems to be a bottleneck.
ID: 50615
@Zalster: That means conda was getting too many download requests from users, so it refused to download the packages on your machine at that moment. It should work next time, I assume.

Yes it did, but in the meantime 40 QC units "erred out". The only thing that saved me from a "time out" is that I had more QC units in the cache that validated later and helped me avoid being locked out. I agree, it does seem like a bottleneck. If and when the Windows QC app goes mainstream, I would expect to see a huge spike in these "errors" and lockouts.
ID: 50616
Indeed, the new short WUs probably contact the conda cloud too often. Even if there is no download, just checking for new versions (which I don't think we can really avoid) triggers the block. We may need to recreate the WUs as larger blocks.
ID: 50617
Indeed, the new short WUs probably contact the conda cloud too often.

I was about to say the same thing, though on a different basis. My Ryzen 1700, running two work units (2 cores each), has no problem with the Conda server, but each work unit usually runs over 30 minutes. My i7-8700 was churning through them in 10 minutes (or less), and got the errors. I think we need to back off somehow, and larger work units make sense to me.
ID: 50618
I'll look into making the WUs larger next week. Over the weekend I don't want to break stuff, so it will keep on running as is, sorry.
ID: 50619
Can you give me an estimated runtime of these WUs, so I know how many of them to pack together?
ID: 50620
Can you give me an estimated runtime of these WUs, so I know how many of them to pack together?

Hello Stefan, linked below is my R7 1700 system running at 3.9 GHz with 2933 MHz RAM. You can see all of the run times.

http://www.gpugrid.net/results.php?hostid=424454
ID: 50621
Can you give me an estimated runtime of these WUs, so I know how many of them to pack together?

Don't know if this link will work, but here's a list of my CPU tasks:

http://www.gpugrid.net/results.php?userid=103037&offset=0&show_names=0&state=0&appid=30

Edit: I run 4 threads per work unit, currently only one work unit per machine, on two machines.
ID: 50623
Here is my i7-8700
ID: 50625
Very interesting comparison of run times. I run Intel myself; Ryzen seems to struggle.
ID: 50626
Ok, thanks for the reports! The problem is that the WU runtime scales quadratically with the number of electrons in the molecule, so larger molecules will take longer. But I assume I can go to at least 5x the current length for this batch.
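In rough numbers (an illustrative sketch only, assuming a clean quadratic cost in the electron count N_e):

```latex
t(N_e) \approx c\,N_e^{2}
\quad\Rightarrow\quad
\frac{t(2N_e)}{t(N_e)} \approx 4
```

So doubling the molecule size roughly quadruples the runtime, while packing 5 conformations of the same molecule into one WU only multiplies it by about 5. That is why the batch size can be raised safely for small molecules but not for large ones.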
ID: 50630
I run Intel myself; Ryzen seems to struggle.

My i7-8700 was running 4 cores per work unit, whereas my Ryzen 1700 was running only 2 cores per work unit. And the Ryzen has 16 virtual cores, while the i7-8700 has only 12, so you would expect more per core from the Intel. Still, I agree that Intel is a little faster, though not by a large amount. I would be comfortable using either or both.
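For anyone wanting to reproduce this kind of setup: the threads-per-task and concurrency are set client-side with an app_config.xml in the GPUGRID project directory. A minimal sketch; the app name "QC" is an assumption, so check the actual name in the BOINC event log:

```xml
<!-- app_config.xml in the projects/www.gpugrid.net/ directory.
     Sketch only: the app name below is assumed; verify it in the
     client's event log or client_state.xml. -->
<app_config>
  <app>
    <name>QC</name>
    <max_concurrent>2</max_concurrent> <!-- at most 2 QC tasks at once -->
  </app>
  <app_version>
    <app_name>QC</app_name>
    <avg_ncpus>4</avg_ncpus>           <!-- CPUs budgeted per task -->
  </app_version>
</app_config>
```

The client re-reads this via the Manager's "Read config files" command, so no restart is needed.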
ID: 50632
The following is largely anecdotal, but I've found that one 4-core task is more efficient than two 2-core tasks. After 1 hour, the 4-core task had accumulated (slightly) more credit, which includes start-up time for each task and so on. My CPU does not support Hyper-Threading, but that might be worth a separate test if you're looking for the best efficiency.
ID: 50634
With more cores, memory and disk throughput seem especially relevant for QC.

That could be, especially with the new work units. I think we all should test that if possible. Thanks.
ID: 50635
Ok, thanks for the reports! The problem is that the WU runtime scales quadratically with the number of electrons in the molecule, so larger molecules will take longer. But I assume I can go to at least 5x the current length for this batch.

So I just checked, and I see the CPU work units are running longer. The longest so far was 1800 seconds. Are these the new work units you were talking about? Still much shorter than a GPU task. No errors so far (looks around for wood to knock on).
ID: 50652
I am getting errors on QC tasks: "Disk limit exceeded". They are all SELE6.
ID: 50653
I'm starting to see those too. Just had 4 of them error out on my machine.
ID: 50664
+1
ID: 50665
I am running SETI@home and Einstein@home on both Linux boxen and also on a Ulefone smartphone with Android 7.1.1, and Atlas@home on my Windows 10 PC. Goodbye GPUGRID.
ID: 50668
The last three QC tasks have all erred for me with "Disk usage limit exceeded" also. It is time to give it a rest until they can get it fixed, hopefully soon.
ID: 50669
The last three QC tasks have all erred for me with "Disk usage limit exceeded" also. It is time to give it a rest until they can get it fixed, hopefully soon.

Yes, it appears to be getting worse. Almost all are erring out now. I say almost all; a half dozen have finished that previously erred on others' machines.
ID: 50670
I am getting a lot of the "Disk usage limit exceeded" errors now. I was getting a few several days ago, but now nearly all of them error out. It is unclear whether the error message refers to disk capacity or to exceeding some rate of disk writes/reads. It would be nice if the project folks would let us know what causes the error and whether there is anything we can do to reduce the probability of encountering it.
ID: 50671
I finally figured out a way to keep the Linux disk cache from eating all my RAM and leaving less than 1% free, but even keeping at least 4% of RAM free doesn't stop the thrashing. Maybe if I spring for 32 GB on the 8-core machines (currently 16 GB each), the thrashing will be reduced, but that won't help the disk errors.

I have a large write cache on all my Ubuntu machines, basically to protect the SSDs from the high write rates of some projects (not QC). Out of 32 GB of memory on my Ryzen 1700, I set aside about 8 GB for a write cache, with a 2-hour latency. That allows all the writes to go to main memory. It also cuts down on the amount written to the SSD if a given memory location is over-written before the 2-hour latency period has expired. Each time I check, there are always several GB of memory free, or at least available. So, along with about 180 GB free on my SSD, I should not be exceeding any disk limits. But I allow a maximum of four work units to run at a time (using an app_config.xml); if I cut it down to two at a time, that might work, though I expect that the real problem is something else.
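For reference, the "write cache with a 2-hour latency" is just the kernel's dirty-page cache, tuned via sysctl. A sketch matching the numbers above (illustrative values, not a recommendation; the background threshold is my own choice, not from the post):

```ini
# /etc/sysctl.d/99-writecache.conf -- sketch matching the post above.
# Allow up to ~8 GB of dirty (unwritten) data in RAM:
vm.dirty_bytes = 8589934592
# Start background writeback at ~4 GB (assumed; must be below dirty_bytes):
vm.dirty_background_bytes = 4294967296
# Let dirty pages age up to 2 hours (720000 centiseconds) before flushing:
vm.dirty_expire_centisecs = 720000
```

Apply with `sudo sysctl --system`. The obvious caveat: anything still in the cache is lost on a power failure.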
ID: 50672
Please delete. Each time I edit something, it posts a new message.
ID: 50673
Yeah, I just stopped accepting new QC work units until they figure out what the problem is.
ID: 50674
172,128 QC ready to send, 48 users. No comment.
ID: 50675
172,128 QC ready to send, 48 users. No comment.

This imbalance will not change as long as there is no Windows app for QC. Too bad that it's so difficult to come up with one :-(
ID: 50676
And now this:

Thu 11 Oct 2018 10:10:44 AM CDT | GPUGRID | Aborting task 123_35_37_39_42_da3ae375_n00001-SDOERR_SELE6-0-1-RND5707_4: exceeded disk limit: 59944.94MB > 57220.46MB

Looks like the project admins need to make some adjustments.
ID: 50677
Yes, I have also reluctantly set my preferences not to accept any more production QC tasks until the disk usage problem is identified and eliminated. I had 6 machines with 16 threads each (half of the available threads) on the project, but all these WUs are doing is thrashing my machines and producing errors after an hour or so of wasted CPU time. I am configured to run QC beta should any fixes be attempted.
ID: 50678
Tried two more QC tasks; they both fail the same way. Complete silence from the admins.
ID: 50679
Decided to give it another try. Rough estimate: 1 valid for every 4 errors.
ID: 50681
Going back to Aug 24, my QC completion record shows 194 errors out of 454 QC WUs processed. That is about a 42.7% failure rate, and a small random sampling of the error causes reveals that almost all are due to "disk usage limit exceeded". It would be nice to get an explanation of what specifically this error means.
ID: 50682
I am running 4 other BOINC projects, on both Linux and Windows 10. Some also use GPUs; some don't but use VirtualBox, so I have broad experience with all kinds of errors. But all of them give me feedback from admins or from other volunteers with similar experiences. Here, only silence.
ID: 50683
This error has occurred on some other projects where the task's disk usage went past a limit set by the app. It wasn't a limit on the PC running the task.
ID: 50685
Of the few validating, this is the biggest so far.
ID: 50688
Ok, I will ask Toni if he can increase the WU disk space. But at some point we will just fill your whole disk, it seems...
ID: 50696
Ok, I will ask Toni if he can increase the WU disk space. But at some point we will just fill your whole disk, it seems...

Thanks Stefan. I increased my SSD to 500 GB, so I'm good for now, but it's always easier to clone the OS to a larger SSD than to a smaller one. If we start to hit a limit on the SSD, then I guess I have an excuse to look at a 1 TB SSD hahaha...
ID: 50698
We decided to cancel them for the moment. I might redesign them at a later point and send more sensible WUs out. Sorry for the trouble; I was a bit absent these days, working on finishing up a project.
ID: 50699
I am making some new ones now to send out maybe by tomorrow.
ID: 50700
I am making some new ones now to send out maybe by tomorrow.

Very good. But please don't compromise the project for that. I usually have at least 180 GB free these days with the 256 GB SSDs. I know not everyone can do that, so you could try separating them into small and large.
ID: 50701
It's okay, because we run two separate QM projects. So I'll stop the SELE WUs and restart the QMML ones, but with larger batch sizes now, to avoid spamming the conda server and getting blocked.
ID: 50702
I received a notice from the server that QC tasks require 77 GB. But I have more than 700 GB available to BOINC on my two Linux boxen. No other BOINC project requires that much space.
ID: 50703
The new ones won't really require any disk space because they are smaller molecules. Although I don't know how BOINC handles it, i.e. whether it reserves the maximum space per WU or just kills the WU if it exceeds the max space.
ID: 50704
I received a notice from the server that QC tasks require 77 GB. But I have more than 700 GB available to BOINC on my two Linux boxen. No other BOINC project requires that much space.

There is a disk size limit set by the server for tasks. The error is not that your own physical disk is out of space; the task filled its allotted amount of space.

The new ones won't really require any disk space because they are smaller molecules. Although I don't know how BOINC handles it, i.e. whether it reserves the maximum space per WU or just kills the WU if it exceeds the max space.

The latter. Memory and disk usage will grow while crunching, as the task requires, until completion, until it reaches that limit, or until it reaches the BOINC Manager disk-limit percentage set in preferences.
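For the technically curious: that per-task limit lives in the server-side workunit template as `<rsc_disk_bound>`, a standard BOINC workunit field. The exact GPUGRID template is an assumption, but the arithmetic matches the log line quoted earlier:

```xml
<!-- Server-side workunit template fragment (sketch).
     6e10 bytes / 2^20 = 57220.46 MB, exactly the limit quoted in
     the "exceeded disk limit" abort message earlier in the thread. -->
<workunit>
  <rsc_disk_bound>60000000000</rsc_disk_bound>
</workunit>
```

The client aborts the task as soon as the slot directory's usage crosses that bound, which is why the error appears no matter how much free space the volunteer's disk actually has.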
ID: 50705
They work for me. I have downloaded 24 work units (a quota limit), and they take only 1 GB of disk space in total. I am running three at a time (4 cores each), and they are using only about 1 GB of memory each.
ID: 50706
No problems on my computers. They seem to be running OK.
ID: 50708
Great! Since the last QMML ones were so short that they spammed the conda server, I made these 5 times larger (so you calculate up to 50 conformation energies in each WU; in some cases fewer, if I didn't have 50).
ID: 50709
It seems they also give a correct progress figure, not the usual 10%.
ID: 50711
It seems they also give a correct progress figure, not the usual 10%.

Indeed, progress is computed from the fraction of conformations completed.
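In formula form, that plausibly means something like the following (an inference from the numbers reported in the next post, not the project's published code), with a fixed 10% for setup:

```latex
\text{progress} = 0.10 + 0.90 \cdot \frac{n_{\text{done}}}{n_{\text{total}}}
```

With 50 conformations per WU, each completed conformation adds 1.8%, which matches the observation below.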
ID: 50712
Since I restarted QC, the QMML50 units are running great. No memory or disk limits reached or exceeded. So far, 32 in progress and 6 of 6 successful completions. They seem to follow the 10% + 1.8% × 50 = 100% progression. Times so far range from a little less than 40 minutes to nearly two hours per WU, with the longer completions on slower machines running two threads at 2 GHz and the others running 4 threads at 4 GHz.
ID: 50713
I mean, the larger problematic ones are obviously of more interest, but I believe we should rethink the design of those large molecules and maybe try to break them down into their constituent components, so that they don't take half a hard drive to compute. It's not trivial, but it seems necessary if we want to keep running them on GPUGRID.
ID: 50715
Thank you all in any case for hanging on through all the troubles of the large WUs.

As long as we know you are working on it, we will work on it. A few people could do the large ones; SSDs are cheap these days. Though if you need 1,000 crunchers on them, then I think you are right that you will need to break them down.
ID: 50716
I mean, the larger problematic ones are obviously of more interest, but I believe we should rethink the design of those large molecules and maybe try to break them down into their constituent components

In case the larger ones are of more interest to you... and therefore more valuable for science... I would be interested in them even so, and would upgrade my machines accordingly. Maybe you could split QC into long and short runs, like the GPU jobs? That would make it possible to choose.
____________
I would love to see HCF1 protein folding and interaction simulations to help my little boy... someday.
ID: 50717
Maybe you could split QC into long and short runs, like the GPU jobs? That would make it possible to choose.

+1
ID: 50718
I mean, the larger problematic ones are obviously of more interest, but I believe we should rethink the design of those large molecules and maybe try to break them down into their constituent components, so that they don't take half a hard drive to compute. It's not trivial, but it seems necessary if we want to keep running them on GPUGRID.

Yes, I would also be interested in helping with the larger ones. How large of an SSD are we talking about? 1, 2, 4 terabytes?
ID: 50719
Yes, I would also be interested in helping with the larger ones. How large of an SSD are we talking about? 1, 2, 4 terabytes?

I like people who think big.
ID: 50720
Well, not really SSD; an HDD would be just fine. Right now scratch files are kept wherever BOINC's "slot" directory is (as expected from a well-behaved application).
ID: 50722
Yes, I would also be interested in helping with the larger ones. How large of an SSD are we talking about? 1, 2, 4 terabytes?

+1. Would a 2 TB HDD be enough?
____________
I would love to see HCF1 protein folding and interaction simulations to help my little boy... someday.
ID: 50723
Well, not really SSD; an HDD would be just fine. Right now scratch files are kept wherever BOINC's "slot" directory is (as expected from a well-behaved application).

HDD, cool. I see Amazon has a sale on the WD Red 4 TB NAS hard drive. Time to add one to the machine.
ID: 50724
Not just any SSD will cut it; you'd need the expensive stuff with larger, faster write buffers. Or use several HDDs (each on its own connection, with several BOINC instances to spread the slots around). Really, though, you should look into Optane, but it will cost a fortune.
ID: 50725
The argument that QMML has little value is disturbing to me. Either there is scientific justification or there is not. And please stick with a plan; all these abrupt changes make me question the research goal. I like to feel involved too, but the community shouldn't get to change the course.
ID: 50726
Not just any SSD will cut it; you'd need the expensive stuff with larger, faster write buffers.

If you are referring to lifetime: I run three QC work units at a time (4 cores per WU) on an i7-8700. According to iostat, they are writing about 50 GB/day. That is not excessive; the SSD should last a normal lifetime. As a matter of practice, I also use a write cache (12 GB, 2-hour latency), though that is not really necessary to protect the SSD. But since I have 32 GB of main memory, I like to use it.
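The "normal lifetime" claim is easy to sanity-check (rough figures; the 150 TBW endurance rating is an assumed typical value for a consumer-class SSD, not a measured one):

```latex
50~\tfrac{\text{GB}}{\text{day}} \times 365 \approx 18~\tfrac{\text{TB}}{\text{year}},
\qquad
\frac{150~\text{TBW}}{18~\tfrac{\text{TB}}{\text{year}}} \approx 8~\text{years}
```

So even without the write cache, the drive's rated endurance would likely outlast its useful life.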
ID: 50727
On Oct. 17, Stefan wrote: "The new ones won't really require any disk space because they are smaller molecules." BTW, the requirement of having 57,220.46 MB of disk space available for the CPU tasks is still in effect. Does it need to be?
ID: 50728
We are up to 74 Linux users. The recent QC tasks run well on my two Linux boxen. The one with a GTX 750 Ti GPU board is also running a GPU task alongside a CPU task, all this on an Opteron 1210 of 2008 vintage. This Sun workstation is still my main host.
ID: 50729
The one with a GTX 750 Ti GPU board is also running a GPU task alongside a CPU task.

I find that I can run a GTX 750 Ti on my i7-8700 without having to reserve a core for it, and without any noticeable effect on the QCs running on all 12 cores. It is the efficiency of CUDA: the card shows only about 8 to 16% usage of a CPU core (about 1% of the whole CPU). It also works well on the i7-4770 machine when I use that, with somewhat higher CPU percentages. It makes a great combination, and the 750 Ti gets all the work done in under 24 hours while not drawing much power.
ID: 50731
So I have successfully installed and tested the new WD Red 4 TB NAS HDDs in two of my computers. They are up and running again. So however the project decides to proceed with the larger molecules, I hope to be prepared.
ID: 50737
…however the project decides to proceed with the larger molecules, I hope to be prepared.

My hope is that someday there will be a Windows version of the QC CPU tasks.
ID: 50738
There will be. It's a matter of priorities and time allocation. It's close to 90% complete, but right now we have zero time to dedicate to the Windows build.
ID: 50739
The final 10% always takes 90% of the time.
ID: 50742
Apologies... Looks like I trashed around 23 CPU work units. The internet was down for most of the day and I didn't notice it until just now. Not sure if the internet being down caused the errors or if it was something with the computer, but it appears to be running normally now. Will losing internet after the work units download, but before they start to crunch, cause them to error out if they can't contact the server when they start?
ID: 50856
Long answer: Yes.
ID: 50857