
Message boards : Multicore CPUs : "This computer has finished a daily quota of 32 tasks"

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 50601 - Posted: 26 Sep 2018 | 8:36:15 UTC

My i7-8700 is left with nothing to do.
http://www.gpugrid.net/results.php?hostid=475515

I will put it on Folding.

3de64piB5uZAS6SUNt1GFDU9d...
Joined: 20 Apr 15
Posts: 285
Credit: 1,102,216,607
RAC: 0
Message 50602 - Posted: 26 Sep 2018 | 10:01:23 UTC

My Ryzen 1700 is still busy with plenty of QC tasks… and there are many more in the queue. How can it be that your 8700 doesn't get any? This system is also Linux-based, is it not?
____________
I would love to see HCF1 protein folding and interaction simulations to help my little boy... someday.

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 50603 - Posted: 26 Sep 2018 | 12:57:56 UTC - in response to Message 50602.
Last modified: 26 Sep 2018 | 13:01:38 UTC

Yes, that is the point: there are plenty of tasks available. It seems that they just place a limit on them. I think it is to guard against machines that produce a lot of errors, but mine doesn't. I think the limit should be increased.

Note that my i7-8700 was running QC only, and ran through a lot of them per day. I have a Ryzen 1700 also, but run just four cores on QC (two work units running two cores each). That machine has no problem getting work, and I will let it run. But if they ever want to get their mountain of work done, they will have to let the high-productivity machines get them. The Androids won't do it.

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Message 50604 - Posted: 26 Sep 2018 | 13:46:08 UTC - in response to Message 50603.

If somebody has an idea of where the daily quota limit is set, I'd like to hear it.

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 50605 - Posted: 26 Sep 2018 | 13:54:58 UTC - in response to Message 50604.

As you probably know, there was some discussion of it earlier, though it does not tell you much.
http://www.gpugrid.net/forum_thread.php?id=4823

And Richard Haselgrove (as usual) has the best handle on it:
http://www.gpugrid.net/forum_thread.php?id=4825

Zalster
Joined: 26 Feb 14
Posts: 211
Credit: 4,496,324,562
RAC: 0
Message 50606 - Posted: 26 Sep 2018 | 14:36:14 UTC - in response to Message 50605.

Jim,

Your last work unit reported at 07:13 with an error. I don't see any others after that. Is it possible that the server put your machine in "time out" until you report a new work unit that validates?

I've not seen a limit yet on the CPU work units. I'm running an i7-6950X and it has been steadily busy since I got it running under Ubuntu. I usually get about 24 at a time, which corresponds to my work cache of 0.5 days + 0.1 extra.

Z
____________

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 50607 - Posted: 26 Sep 2018 | 14:45:17 UTC - in response to Message 50606.
Last modified: 26 Sep 2018 | 14:55:29 UTC

Your last work unit reported at 07:13 with an error. I don't see any others after that. Is it possible that the server put your machine in "time out" until you report a new work unit that validates?

That could be it, but I don't know.

If so, they need to increase the limit, or machines will be idle too often. I don't know of any other project that shuts down the supply of work after only one error (which could happen for a variety of reasons).

EDIT: I keep a 0.1 + 0.5 day buffer on all my machines, which is the default. It seems to be the reverse of yours, but it should not matter much.

Second EDIT:
There are a couple of errors. They say:
CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://repo.anaconda.com/pkgs/pro/linux-64/repodata.json.bz2>
Elapsed: -

An HTTP error occurred when trying to retrieve this URL.
HTTP errors are often intermittent, and a simple retry will get you on your way.

I think this must be due to the intermittent connections and timeouts I get with GPUGrid. There may be no cure for that, but at least they could increase whatever error limits they have.
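A minimal Python sketch of the retry-with-backoff idea that Conda hint describes; this is illustrative only, not the application's actual download code, and the function name is made up:

import time
import urllib.request
from urllib.error import URLError

# URL copied from the error message above; any flaky endpoint behaves the same.
URL = "https://repo.anaconda.com/pkgs/pro/linux-64/repodata.json.bz2"

def fetch_with_backoff(url, retries=5, base_delay=2.0):
    # Retry a download with exponential backoff, as the Conda hint suggests.
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read()
        except URLError as exc:
            if attempt == retries - 1:
                raise  # give up after the last attempt
            wait = base_delay * 2 ** attempt  # 2 s, 4 s, 8 s, ...
            print(f"attempt {attempt + 1} failed ({exc}); retrying in {wait:.0f} s")
            time.sleep(wait)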

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Message 50609 - Posted: 26 Sep 2018 | 16:08:43 UTC - in response to Message 50607.

I increased the daily quota because the new QC jobs are short. Failures and successes will cause the quota to go down and up, respectively, for your host, as per BOINC's heuristics.

The CondaHTTPError is a connection error between your host and the Conda cloud, not GPUGRID.
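For readers unfamiliar with the heuristic Toni mentions: BOINC keeps a per-host daily quota that shrinks on errors and recovers on successes. A toy Python model follows; the exact update rule used here (double on success, halve on error) is an assumption, since BOINC server versions differ in the details:

QUOTA_CAP = 32  # project-configured maximum per host (the value in this thread's title)

def update_quota(quota, task_succeeded):
    if task_succeeded:
        return min(quota * 2, QUOTA_CAP)  # successes restore the quota quickly
    return max(quota // 2, 1)             # errors cut it, but never below 1

quota = QUOTA_CAP
for ok in (False, False, True, True, True):  # two errors, then three successes
    quota = update_quota(quota, ok)
    print(quota)  # 16, 8, 16, 32, 32

This is why a burst of download failures can idle a host for the rest of the day, and why one validated result afterwards recovers it quickly.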

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 50610 - Posted: 26 Sep 2018 | 16:20:59 UTC - in response to Message 50609.

OK, I will try it again later and see how it goes.

PappaLitto
Joined: 21 Mar 16
Posts: 511
Credit: 4,672,242,755
RAC: 0
Message 50611 - Posted: 27 Sep 2018 | 1:04:51 UTC

Found my R7 1700 system idle after hitting its daily quota of 4. Why would so many WUs fail?

Zalster
Joined: 26 Feb 14
Posts: 211
Credit: 4,496,324,562
RAC: 0
Message 50612 - Posted: 27 Sep 2018 | 2:03:50 UTC - in response to Message 50611.

Found my R7 1700 system idle after hitting its daily quota of 4. Why would so many WUs fail?


CondaHTTPError: HTTP 503 SERVICE UNAVAILABLE: BACK-END SERVER IS AT CAPACITY for url

I've been seeing that in the last few errors I've had. Not sure what it means.
____________

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 50613 - Posted: 27 Sep 2018 | 2:06:48 UTC - in response to Message 50611.

Found my R7 1700 system idle after hitting its daily quota of 4. Why would so many WUs fail?

Good question.
But it makes it difficult to devote an entire PC to it. You need to be running something else in case your quota is hit. I hope they can fix it.

Stefan
Project administrator
Project developer
Project tester
Project scientist
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Message 50614 - Posted: 27 Sep 2018 | 12:57:56 UTC - in response to Message 50612.

@Zalster: That means Conda was getting too many download requests from users, so it refused to serve the packages to your machine at that moment. It should work next time, I assume.

AuxRx
Joined: 3 Jul 18
Posts: 22
Credit: 2,758,801
RAC: 0
Message 50615 - Posted: 27 Sep 2018 | 14:33:57 UTC

My system gets random CondaHTTPErrors as well. From a layman's perspective this seems to be a bottleneck.

Are volunteers possibly risking being blacklisted by thrashing the Conda Cloud? Is there another way for the project to distribute/cache the necessary packages?

Zalster
Joined: 26 Feb 14
Posts: 211
Credit: 4,496,324,562
RAC: 0
Message 50616 - Posted: 27 Sep 2018 | 17:25:04 UTC - in response to Message 50614.

@Zalster: That means Conda was getting too many download requests from users, so it refused to serve the packages to your machine at that moment. It should work next time, I assume.


Yes it did, but in the meantime 40 QC units "erred out". The only thing that saved me from a "time out" is that I had more QC units in the cache that validated later and helped me avoid being locked out.

I agree, it does seem like a bottleneck. If and when the Windows QC app goes mainstream, I would expect to see a huge spike in these "errors" and lockouts.
____________

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Message 50617 - Posted: 27 Sep 2018 | 18:52:39 UTC - in response to Message 50616.

Indeed the new short WUs probably contact the conda cloud too often. Even if there is no download, just checking for new versions (which I don't think we can really avoid) triggers the block. We may need to recreate the WUs as larger blocks.

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 50618 - Posted: 27 Sep 2018 | 20:04:13 UTC - in response to Message 50617.
Last modified: 27 Sep 2018 | 20:04:41 UTC

Indeed the new short WUs probably contact the conda cloud too often.

I was about to say the same thing, though on a different basis. My Ryzen 1700, running two work units (2 cores each), has no problem with the Conda server, but each work unit usually runs over 30 minutes. My i7-8700 was churning through them in 10 minutes (or less), and got the errors. I think we need to back off somehow, and larger work units make sense to me.

Stefan
Project administrator
Project developer
Project tester
Project scientist
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Message 50619 - Posted: 28 Sep 2018 | 15:01:50 UTC

I'll look into making the WUs larger next week. Over the weekend I don't want to break anything, so it will keep running as is, sorry.

Stefan
Project administrator
Project developer
Project tester
Project scientist
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Message 50620 - Posted: 28 Sep 2018 | 15:02:22 UTC

Can you give me an estimated runtime of these WUs to know how many of them to pack together?

PappaLitto
Joined: 21 Mar 16
Posts: 511
Credit: 4,672,242,755
RAC: 0
Message 50621 - Posted: 28 Sep 2018 | 15:07:11 UTC - in response to Message 50620.

Can you give me an estimated runtime of these WUs to know how many of them to pack together?

Hello Stefan, linked below is my R7 1700 system running at 3.9 GHz with 2933 MHz RAM. You can see all of the run times.

http://www.gpugrid.net/results.php?hostid=424454

Zalster
Joined: 26 Feb 14
Posts: 211
Credit: 4,496,324,562
RAC: 0
Message 50623 - Posted: 28 Sep 2018 | 15:59:03 UTC - in response to Message 50620.
Last modified: 28 Sep 2018 | 16:01:08 UTC

Can you give me an estimated runtime of these WUs to know how many of them to pack together?



Don't know if this link will work, but here's a list of my CPU tasks:
http://www.gpugrid.net/results.php?userid=103037&offset=0&show_names=0&state=0&appid=30

Edit: I run 4 threads per work unit. Currently only 1 work unit per machine, on 2 machines.

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 50625 - Posted: 28 Sep 2018 | 16:56:21 UTC - in response to Message 50620.
Last modified: 28 Sep 2018 | 16:57:17 UTC

Here is my i7-8700
http://www.gpugrid.net/results.php?hostid=475515

They were often less than 10 minutes each, and I was running three at a time. You could pack 10 of them together as far as I am concerned (or at least 4).

AuxRx
Joined: 3 Jul 18
Posts: 22
Credit: 2,758,801
RAC: 0
Message 50626 - Posted: 28 Sep 2018 | 17:48:36 UTC

Very interesting comparison of run times. I run Intel myself; Ryzen seems to struggle.

Tasks take approx. 10 minutes now, but I'd prefer to crunch tasks under 60 minutes.

Stefan
Project administrator
Project developer
Project tester
Project scientist
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Message 50630 - Posted: 29 Sep 2018 | 8:12:26 UTC

OK, thanks for the reports! The problem is that the WU runtime scales quadratically with the number of electrons in the molecule, so larger molecules will take longer. But I assume I can go to at least 5x the current length for this batch.
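For illustration, here is what packing under that cost model could look like in Python. The calibration constant K and the 30-minute target are invented numbers; this is not the project's actual batching code:

K = 0.02           # seconds per electron^2, hypothetical calibration
TARGET = 30 * 60   # target WU runtime in seconds

def pack(electron_counts):
    # Greedy first-fit: add molecules to a batch until the runtime budget is hit.
    batches, current, cost = [], [], 0.0
    for n in sorted(electron_counts, reverse=True):
        t = K * n * n  # quadratic cost model
        if current and cost + t > TARGET:
            batches.append(current)
            current, cost = [], 0.0
        current.append(n)
        cost += t
    if current:
        batches.append(current)
    return batches

print(pack([300, 200, 120, 80, 60, 45]))  # -> [[300], [200, 120, 80, 60, 45]]

The quadratic term is why one large molecule can cost as much as a whole batch of small ones.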

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 50632 - Posted: 29 Sep 2018 | 21:46:37 UTC - in response to Message 50626.

I run Intel myself; Ryzen seems to struggle.

My i7-8700 was running 4 cores per work unit, whereas my Ryzen 1700 was running only 2 cores per work unit. And the Ryzen has 16 virtual cores, while the i7-8700 has only 12, so you would expect more per core from the Intel. Still, I agree that Intel is a little faster, though not by a large amount. I would be comfortable using either or both.

AuxRx
Joined: 3 Jul 18
Posts: 22
Credit: 2,758,801
RAC: 0
Message 50634 - Posted: 1 Oct 2018 | 16:08:00 UTC - in response to Message 50632.

The following is largely anecdotal, but I've found that one 4-core task is more efficient than two 2-core tasks. After 1 hour, the 4-core setup had accumulated (slightly) more credit, which includes start-up time for each task and so on. My CPU does not support Hyper-Threading, but that might be worth a separate test if you're looking for the best efficiency.

With more cores, memory and disk throughput seem especially relevant for QC.

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 50635 - Posted: 1 Oct 2018 | 16:15:29 UTC - in response to Message 50634.

With more cores, memory and disk throughput seem especially relevant for QC.

That could be, especially with the new work units. I think we all should test that if possible. Thanks.

Zalster
Joined: 26 Feb 14
Posts: 211
Credit: 4,496,324,562
RAC: 0
Message 50652 - Posted: 8 Oct 2018 | 2:37:12 UTC - in response to Message 50630.

Ok thanks for the reports! The problem is that the WU runtime scales quadratically to the number of electrons in the molecule so larger molecules will take longer. But I assume I can go at least 5x the current length for this batch.


So I just checked, and I see the CPU work units are running longer. The longest so far was 1800 seconds. Are these the new work units you were talking about? Still much shorter than a GPU task. No errors so far (looks around for wood to knock on).
____________

tullio
Joined: 8 May 18
Posts: 190
Credit: 104,426,808
RAC: 0
Message 50653 - Posted: 8 Oct 2018 | 7:21:31 UTC
Last modified: 8 Oct 2018 | 7:24:14 UTC

I am getting "Disk limit exceeded" errors on QC tasks. They are all SELE6.
Tullio

Zalster
Joined: 26 Feb 14
Posts: 211
Credit: 4,496,324,562
RAC: 0
Message 50664 - Posted: 9 Oct 2018 | 18:35:01 UTC - in response to Message 50653.

I'm starting to see those too. Just had 4 of them error out on my machine.

AuxRx
Joined: 3 Jul 18
Posts: 22
Credit: 2,758,801
RAC: 0
Message 50665 - Posted: 10 Oct 2018 | 5:44:34 UTC

+1

tullio
Joined: 8 May 18
Posts: 190
Credit: 104,426,808
RAC: 0
Message 50668 - Posted: 10 Oct 2018 | 11:56:12 UTC

I am running SETI@home and Einstein@home on both Linux boxen, and also on a Ulefone smartphone with Android 7.1.1, plus Atlas@home on my Windows 10 PC. Goodbye GPUGRID.
Tullio

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 50669 - Posted: 10 Oct 2018 | 14:53:48 UTC - in response to Message 50653.

The last three QC have all erred for me with "Disk usage limit exceeded" also. It is time to give it a rest until they can get it fixed, hopefully soon.

Zalster
Joined: 26 Feb 14
Posts: 211
Credit: 4,496,324,562
RAC: 0
Message 50670 - Posted: 10 Oct 2018 | 15:50:21 UTC - in response to Message 50669.
Last modified: 10 Oct 2018 | 15:50:41 UTC

The last three QC have all erred for me with "Disk usage limit exceeded" also. It is time to give it a rest until they can get it fixed, hopefully soon.



Yes, it appears to be getting worse. Almost all are erring out now. I say almost all: a half dozen have finished here that previously erred on others' machines.
____________

STARBASEn
Joined: 17 Feb 09
Posts: 91
Credit: 1,603,303,394
RAC: 0
Message 50671 - Posted: 10 Oct 2018 | 16:16:08 UTC

I am getting a lot of the "Disk usage limit exceeded" errors now. I was getting a few several days ago, but now nearly all error out. It is unclear whether the error message refers to disk capacity or to the frequency of disk writes/reads being exceeded. It would be nice if the project folks would let us know why the error occurs and whether there is anything we can do to reduce the probability of encountering it.

Basically, there is no point in continuing to run these WUs, since nearly all are erring and these SELE6 jobs are thrashing all my machines regardless of the number of threads allowed. I finally figured out a way to keep the Linux disk cache from eating all my RAM (it was leaving only <1% free), but even keeping at least 4% of RAM free doesn't stop the thrashing. Maybe if I spring for 32 GB on the 8-core machines, currently with 16 GB each, the thrashing will be reduced, but that won't help the disk errors. Might try it on one machine out of curiosity.

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 50672 - Posted: 10 Oct 2018 | 18:43:37 UTC - in response to Message 50671.

I finally figured out a way to keep the Linux disk cache from eating all my RAM (it was leaving only <1% free), but even keeping at least 4% of RAM free doesn't stop the thrashing. Maybe if I spring for 32 GB on the 8-core machines, currently with 16 GB each, the thrashing will be reduced, but that won't help the disk errors.

I have a large write cache on all my Ubuntu machines, basically to protect the SSDs from the high write rates of some projects (not QC). Out of 32 GB memory on my Ryzen 1700, I set aside about 8 GB for a write cache, with a 2-hour latency. That allows all the writes to go to main memory first. It also cuts down on the amount written to the SSD if a given memory location is overwritten before the 2-hour latency period has expired. Each time I check, there are always several GB of memory free, or at least available.

So, along with about 180 GB free on my SSD, I should not be exceeding any disk limits. But I allow a maximum of four work units to run at a time (using an app_config.xml); if I cut it down to two at a time, that might work, though I expect that the real problem is something else.
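For anyone who wants the same cap, a sketch of the app_config.xml mechanism Jim mentions. The app name "QC" and the "mt" plan class are assumptions; check client_state.xml or the event log for the real names on your host:

<app_config>
  <app>
    <name>QC</name>                      <!-- assumed app name; verify locally -->
    <max_concurrent>4</max_concurrent>   <!-- at most four QC tasks at once -->
  </app>
  <app_version>
    <app_name>QC</app_name>
    <plan_class>mt</plan_class>          <!-- assumed multithread plan class -->
    <avg_ncpus>4</avg_ncpus>             <!-- CPU threads per task -->
  </app_version>
</app_config>

The file goes in the projects/www.gpugrid.net/ directory under the BOINC data directory; re-read the config files (or restart the client) for it to take effect.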

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 50673 - Posted: 10 Oct 2018 | 18:45:20 UTC - in response to Message 50672.
Last modified: 10 Oct 2018 | 18:47:12 UTC

Please delete. Each time I edit something, it posts a new message.

Zalster
Joined: 26 Feb 14
Posts: 211
Credit: 4,496,324,562
RAC: 0
Message 50674 - Posted: 10 Oct 2018 | 20:48:33 UTC

Yeah, I just stopped accepting new QC work units until they figure out what the problem is.
____________

tullio
Joined: 8 May 18
Posts: 190
Credit: 104,426,808
RAC: 0
Message 50675 - Posted: 11 Oct 2018 | 4:07:37 UTC

172,128 QC ready to send, 48 users. No comment.
Tullio

Erich56
Joined: 1 Jan 15
Posts: 1132
Credit: 10,307,632,676
RAC: 29,649,566
Message 50676 - Posted: 11 Oct 2018 | 4:52:37 UTC - in response to Message 50675.

172,128 QC ready to send, 48 users. No comment.
Tullio

This imbalance will not change as long as there is no Windows app for QC.
Too bad that it's so difficult to come up with one :-(

captainjack
Joined: 9 May 13
Posts: 171
Credit: 3,630,307,638
RAC: 18,649,274
Message 50677 - Posted: 11 Oct 2018 | 15:18:16 UTC

And now this:

Thu 11 Oct 2018 10:10:44 AM CDT | GPUGRID | Aborting task 123_35_37_39_42_da3ae375_n00001-SDOERR_SELE6-0-1-RND5707_4: exceeded disk limit: 59944.94MB > 57220.46MB


Looks like the project admins need to make some adjustments.

STARBASEn
Joined: 17 Feb 09
Posts: 91
Credit: 1,603,303,394
RAC: 0
Message 50678 - Posted: 11 Oct 2018 | 16:06:54 UTC

Yes, I have also reluctantly set my preferences not to accept any more production QC tasks until the disk usage problem is identified and eliminated. I had 6 machines with 16 threads (1/2 of the available threads) on the project, but all these WUs are just thrashing my machines and producing errors after an hour or so of wasted CPU time. I am configured to run QC beta should any fixes be attempted.

tullio
Joined: 8 May 18
Posts: 190
Credit: 104,426,808
RAC: 0
Message 50679 - Posted: 12 Oct 2018 | 0:51:25 UTC

Tried two more QC tasks; they both fail the same way. Complete silence from the admins.
Tullio

Zalster
Joined: 26 Feb 14
Posts: 211
Credit: 4,496,324,562
RAC: 0
Message 50681 - Posted: 12 Oct 2018 | 15:36:31 UTC - in response to Message 50679.

Decided to give it another try. Rough estimate: 1 valid for every 4 errors.

Almost all of my validations came after another computer errored out, but not due to size limits. So there are batches of work units out there that don't exceed the limit but fail for other reasons.

However, on my computers, almost all of my errors were size-related. So the original error for this thread still exists.


____________

STARBASEn
Joined: 17 Feb 09
Posts: 91
Credit: 1,603,303,394
RAC: 0
Message 50682 - Posted: 12 Oct 2018 | 17:26:14 UTC

Going back to Aug 24, my QC completion record shows 194 errors out of 454 QC WUs processed. This is about a 42.7% failure rate, and a small random sampling of the error causes shows that almost all are due to "disk usage limit exceeded." It would be nice to get an explanation of what specifically this error means.

tullio
Joined: 8 May 18
Posts: 190
Credit: 104,426,808
RAC: 0
Message 50683 - Posted: 12 Oct 2018 | 18:34:40 UTC

I am running 4 other BOINC projects, on both Linux and Windows 10. Some also use GPUs; some don't, but use VirtualBox, so I have broad experience with all kinds of errors. But all of them give me feedback from admins or other volunteers with similar experiences. Here, only silence.
Tullio

mmonnin
Joined: 2 Jul 16
Posts: 337
Credit: 7,649,537,206
RAC: 10,341,991
Message 50685 - Posted: 12 Oct 2018 | 20:57:10 UTC

The error has occurred on some other projects when disk usage went past a limit set by the application. It wasn't a limit on the PC running the task.

Zalster
Joined: 26 Feb 14
Posts: 211
Credit: 4,496,324,562
RAC: 0
Message 50688 - Posted: 13 Oct 2018 | 18:46:17 UTC

Of the few validating, this is the biggest so far:

http://www.gpugrid.net/workunit.php?wuid=14584472

4,943 seconds, credit 1,279.49.
____________

Stefan
Project administrator
Project developer
Project tester
Project scientist
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Message 50696 - Posted: 16 Oct 2018 | 12:16:53 UTC

OK, I will ask Toni if he can increase the WU disk space. But at some point it seems we will just fill your whole disk...

Zalster
Joined: 26 Feb 14
Posts: 211
Credit: 4,496,324,562
RAC: 0
Message 50698 - Posted: 16 Oct 2018 | 12:50:38 UTC - in response to Message 50696.

OK, I will ask Toni if he can increase the WU disk space. But at some point it seems we will just fill your whole disk...


Thanks Stefan...

I increased my SSD to 500 GB, so I'm good for now, but it's always easier to clone the OS to a larger SSD than to a smaller one. If we start to hit a limit on the SSD, then I guess I have an excuse to look at a 1 TB SSD hahaha...
____________

Stefan
Project administrator
Project developer
Project tester
Project scientist
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Message 50699 - Posted: 16 Oct 2018 | 13:47:29 UTC

We decided to cancel them for the moment. I might redesign them at a later point and send out more sensible WUs. Sorry for the trouble; I was a bit absent these days, finishing up a project.

Stefan
Project administrator
Project developer
Project tester
Project scientist
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Message 50700 - Posted: 16 Oct 2018 | 13:56:49 UTC

I am making some new ones now to send out maybe by tomorrow.

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 50701 - Posted: 16 Oct 2018 | 14:44:59 UTC - in response to Message 50700.

I am making some new ones now to send out maybe by tomorrow.

Very good. But please don't compromise the project for that. I usually have at least 180 GB free these days with the 256 GB SSDs.
I know not everyone can do that, so you could try separating them into small and large.

Stefan
Project administrator
Project developer
Project tester
Project scientist
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Message 50702 - Posted: 16 Oct 2018 | 14:57:03 UTC

It's okay, because we run two separate QM projects. So I'll stop the SELE WUs and restart the QMML ones, but with larger batch sizes now, to avoid spamming the Conda server and getting blocked.

tullio
Joined: 8 May 18
Posts: 190
Credit: 104,426,808
RAC: 0
Message 50703 - Posted: 17 Oct 2018 | 8:19:44 UTC - in response to Message 50696.

I received a notice from the server that QC tasks require 77 GB. But I have more than 700 GB available to BOINC on my two Linux boxen. No other BOINC project requires that much space.
Tullio

Stefan
Project administrator
Project developer
Project tester
Project scientist
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Message 50704 - Posted: 17 Oct 2018 | 8:47:22 UTC - in response to Message 50703.

The new ones won't really require any disk space because they are smaller molecules. Although I don't know how BOINC handles it, i.e. whether it reserves the maximum space per WU or just kills the WU if it exceeds the max space.

mmonnin
Joined: 2 Jul 16
Posts: 337
Credit: 7,649,537,206
RAC: 10,341,991
Message 50705 - Posted: 17 Oct 2018 | 10:23:58 UTC - in response to Message 50703.
Last modified: 17 Oct 2018 | 10:27:21 UTC

I received a notice from the server that QC tasks require 77 GB. But I have more than 700 GB available to BOINC on my two Linux boxen. No other BOINC project requires that much space.
Tullio


There is a disk size limit set by the server for tasks. The error is not that your own physical disk is out of space. The task filled its allotted amount of space.

The new ones won't really require any disk space because they are smaller molecules. Although I don't know how BOINC handles it, i.e. whether it reserves the maximum space per WU or just kills the WU if it exceeds the max space.


The latter.

Memory and disk usage will grow while crunching, as the task requires, until it completes, hits the task's limit, or hits the BOINC Manager disk-limit percentage set in preferences.
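In other words, the bound is checked against what the task writes into its BOINC slot directory. A rough Python sketch of that kind of check follows; this is not BOINC's actual code, and the slot path is illustrative:

import os

def slot_dir_usage(slot_dir):
    # Sum the sizes of all files under a BOINC slot directory.
    total = 0
    for root, _dirs, files in os.walk(slot_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.isfile(path):
                total += os.path.getsize(path)
    return total

DISK_BOUND_MB = 57220.46  # the per-WU bound from captainjack's log above

if slot_dir_usage("/var/lib/boinc-client/slots/0") > DISK_BOUND_MB * 1024 * 1024:
    print("exceeded disk limit: task would be aborted")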

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 50706 - Posted: 17 Oct 2018 | 11:49:39 UTC - in response to Message 50705.

They work for me. I have downloaded 24 work units (a quota limit), and they take only 1 GB disk space in total. I am running three at a time (4 cores each), and they are taking only about 1 GB memory each.

They run for about 40 to 60 minutes on my i7-8700, which is ideal.

Zalster
Joined: 26 Feb 14
Posts: 211
Credit: 4,496,324,562
RAC: 0
Message 50708 - Posted: 17 Oct 2018 | 12:44:50 UTC

No problems on my computers. Seem to be running ok.
____________

Stefan
Project administrator
Project developer
Project tester
Project scientist
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Message 50709 - Posted: 17 Oct 2018 | 14:10:25 UTC
Last modified: 17 Oct 2018 | 14:10:37 UTC

Great! Since the last QMML ones were so short that they spammed the Conda server, I made these 5 times larger (so you calculate up to 50 conformation energies in each WU; in some cases fewer, if I didn't have 50).
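The batching itself is simple; a sketch in Python, with names invented for illustration:

def chunk(conformations, size=50):
    # Split a list into consecutive batches of at most `size` items.
    return [conformations[i:i + size] for i in range(0, len(conformations), size)]

print([len(b) for b in chunk(list(range(120)))])  # -> [50, 50, 20]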

tullio
Joined: 8 May 18
Posts: 190
Credit: 104,426,808
RAC: 0
Message 50711 - Posted: 18 Oct 2018 | 13:11:31 UTC - in response to Message 50709.

It seems also they give a correct progress figure, not the usual 10%.
Tullio

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Message 50712 - Posted: 18 Oct 2018 | 13:58:48 UTC - in response to Message 50711.

It seems also they give a correct progress figure, not the usual 10%.
Tullio


Indeed, progress is computed from the fraction of conformations completed.

STARBASEn
Joined: 17 Feb 09
Posts: 91
Credit: 1,603,303,394
RAC: 0
Message 50713 - Posted: 18 Oct 2018 | 17:54:43 UTC

Since I restarted QC, the QMML50 are running great. No memory or disk limits reached or exceeded. So far, 32 in progress and 6 of 6 successful completions. They seem to follow the 10% + 1.8% × 50 = 100% pattern. Times so far range from a little under 40 minutes to nearly two hours per WU, with the longer completions on slower two-thread 2 GHz machines and the others running 4 threads at 4 GHz.
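That pattern matches a fixed 10% setup phase with the remaining 90% spread evenly over 50 conformations (0.90 / 50 = 1.8% each). As a sketch:

def progress(done, total=50):
    # 10% fixed startup, then the remaining 90% split evenly per conformation.
    return 0.10 + 0.90 * (done / total)

print(progress(0))   # 0.10 -> the old "stuck at 10%" display
print(progress(25))  # 0.55
print(progress(50))  # 1.00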

Stefan
Project administrator
Project developer
Project tester
Project scientist
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Message 50715 - Posted: 19 Oct 2018 | 8:41:59 UTC
Last modified: 19 Oct 2018 | 8:43:08 UTC

I mean, the larger problematic ones are obviously of more interest, but I believe we should rethink the design of those large molecules and maybe try to break them down into their constituent components so that they don't take half a hard drive to compute. It's not trivial, but it seems necessary if we want to keep running them on GPUGRID.
Thank you all in any case for hanging on through all the troubles of the large WUs.

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 50716 - Posted: 19 Oct 2018 | 13:59:14 UTC - in response to Message 50715.

Thank you all in any case for hanging on through all the troubles of the large WUs.

As long as we know you are working on it, we will work on it.

A few people could do the large ones; SSDs are cheap these days. Though if you need 1000 crunchers on them, then I think you are right that you will need to break them down.

3de64piB5uZAS6SUNt1GFDU9d...
Joined: 20 Apr 15
Posts: 285
Credit: 1,102,216,607
RAC: 0
Message 50717 - Posted: 19 Oct 2018 | 18:11:37 UTC - in response to Message 50715.

I mean, the larger problematic ones are obviously of more interest, but I believe we should rethink the design of those large molecules and maybe try to break them down into their constituent components


In case the larger ones are of more interest for you... and therefore more valuable for science... I would be interested in them even so, and would upgrade my machines accordingly.

Maybe you could split QC into long and short runs like the GPU jobs? That would make it possible to choose.
____________
I would love to see HCF1 protein folding and interaction simulations to help my little boy... someday.

PappaLitto
Joined: 21 Mar 16
Posts: 511
Credit: 4,672,242,755
RAC: 0
Message 50718 - Posted: 19 Oct 2018 | 19:26:03 UTC - in response to Message 50717.

Maybe you could split QC into long and short runs like the GPU jobs? That would make it possible to choose.


+1

Zalster
Joined: 26 Feb 14
Posts: 211
Credit: 4,496,324,562
RAC: 0
Message 50719 - Posted: 20 Oct 2018 | 13:43:44 UTC - in response to Message 50715.

I mean, the larger problematic ones are obviously of more interest, but I believe we should rethink the design of those large molecules and maybe try to break them down into their constituent components so that they don't take half a hard drive to compute. It's not trivial, but it seems necessary if we want to keep running them on GPUGRID.
Thank you all in any case for hanging on through all the troubles of the large WUs.


Yes, I would also be interested in helping with the larger ones. How large an SSD are we talking about? 1, 2, 4 terabytes?
____________

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 50720 - Posted: 20 Oct 2018 | 13:57:17 UTC - in response to Message 50719.

Yes, I would also be interested in helping with the larger ones. How large an SSD are we talking about? 1, 2, 4 terabytes?

I like people who think big.

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Message 50722 - Posted: 20 Oct 2018 | 16:14:51 UTC - in response to Message 50720.

Well, not really SSD. HDD would be just fine. Right now scratch files are kept wherever BOINC's "slot" directory is (as expected from a well-behaved application).

3de64piB5uZAS6SUNt1GFDU9d...
Joined: 20 Apr 15
Posts: 285
Credit: 1,102,216,607
RAC: 0
Message 50723 - Posted: 20 Oct 2018 | 16:53:53 UTC - in response to Message 50720.
Last modified: 20 Oct 2018 | 16:54:07 UTC

Yes, I would also be interested in helping with the larger ones. How large an SSD are we talking about? 1, 2, 4 terabytes?


+1

Would a 2 TB HDD be enough?
____________
I would love to see HCF1 protein folding and interaction simulations to help my little boy... someday.

Zalster
Joined: 26 Feb 14
Posts: 211
Credit: 4,496,324,562
RAC: 0
Message 50724 - Posted: 20 Oct 2018 | 17:44:50 UTC - in response to Message 50722.

Well, not really SSD. HDD would be just fine. Right now scratch files are kept wherever BOINC's "slot" directory is (as expected from a well-behaved application).


HDD, cool. I see Amazon has a sale on the WD Red 4 TB NAS hard drive. Time to add one to the machine.
____________

AuxRx
Joined: 3 Jul 18
Posts: 22
Credit: 2,758,801
RAC: 0
Message 50725 - Posted: 20 Oct 2018 | 18:39:29 UTC - in response to Message 50723.

Not just any SSD will cut it. You'd need the expensive stuff with larger, faster write buffers. Or use several HDDs (each on a separate connection, with several BOINC instances to spread the slots around). Really, though, you should look into Optane, but it will cost a fortune.

AuxRx
Joined: 3 Jul 18
Posts: 22
Credit: 2,758,801
RAC: 0
Message 50726 - Posted: 20 Oct 2018 | 18:58:23 UTC - in response to Message 50715.

The argument that QMML has little value is disturbing to me. Either there is scientific justification or not. And please stick with a plan. All these abrupt changes make me question the research goal. I like to feel involved too, but the community shouldn't get to change course.

Further, I fear splitting the project will just complicate things for everyone.

And please always communicate upfront how your tasks will impact volunteers' systems (TBW, Conda requests, and so on). Volunteers should not have to find out for themselves.

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 50727 - Posted: 20 Oct 2018 | 20:05:49 UTC - in response to Message 50725.

Not just any SSD will cut it. You'd need the expensive stuff with larger, faster write buffers.

If you are referring to lifetime, I run three QC work units at a time (4 cores per WU) on an i7-8700. According to iostat, they are writing about 50 GB/day. That is not excessive; the SSD should last a normal lifetime.

As a matter of practice, I also use a write cache (12 GB, 2-hour latency), though that is not really necessary to protect the SSD. But since I have 32 GB of main memory, I like to use it.
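A sketch of one way to get that behavior on Linux, assuming the "write cache with latency" Jim describes is the kernel's dirty-page writeback tuning; the values below are illustrative, not his actual settings:

# /etc/sysctl.d/99-writeback.conf
vm.dirty_bytes = 12884901888             # allow ~12 GB of dirty (unwritten) pages
vm.dirty_background_bytes = 10737418240  # start background flushing near 10 GB
vm.dirty_expire_centisecs = 720000       # pages may stay dirty for ~2 hours
vm.dirty_writeback_centisecs = 6000      # writeback thread wakes every 60 s

Apply with "sudo sysctl --system". The trade-off is that a power loss can drop up to two hours of buffered writes.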

captainjack
Joined: 9 May 13
Posts: 171
Credit: 3,630,307,638
RAC: 18,649,274
Message 50728 - Posted: 20 Oct 2018 | 22:34:51 UTC

On Oct. 17, Stefan wrote:

"The new ones won't really require any disk space because they are smaller molecules."

BTW, the requirement for having 57,220.46 MB of disk space for the CPU tasks is still in effect. Does it need to be?

tullio
Joined: 8 May 18
Posts: 190
Credit: 104,426,808
RAC: 0
Message 50729 - Posted: 21 Oct 2018 | 4:30:49 UTC

We are up to 74 Linux users. But recent QC tasks run well on my two Linux boxen. The one with a GTX 750 Ti GPU board is also running a GPU task alongside a CPU task. All this on an Opteron 1210 of 2008 vintage. But this SUN workstation is still my main host.
Tullio

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 50731 - Posted: 21 Oct 2018 | 18:15:32 UTC - in response to Message 50729.
Last modified: 21 Oct 2018 | 18:35:36 UTC

The one with a GTX 750 Ti GPU board is also running a GPU task alongside a CPU task.

I find that I can run a GTX 750 Ti on my i7-8700 without having to reserve a core for it, and without any noticeable effect on the QCs running on all 12 cores. That is the efficiency of CUDA: the card shows only about 8 to 16% usage of a CPU core (or about 1% of the whole CPU).

It also works well for the i7-4770 machine when I use that, with somewhat higher CPU percentages. But it makes a great combination. And the 750 Ti gets all the work done in under 24 hours, while not expending much power.

Zalster
Joined: 26 Feb 14
Posts: 211
Credit: 4,496,324,562
RAC: 0
Message 50737 - Posted: 25 Oct 2018 | 18:32:37 UTC

So I have successfully installed and tested the new WD Red 4 TB NAS HDD in 2 of my computers. They are up and running again. However the project decides to proceed with the larger molecules, I hope to be prepared.
____________

Erich56
Joined: 1 Jan 15
Posts: 1132
Credit: 10,307,632,676
RAC: 29,649,566
Message 50738 - Posted: 26 Oct 2018 | 5:25:49 UTC - in response to Message 50737.

However the project decides to proceed with the larger molecules, I hope to be prepared.

My hope is that someday there will be a Windows version of the QC CPU tasks.

Stefan
Project administrator
Project developer
Project tester
Project scientist
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Message 50739 - Posted: 26 Oct 2018 | 9:41:10 UTC

There will be. It's a matter of priorities and time allocation. It's close to 90% complete, but right now we have zero time to dedicate to the Windows build.

tullio
Joined: 8 May 18
Posts: 190
Credit: 104,426,808
RAC: 0
Message 50742 - Posted: 26 Oct 2018 | 13:50:20 UTC - in response to Message 50739.

The final 10% always takes 90% of the time.
Tullio

Zalster
Joined: 26 Feb 14
Posts: 211
Credit: 4,496,324,562
RAC: 0
Message 50856 - Posted: 13 Nov 2018 | 8:10:03 UTC

Apologies... Looks like I trashed around 23 CPU work units. The internet was down for most of the day and I didn't notice until just now. Not sure if the internet being down caused the errors or if it was something with the computer, but it appears to be running normally now. Will losing internet after the work units download, but before they start to crunch, cause them to error if they can't contact the server when they start?
____________

AuxRx
Joined: 3 Jul 18
Posts: 22
Credit: 2,758,801
RAC: 0
Message 50857 - Posted: 13 Nov 2018 | 17:52:01 UTC - in response to Message 50856.

Long answer: Yes.
