Message boards : Multicore CPUs : New QC app

Author Message
Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Level
Ser
Scientific publications
watwatwatwat
Message 49759 - Posted: 2 Jul 2018 | 10:18:54 UTC
Last modified: 2 Jul 2018 | 10:52:36 UTC

Dears, after a hot weekend during which I accidentally cancelled QC WUs, we are ready to start again with a new app. As soon as we get it right, we should be able to run on more machines (gcc no longer a requirement).

There will be a largish download the first time you run app 329. If you want to free up some disk space, please reset the project (recommended, but not urgent).

kain
Send message
Joined: 3 Sep 14
Posts: 152
Credit: 826,175,036
RAC: 4,076,927
Level
Glu
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwat
Message 49760 - Posted: 2 Jul 2018 | 11:29:25 UTC

Ready for action ;)

tullio
Send message
Joined: 8 May 18
Posts: 190
Credit: 104,426,808
RAC: 0
Level
Cys
Scientific publications
wat
Message 49766 - Posted: 2 Jul 2018 | 16:30:41 UTC

First 330 task completed and validated. The second one is waiting for memory alongside a GPU task, which is however using only 4% of the total 8 GB of RAM.
Tullio

Stefan
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 49778 - Posted: 4 Jul 2018 | 8:47:07 UTC - in response to Message 49766.

I will let the WUs run out for a day because I want to see if something weird is happening on my side (the WUs are calculating fine, don't worry). I'll submit more once they are completed tomorrow.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 8,835,616,430
RAC: 19,791,081
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 50349 - Posted: 30 Aug 2018 | 13:43:42 UTC

@ Toni, @ Stefan

There's an error report in Number Crunching (This computer has finished a daily quota of 31 tasks) which suggests that the maximum upload size for a batch of QC tasks has been set too low.

Task name is 6955_1_15_16_18_dd130713_n00001-SDOERR_SELE2-0-1-RND2528

Toni
Message 50404 - Posted: 5 Sep 2018 | 14:34:10 UTC - in response to Message 50349.
Last modified: 5 Sep 2018 | 14:34:49 UTC

I think the actual error is a segmentation fault, which leaves large temporary files behind, and their transfer is attempted. Let's see if the situation improves with the new version.

If in doubt, please reset the project.

Zalster
Avatar
Send message
Joined: 26 Feb 14
Posts: 211
Credit: 4,496,324,562
RAC: 0
Level
Arg
Scientific publications
watwatwatwatwatwatwatwat
Message 50412 - Posted: 6 Sep 2018 | 5:32:45 UTC - in response to Message 50404.
Last modified: 6 Sep 2018 | 5:33:01 UTC

Just curious: is there someplace that shows how well the QC apps are running, like on the server status page? I can look at my own machines and see the errors there, but is there an overall view somewhere, like there is for the GPU apps?

Richard Haselgrove
Message 50415 - Posted: 6 Sep 2018 | 8:25:58 UTC - in response to Message 50404.

I think the actual error is a segmentation fault, which leaves large temporary files behind, and their transfer is attempted. Let's see if the situation improves with the new version.

BOINC won't attempt to upload a temporary file unless its name is specified with an upload URL in the workunit template.

I'll be able to advise better when you release the Windows app, and I can see any problems happening on my own machines.

Toni
Message 50417 - Posted: 6 Sep 2018 | 8:46:57 UTC - in response to Message 50415.

May well be restricted to a few machines.

Richard Haselgrove
Message 50418 - Posted: 6 Sep 2018 | 9:42:25 UTC - in response to Message 50417.

May well be restricted to a few machines.

If you want to count me in, I'll do my best to report on any issues that may arise (in beta mode if necessary). I'm primarily Windows 7, so the WSL approach would be difficult except for one dual-boot test machine with Windows 10 ready to run.

tullio
Message 50419 - Posted: 6 Sep 2018 | 10:21:20 UTC

My main Linux box is crunching SELE6 on its Opteron 1210. If necessary, I have a Windows 10 PC with an AMD A10-6700 and 22 GB RAM.
Tullio

[VENETO] boboviz
Send message
Joined: 10 Sep 10
Posts: 158
Credit: 388,132
RAC: 0
Level

Scientific publications
wat
Message 50420 - Posted: 6 Sep 2018 | 12:24:27 UTC

again

<message>
upload failure: <file_xfer_error>
<file_name>5516_14_15_18_19_8125b500_n00001-SDOERR_SELE6-0-1-RND8343_0_1</file_name>
<error_code>-131 (file size too big)</error_code>
</file_xfer_error>


Do I have to reset the project?

Profile Chilean
Avatar
Send message
Joined: 8 Oct 12
Posts: 98
Credit: 385,652,461
RAC: 0
Level
Asp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 50421 - Posted: 6 Sep 2018 | 12:26:27 UTC

CPU apps not working very well. Lots of idle time. It may be because of the 48 threads... some pythons use 4 cores, the rest just 1 and not all the time. It's gotta be an I/O issue (?).

Toni
Message 50422 - Posted: 6 Sep 2018 | 13:18:05 UTC - in response to Message 50421.

Each task should use 4 threads max.

[VENETO] boboviz
Message 50423 - Posted: 6 Sep 2018 | 15:16:03 UTC - in response to Message 50420.

again
<message>
upload failure: <file_xfer_error>
<file_name>5516_14_15_18_19_8125b500_n00001-SDOERR_SELE6-0-1-RND8343_0_1</file_name>
<error_code>-131 (file size too big)</error_code>
</file_xfer_error>


Do I have to reset the project?


Another clue: this happens when I reboot the virtual machine (and the WU restarts).

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 50424 - Posted: 6 Sep 2018 | 17:11:33 UTC
Last modified: 6 Sep 2018 | 17:13:33 UTC

I have had good luck for the past day running QC on my i7-4770 (Ubuntu 16.04). That doesn't prove much, except that there is no fatal flaw in all the work units.

And I am limiting them to two cores per work unit, and three work units at a time. That gives me essentially the same output as four cores on two work units at a time, but leaves over a little more CPU support for my GTX 1070 on Folding. All in all, it seems to be working fine.

(boboviz - I wouldn't draw conclusions from virtual machines. Even LHC has a hard time with their own stuff.)

Richard Haselgrove
Message 50425 - Posted: 6 Sep 2018 | 17:32:31 UTC - in response to Message 50423.

(and the wu re-starts)

I think that is indeed a clue. One of the mechanisms I considered was a WU re-starting, and appending a second result to an existing file, doubling its size.

Checking the 'headroom' between the typical result file size and <max_nbytes> was one of the tests I had in mind for the Windows version. Can anyone comment?
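
A rough sketch of that headroom check, using made-up numbers: neither the typical QC result size nor the project's actual <max_nbytes> value is given in this thread, so both constants below are hypothetical. It also models the suspected failure mode, where each restart appends another copy of the result to the existing file.

```python
def headroom(file_size: int, max_nbytes: int) -> float:
    """Fraction of the upload limit still unused (negative if over the limit)."""
    if max_nbytes <= 0:
        raise ValueError("max_nbytes must be positive")
    return 1.0 - file_size / max_nbytes


def restarts_until_too_big(result_size: int, max_nbytes: int) -> int:
    """Restarts tolerated if every restart appends one more copy of the result.

    Returns 0 when even a single extra copy would exceed the limit.
    """
    if result_size <= 0:
        raise ValueError("result_size must be positive")
    copies_that_fit = max_nbytes // result_size
    return max(copies_that_fit - 1, 0)


# Hypothetical values, for illustration only.
TYPICAL_RESULT_BYTES = 60_000_000
MAX_NBYTES = 100_000_000

if __name__ == "__main__":
    print(f"headroom: {headroom(TYPICAL_RESULT_BYTES, MAX_NBYTES):.0%}")
    print("restarts tolerated:",
          restarts_until_too_big(TYPICAL_RESULT_BYTES, MAX_NBYTES))
```

With these assumed figures a single restart would already push the file past the limit, which would match the "file size too big" (-131) reports if the appending theory is right.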

tullio
Message 50426 - Posted: 6 Sep 2018 | 18:15:16 UTC

Most LHC users are Windows users and they run Scientific Linux research programs from CERN (not BOINC programs) using Virtual Machines.
Tullio

Jim1348
Message 50427 - Posted: 7 Sep 2018 | 11:41:06 UTC
Last modified: 7 Sep 2018 | 11:42:54 UTC

This is interesting. Twice overnight my PC crashed. Shut down. Didn't run.
But each time, I saw no errors in the BoincTasks log, or in the Folding log either. The Folding work unit just continued from where it left off.

But each time, a new set of (three) QC tasks started after starting up the PC.

And now I see the same old error message in the stderr.txt file:

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>2360_16_18_20_21_e0c95459_n00001-SDOERR_SELE6-0-1-RND3935_0_1</file_name>
<error_code>-131 (file size too big)</error_code>
</file_xfer_error>
</message>

http://www.gpugrid.net/result.php?resultid=18686175

The messages don't appear immediately after the crashes, but a few hours later.
And I just attached this machine to GPUGrid a couple of days ago. So it didn't take long.

[VENETO] boboviz
Message 50430 - Posted: 7 Sep 2018 | 13:45:25 UTC - in response to Message 50424.

(boboviz - I wouldn't draw conclusions from virtual machines. Even LHC has a hard time with their own stuff.)


Do you think I have to use a physical Linux machine?

Jim1348
Message 50431 - Posted: 7 Sep 2018 | 13:59:07 UTC - in response to Message 50430.

(boboviz - I wouldn't draw conclusions from virtual machines. Even LHC has a hard time with their own stuff.)


Do you think I have to use a physical Linux machine?

Don't know. Real Linux hasn't fixed the problem for me. But the VMs just add another layer of complexity and, from what I understand (I am not an expert), hide the various other problems. At least that is what they say on the LHC forum, where they would like to get away from VirtualBox if that were possible. It is why they developed native ATLAS, and would like to do that for the other projects if it were possible.

tullio
Message 50432 - Posted: 7 Sep 2018 | 14:42:04 UTC - in response to Message 50431.
Last modified: 7 Sep 2018 | 14:48:30 UTC

I am running GPUGRID, both CPU and GPU, on a SuSE Linux box with a GTX 750 Ti GPU board. I am running Atlas@home of LHC on a Windows 10 PC with 4 cores (but Task Manager says two cores and 4 logical processors on an AMD A10-6700 CPU). It has a GTX 1050 Ti GPU board, but GPUGRID overheats it to 80 C and it crashes, so I am running Atlas (no GPU but VirtualBox 5.2.18), Einstein@home and SETI@home, both CPU and GPU.
Tullio
Atlas native runs only on Ubuntu Linux; it does not run on my SuSE Linux nor on Windows.

Toni
Message 50433 - Posted: 7 Sep 2018 | 17:25:03 UTC - in response to Message 50432.

Restarts may be a problem. It's not really the output file size (which should be small), but the fact that temporary files are not deleted as the consequence of some other error.

Can someone reliably reproduce the problem (i.e., by stopping and restarting a WU)? If so, is the problem solved by enabling the "keep application in memory" option?

Jim1348
Message 50434 - Posted: 7 Sep 2018 | 17:59:00 UTC - in response to Message 50433.
Last modified: 7 Sep 2018 | 18:32:00 UTC

If so, is the problem solved by enabling the "keep application in memory" option?

I don't think I can reliably reproduce it, but I have "Leave application in memory" enabled (as is my usual practice), and that does not prevent it.

EDIT: Also, I should point out that there were no other BOINC applications running, and my machine runs 24/7, so the QC work units were never being suspended anyway.

Zalster
Message 50448 - Posted: 8 Sep 2018 | 16:46:20 UTC
Last modified: 8 Sep 2018 | 16:47:25 UTC

So I'm curious. When we see the results for a QC work unit, it lists both a run time and a CPU time. Since we are so used to run time and CPU time being a linear measurement, I was wondering if CPU time for the QC is actually a combined sum total for all CPU threads being used.

I've calculated out how long the CPU time is and it's much higher than the actual run time I'm seeing on the machine. What does make sense is if it's the sum total of 4 threads all running at the same time for a set amount of time.

CPU time = N threads x actual run time

Is this correct? Mostly for my own curiosity.
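
One way to sanity-check that relation, as a sketch only: if the reported CPU time is summed over all threads (an assumption here, not something the project confirms in this thread), then dividing it by N threads x run time gives the average utilization per thread. The figures in the example are made up.

```python
def thread_utilization(cpu_time_s: float, run_time_s: float, n_threads: int) -> float:
    """Average fraction of each thread's wall-clock time spent on the CPU.

    Assumes cpu_time_s is the sum of CPU seconds over all threads.
    """
    if run_time_s <= 0 or n_threads <= 0:
        raise ValueError("run_time_s and n_threads must be positive")
    return cpu_time_s / (run_time_s * n_threads)


if __name__ == "__main__":
    # Hypothetical task: 1200 s of run time, 4 threads, 4320 s of CPU time.
    print(f"{thread_utilization(4320, 1200, 4):.0%}")  # prints 90%
```

A value close to 100% would support CPU time = N threads x run time; a much lower value would point to threads sitting idle, for example while waiting on disk.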

PappaLitto
Send message
Joined: 21 Mar 16
Posts: 511
Credit: 4,672,242,755
RAC: 0
Level
Arg
Scientific publications
watwatwatwatwatwatwatwat
Message 50449 - Posted: 8 Sep 2018 | 16:58:35 UTC

Since the utilization of the CPU is so low on these WUs I presume it doesn't count most of the run time as CPU time because the CPU is waiting for the hard drive.

Jim1348
Message 50451 - Posted: 9 Sep 2018 | 14:22:48 UTC
Last modified: 9 Sep 2018 | 14:58:04 UTC

I thought I would try my Ryzen 1700 on QC (Ubuntu 18.04), to see if it would behave differently than my Intel machines (i7-4770, i7-8700).
The first work unit errored:

CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://repo.anaconda.com/pkgs/r/noarch/repodata.json.bz2>
Elapsed: -

An HTTP error occurred when trying to retrieve this URL.
HTTP errors are often intermittent, and a simple retry will get you on your way.

If your current network has https://www.anaconda.com blocked, please file
a support request with your network engineering team.

ConnectionError(MaxRetryError("HTTPSConnectionPool(host='repo.anaconda.com', port=443): Max retries exceeded with url: /pkgs/r/noarch/repodata.json.bz2 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f60af4b15c0>: Failed to establish a new connection: [Errno -2] Name or service not known',))",),)

A reportable application error has occurred. Conda has prepared the above report.
Upload did not complete.10:07:33 (1830): /usr/bin/flock exited; CPU time 11.796829
10:07:33 (1830): app exit status: 0x1
10:07:33 (1830): called boinc_finish(195)


This may be a somewhat different error message than the others, but it seems to me that they are all communications-related. I suspect it has to do with the intermittent connections I have been getting to GPUGrid for the past several weeks/months, as previously discussed.
http://www.gpugrid.net/forum_thread.php?id=4806

EDIT: The next two are running OK, and it looks like they will complete normally. It is a very intermittent problem.
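
The conda message above suggests that a simple retry often gets past intermittent HTTP errors. A minimal sketch of retrying with exponential backoff follows; `fetch` is a hypothetical callable standing in for whatever download step fails, and this is not a description of how the QC wrapper or conda actually handles it.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(fetch: Callable[[], T],
          attempts: int = 4,
          base_delay: float = 1.0,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fetch() until it succeeds, doubling the wait after each failure.

    Only ConnectionError is retried; the last failure is re-raised.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            # Waits base_delay, then 2x, then 4x, and so on.
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

The `sleep` parameter exists only so the waiting can be replaced in tests; in normal use the default `time.sleep` applies.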

STARBASEn
Avatar
Send message
Joined: 17 Feb 09
Posts: 91
Credit: 1,603,303,394
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 50453 - Posted: 10 Sep 2018 | 4:20:16 UTC - in response to Message 50448.

So I'm curious. When we see the results for a QC work unit, it lists both a run time and a CPU time. Since we are so used to run time and CPU time being a linear measurement, I was wondering if CPU time for the QC is actually a combined sum total for all CPU threads being used.

I've calculated out how long the CPU time is and it's much higher than the actual run time I'm seeing on the machine. What does make sense is if it's the sum total of 4 threads all running at the same time for a set amount of time.

CPU time = N threads x actual run time

Is this correct? Mostly for my own curiosity.


I suspect that CPU time = (0.95 X N CPUs X run time + any time the computer is in use and background overhead).

These darn newer WU's are so memory hungry that latency on all my machines running them is so long, the computers are becoming useless to me. I may have to quit running these until the memory issue is addressed. Both my 8-core FX machines have 16GB RAM, and with one QC job running just 4 cores, my swap usage is as high as 7% and it takes forever to get the machine to do what I need. Plus I had to repartition the root dirs on 6 machines to accommodate the increased HD demands.

Not sure how much longer I can hold out.

Stefan
Message 50454 - Posted: 10 Sep 2018 | 7:35:24 UTC - in response to Message 50453.

The QC jobs should use 4GB of RAM each. If you are swapping just don't run as many in parallel. You will never finish them anyway if you end up swapping.
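
As a back-of-the-envelope sketch of that advice: taking the 4 GB per task figure above, plus a reserve for the OS and other work that is purely an assumed value, the number of QC tasks that fit in RAM without swapping can be estimated like this.

```python
def max_parallel_qc_tasks(total_ram_gb: float,
                          per_task_gb: float = 4.0,
                          reserved_gb: float = 4.0) -> int:
    """Estimate how many QC tasks fit in RAM at once.

    per_task_gb follows the 4 GB figure quoted in the thread;
    reserved_gb (OS, GPU tasks, desktop use) is an assumption.
    """
    if per_task_gb <= 0:
        raise ValueError("per_task_gb must be positive")
    usable = total_ram_gb - reserved_gb
    if usable <= 0:
        return 0
    return int(usable // per_task_gb)


if __name__ == "__main__":
    for ram in (8, 16, 32):
        print(f"{ram} GB RAM -> {max_parallel_qc_tasks(ram)} QC task(s)")
```

Since task sizes vary, treating the result as an upper bound and starting one lower is probably the safer choice.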

STARBASEn
Message 50464 - Posted: 10 Sep 2018 | 18:55:35 UTC - in response to Message 50454.

The QC jobs should use 4GB of RAM each. If you are swapping just don't run as many in parallel. You will never finish them anyway if you end up swapping.


Yes, understood and confirmed; however:

These last two posts by me are provided simply as FYI hopefully for the benefit of the project and are a summary of my experiences crunching the QC WU's, not a complaint.

Before these newer QC WU's, I was able to run two mt jobs 4 threads each (following fixing the simultaneous start bug) and acemd or E@H concurrently on my two FX machines without any memory or latency issues. With these newer jobs, I can only run a single WU 4 threads (WCG and acemd on the remaining cores). After about 12 hours or so of run time, the swap file begins to be utilized and of course latency (my intervention to use the computer) starts increasing up to seconds before responding. I have not found out what application(s) actually use the extra ram causing the swap to be invoked (probably not the QC app because they finish quickly, usually around 15 - 30 mins) but the swap usage gradually increases with time with 7% usage being highest observed to date. The swap does not appear to be utilized consistently but rather for short increments of time even when I am not using the machine. Is it possible that the QC app is returning most but not all memory it uses back to the memory pool as calc's are completed?

I have 6 computers running QC. Four are headless 4-core crunchers, and as long as they provide valid results I leave them alone. On the two FX machines with consoles, though, the latency issue leaves me with little choice but to consider either stopping the QC apps on the FX's or cutting them down to 2 or 3 threads to see if that works. I will try the latter before I stop QC on the FX's, but that is going to increase turnaround time and undo the benefits of using the extra RAM and multiple cores to speed up turnaround, which seems to create a catch-22 (2 concurrent WU's 4 threads each taking longer but returning 2 WU's/real time vs running a single WU 2 threads quicker but returning 1 WU/real time).

Jim1348
Message 50465 - Posted: 10 Sep 2018 | 20:17:36 UTC - in response to Message 50464.

Is it possible that the QC app is returning most but not all memory it uses back to the memory pool as calc's are completed?

That is an interesting question, and could explain some of the random errors I have been getting. But QC is running OK now on an i7-3770, running four work units at a time with 2 cores per WU. I see memory usage up around 4 GB per work unit though, so it is fortunate I have 32 GB. That has not prevented the errors in the past on comparable machines, but I have found that when it works, don't touch it.

STARBASEn
Message 50478 - Posted: 12 Sep 2018 | 2:43:32 UTC - in response to Message 50465.


That is an interesting question, and could explain some of the random errors I have been getting. But QC is running OK now on an i7-3770, running four work units at a time with 2 cores per WU. I see memory usage up around 4 GB per work unit though, so it is fortunate I have 32 GB. That has not prevented the errors in the past on comparable machines, but I have found that when it works, don't touch it.


Same here, don't mess with a working situation. I should have gone full 32 GiB when I last upgraded RAM. Darn, I went 4 x 2 initially and later added 4 x 2 more so now I have to buy all 32 G (8 x 4) rather than just add 16 G more, to the tune of around 250 USD each FX box. My rule has been 2 G per thread/core but in this situation 4 G /thread appears minimum, in fact, all my ATX can take.

While running 4 WU's 2 threads each, have you noticed any swap file usage with the 32 G RAM?

Jim1348
Message 50480 - Posted: 12 Sep 2018 | 10:29:26 UTC - in response to Message 50478.
Last modified: 12 Sep 2018 | 10:33:36 UTC

While running 4 WU's 2 threads each, have you noticed any swap file usage with the 32 G RAM?

I have set swappiness to never use swap: sudo sysctl vm.swappiness=0

But I don't think I would notice it anyway, since it is a dedicated machine and I don't have a way to check it. But whenever I run "free", I always see plenty of free/available memory.

Currently, it is 3 GB free, and 22 GB available, but it varies a lot. However, I haven't seen less than 18 GB available.
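
For anyone without a convenient way to check swap use, here is a minimal sketch that reads /proc/meminfo on Linux. The helper names are made up for illustration; the fields used (SwapTotal, SwapFree, MemAvailable) are the standard kernel ones, reported in kB.

```python
import os


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field name: value in kB}."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def swap_used_kb(meminfo: dict) -> int:
    """Swap currently in use, in kB."""
    return meminfo["SwapTotal"] - meminfo["SwapFree"]


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        info = parse_meminfo(f.read())
    print(f"swap used:     {swap_used_kb(info) / 1024:.0f} MiB")
    print(f"mem available: {info.get('MemAvailable', 0) / 1024:.0f} MiB")
```

Running it periodically (for example from cron) and logging the output would show whether swap use creeps up over several days of crunching.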

Jim1348
Message 50481 - Posted: 12 Sep 2018 | 10:54:04 UTC - in response to Message 50478.

I should have gone full 32 GiB when I last upgraded RAM. Darn, I went 4 x 2 initially and later added 4 x 2 more so now I have to buy all 32 G (8 x 4) rather than just add 16 G more, to the tune of around 250 USD each FX box. My rule has been 2 G per thread/core but in this situation 4 G /thread appears minimum, in fact, all my ATX can take.

Just leave it at the default 4 cores per work unit. I need the extra memory only because I am using 2 cores per work unit, and then running 4 work units at a time.

STARBASEn
Message 50482 - Posted: 12 Sep 2018 | 18:28:50 UTC - in response to Message 50481.

Just leave it at the default 4 cores per work unit.


Experimenting, I went to 3 cores with only 1 QC job at a time and that helped just about eliminate the user latency issues. I can live with it now but as expected, the real times increased. Now, after a fresh boot, I have plenty of free memory but over time it starts pushing toward the limit. Would be interesting to find out if the QC app is faithfully returning all memory used back to the system after each of the calc's.

Jim1348
Message 50483 - Posted: 12 Sep 2018 | 21:43:24 UTC - in response to Message 50482.

Would be interesting to find out if the QC app is faithfully returning all memory used back to the system after each of the calc's.

Even though I show 3 GB memory free, and 21 GB available at the moment, it still shows that 361 MB of the swap file is used (out of 2 GB total), even though I have swappiness set to 0. I don't know what that means, but the machine has not been rebooted for three days, and still has plenty of memory left. But it could be using it up.

(It is getting hard to post again, due to all the browser timeouts. It is a wonder I am able to get work at all.)

STARBASEn
Message 50484 - Posted: 12 Sep 2018 | 22:03:29 UTC - in response to Message 50483.
Last modified: 12 Sep 2018 | 22:06:22 UTC

(It is getting hard to post again, due to all the browser timeouts. It is a wonder I am able to get work at all.)


I hear you re the website. I have been having the same issues for several days with the browser timeouts and having to resend data just to complete a post.

On the swap file issue on your machine, try swapoff -a as root or sudo. That for sure disables swapping. I use it to flush the swap. Use swapon -a to re-enable swap.

Edit: I use an RPM distro, so I am not sure if swapoff is available on a DEB-based Linux.

Jim1348
Message 50485 - Posted: 12 Sep 2018 | 22:13:33 UTC - in response to Message 50484.

On the swap file issue on your machine, try swapoff -a as root or sudo. That for sure disables swapping. I use it to flush the swap. Use swapon -a to re-enable swap.

Very good. I did sudo swapoff -a, and now swap shows as zero. We will see how it goes.

tullio
Message 50486 - Posted: 13 Sep 2018 | 3:53:03 UTC
Last modified: 13 Sep 2018 | 3:54:16 UTC

I have a GPU task and a CPU task running on my two-core Opteron 1210, 8 GB DDR2 RAM, GTX 750 Ti at 1202 MHz clock, 61 C. OS is SuSE Leap 42.3. Swap space is used at 37% of 2 GB. My HP Linux laptop, also running SuSE Leap 15.0, has a 7 GB swap space, not used. It is not running GPUGRID tasks because BOINC space is only 30 GB instead of the 760 GB of my main Linux host, a 2008 vintage SUN workstation, running 24/7 since January 2008. Hats off to SUN!
Tullio

STARBASEn
Message 50490 - Posted: 13 Sep 2018 | 15:19:34 UTC - in response to Message 50485.

Very good. I did sudo swapoff -a, and now swap shows as zero. We will see how it goes.


Remember when you reboot, you will have to use the command again unless you write a script and set it up to auto run at boot.

I have a GPU task and a CPU task running on my two-core Opteron 1210, 8 GB DDR2 RAM, GTX 750 Ti at 1202 MHz clock, 61 C. OS is SuSE Leap 42.3. Swap space is used at 37% of 2 GB. My HP Linux laptop, also running SuSE Leap 15.0, has a 7 GB swap space, not used.


How is your user latency (I define as delay between user requests and computer response)? One of my FX computers has been running QC jobs about 20 hours since a fresh reboot and already swap usage is at 4% (333 of 8047 MiB) and I am only using 3 threads of 8 available. I open boincmgr and it is taking up to 10 seconds to communicate to localhost. I am really suspecting this newer QC app is leaving some small percentage of dirty ram pages behind after completing the calc's and they add up over time to swap usage and user latency.

tullio
Message 50491 - Posted: 13 Sep 2018 | 16:03:35 UTC - in response to Message 50490.
Last modified: 13 Sep 2018 | 16:24:42 UTC

I am using this computer to read my mail, to navigate the WWW, and to read the newspapers, including the NYTimes, which leaves me ten free articles/month, and I feel no delay. I have a 30 Mbit/s mixed fiber/copper connection by Telecom Italy, which means that the fiber reaches a cabinet not far from my home, and from there I have a copper connection. My router is also connected by WiFi to a Windows 10 PC, a printer and a decoder which gives me SKY TV on the TV set, which is also the monitor of the Windows PC. I just had a Microsoft update on the Windows 10 PC and it restarted with its two Einstein@home tasks.
Tullio
I also have a smartphone running Android 7.1.1 on its 8-core 64-bit ARM CPU, connected by WiFi to the router. It is running SETI@home and Einstein@home CPU tasks.

Toni
Message 50492 - Posted: 13 Sep 2018 | 17:24:36 UTC - in response to Message 50491.

Concerning returning the memory when the WU is over: that's sure. The OS enforces that.

Concerning swapoff : I don't recommend that. A small amount of swap use is normal. If you remove this "safety valve", the only thing that will happen is that processes will just fail (often in confusing ways).

My suggestion is that you pay attention to your system's performance during heavy boinc use. If it becomes sluggish/irritating/unusable, swap use is probably going up -- you'll have to run fewer tasks simultaneously. Removing the swap will make the system kill them.

This said, QC tasks come in various sizes, so you may hit an "unfortunate" combination of large ones.

T

Zalster
Avatar
Send message
Joined: 26 Feb 14
Posts: 211
Credit: 4,496,324,562
RAC: 0
Level
Arg
Scientific publications
watwatwatwatwatwatwatwat
Message 50493 - Posted: 13 Sep 2018 | 17:43:26 UTC

It doesn't happen a lot, but I am still occasionally getting these, which cause errors.


</stderr_txt>
<message>
finish file present too long
</message>
]]>

Toni
Message 50494 - Posted: 13 Sep 2018 | 17:45:07 UTC - in response to Message 50493.

General question to windows users: do you see "black windows" like a command prompt coming up when running QC apps?

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1341
Credit: 7,659,123,724
RAC: 13,383,585
Level
Tyr
Scientific publications
watwatwatwatwat
Message 50500 - Posted: 13 Sep 2018 | 21:32:53 UTC - in response to Message 50493.

It doesn't happen a lot, but I am still occasionally getting these, which cause errors.


</stderr_txt>
<message>
finish file present too long
</message>
]]>

Those are my bane too. Nothing can be done about them. The usual advice is not to quit BOINC just as a task is finishing up, but it happens even when you never quit BOINC. It can happen, for whatever reason, that one task finishes just as another project's task starts and BOINC takes too long to report the task. Hence the error.

captainjack
Send message
Joined: 9 May 13
Posts: 171
Credit: 3,570,519,747
RAC: 18,202,645
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 50502 - Posted: 13 Sep 2018 | 23:26:43 UTC
Last modified: 13 Sep 2018 | 23:31:24 UTC

Toni asked:

General question to windows users: do you see "black windows" like a command prompt coming up when running QC apps?


Just started my first QC app on windows. About 1 minute after the task started, a "black window" flashed up on the display then immediately disappeared.


About 15 minutes after the task started, there is a "python" app listed in windows task manager that I assume is the QC app. It is using all available threads on a 16 thread system. In BOINC Manager, the task shows that it should be using "4 CPU's".


Let me know if you need more info.

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Level
Ser
Scientific publications
watwatwatwat
Message 50505 - Posted: 14 Sep 2018 | 10:42:17 UTC - in response to Message 50502.
Last modified: 14 Sep 2018 | 10:43:18 UTC

Toni asked:

General question to windows users: do you see "black windows" like a command prompt coming up when running QC apps?


Just started my first QC app on windows. About 1 minute after the task started, a "black window" flashed up on the display then immediately disappeared.


About 15 minutes after the task started, there is a "python" app listed in windows task manager that I assume is the QC app. It is using all available threads on a 16 thread system. In BOINC Manager, the task shows that it should be using "4 CPU's".


Let me know if you need more info.


Ok thanks. If the flashing is not annoying, I'd leave it as it is.

Regarding threads - the python app is indeed QC. Are you running multiple WUs simultaneously, or did just one WU occupy all 16 threads?

Thanks a lot

captainjack
Send message
Joined: 9 May 13
Posts: 171
Credit: 3,570,519,747
RAC: 18,202,645
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 50508 - Posted: 14 Sep 2018 | 12:48:36 UTC

Toni asked:

Regarding threads - the python app is indeed QC. Are you running multiple WUs simultaneously, or did just one WU occupy all 16 threads?

I was running 1 task that took all threads. I will try to get another task and run it when one becomes available.

captainjack
Send message
Joined: 9 May 13
Posts: 171
Credit: 3,570,519,747
RAC: 18,202,645
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 50588 - Posted: 21 Sep 2018 | 15:49:54 UTC

Just had 8 tasks error out with the following error code.

SafetyError: The package for hdf5 located at /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/pkgs/hdf5-1.10.2-hba1933b_1
appears to be corrupted. The path 'lib/libhdf5.so.101.1.0'
has a sha256 mismatch.
reported sha256: d8628337423317dafe2d7f1f5029bfbb5cd22428fbf97e81678cc0db0e93c2c2
actual sha256: 6ef2e91ed97113943149adc62ad53c6d831db045c015176b13d3f7fd8f9e1c0f



I tried a reset on the project but that computer is now locked out for the rest of the day. Anything else I need to do?
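To confirm whether the cached package file really differs from what conda expects, its SHA-256 can be recomputed locally and compared with the "reported sha256" value in the error. The following is only a minimal sketch: the function name `sha256_of_file` is made up for illustration, and the file to check would be the `lib/libhdf5.so.101.1.0` path inside the package directory named in the message.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`.

    The file is read in chunks so that large libraries do not have to
    fit in memory at once.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If the result equals the "actual sha256" from the error rather than the "reported" one, the file on disk is the problem, and a fresh download (which a project reset should trigger) is the likely fix.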

Zalster
Avatar
Send message
Joined: 26 Feb 14
Posts: 211
Credit: 4,496,324,562
RAC: 0
Level
Arg
Scientific publications
watwatwatwatwatwatwatwat
Message 51019 - Posted: 10 Dec 2018 | 14:23:29 UTC

Looks like we have made headway on these work units. Just over a thousand left. Will be interesting to see if they bring back the other set.
____________

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 51020 - Posted: 10 Dec 2018 | 16:05:34 UTC - in response to Message 51019.

Will be interesting to see if they bring back the other set.

I could use some longer ones, as long as the disc space problem is solved.
(I have 170 GB free, so it should not be a problem, though that remains to be seen.)

Stefan
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 51021 - Posted: 11 Dec 2018 | 9:20:21 UTC

Sorry for running out. Give me some time. I'm at a conference so it's not super easy to spawn new work units right now but I'll give it a try.

Stefan
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 51022 - Posted: 11 Dec 2018 | 11:36:17 UTC

Actually I think I will let them run out to not resubmit the same workunits again. I will send new ones out tomorrow or the day after.

Zalster
Avatar
Send message
Joined: 26 Feb 14
Posts: 211
Credit: 4,496,324,562
RAC: 0
Level
Arg
Scientific publications
watwatwatwatwatwatwatwat
Message 51023 - Posted: 11 Dec 2018 | 16:51:43 UTC - in response to Message 51022.

No worries, it was more of a comment than a complaint. Enjoy your conference, we aren't going anyplace hahaha...
____________

kain
Send message
Joined: 3 Sep 14
Posts: 152
Credit: 826,175,036
RAC: 4,076,927
Level
Glu
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwat
Message 51024 - Posted: 11 Dec 2018 | 18:08:20 UTC

Can you give us any comment about the results of this batch? Was it useful?

Stefan
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 51040 - Posted: 14 Dec 2018 | 14:00:23 UTC

@kain The easy answer is "yes". This is a machine learning project so the more data you throw at it the better the predictor becomes. If you want to know if we solved the problem we set out to solve then the answer is "not totally yet" although we will publish something on it very soon.

Stefan
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 51170 - Posted: 1 Jan 2019 | 10:51:03 UTC

I'm going to let the workunits run out again to make sure I don't recalculate them. In a few days I'll send out new ones.

Zalster
Avatar
Send message
Joined: 26 Feb 14
Posts: 211
Credit: 4,496,324,562
RAC: 0
Level
Arg
Scientific publications
watwatwatwatwatwatwatwat
Message 51171 - Posted: 1 Jan 2019 | 14:58:07 UTC - in response to Message 51170.

OK, thanks Stefan. Happy New Year!
____________

Stefan
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 51200 - Posted: 7 Jan 2019 | 7:58:38 UTC

If everything goes smoothly I ought to have the last workunits out by tomorrow, or the day after at the latest.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 51203 - Posted: 7 Jan 2019 | 9:57:51 UTC - in response to Message 51200.

If everything goes smoothly I ought to have the last workunits out by tomorrow, or the day after at the latest.

No problem. We are ready when you are. But the term "the last" is ominous. Is it the end of the road for QC, or just for this experiment?

Stefan
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 51204 - Posted: 7 Jan 2019 | 11:19:17 UTC - in response to Message 51203.
Last modified: 7 Jan 2019 | 11:19:26 UTC

No, I think there will be more in the future. But we will have to consider well what to run since the large molecules were a bit of a pain. Some molecule fragmentation might solve the issue. But there might be some downtime on QM jobs, hard to tell.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 51209 - Posted: 7 Jan 2019 | 17:51:58 UTC - in response to Message 51204.

That is perfectly OK. It is much better for projects to tell us what is going on than to leave us hanging. You never know if they are dead or alive. At least you are alive.

Stefan
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 51214 - Posted: 8 Jan 2019 | 12:55:02 UTC

So, it seems like it will take 2 more days or so, since other people are running stuff on the machine that calculates the workunits. It's a fairly massive amount of data, so please bear with it a few more days.

3de64piB5uZAS6SUNt1GFDU9d...
Avatar
Send message
Joined: 20 Apr 15
Posts: 285
Credit: 1,102,216,607
RAC: 0
Level
Met
Scientific publications
watwatwatwatwatwat
Message 51215 - Posted: 8 Jan 2019 | 19:38:42 UTC - in response to Message 51214.

Which machine do you use for calculating the WUs? Just in case some of us are willing to donate some money for hardware upgrades at your institute.
____________
I would love to see HCF1 protein folding and interaction simulations to help my little boy... someday.

Stefan
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 51219 - Posted: 9 Jan 2019 | 7:42:20 UTC - in response to Message 51215.

Although I love the support, I feel like it's an exercise in futility. Eventually all machines we have become overrun by one or more of our lab members. No machine is safe :D Even if I had one dedicated to myself I might use it for something else while not calculating WUs (would be a shame to have it idle) and then when we run out of WUs I would need to deal with the same issue. I guess we'll just have to wait one or two more days. It's roughly 80% done. Worst case if it's still not done in two days I'll talk with the lab member currently squeezing the living juices out of it to stop his jobs for a moment if possible.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 401
Credit: 16,755,010,632
RAC: 220,113
Level
Trp
Scientific publications
watwatwat
Message 51220 - Posted: 9 Jan 2019 | 17:26:02 UTC

Does QC still have a problem when multiple WUs start at the same time???

I loaded Rosetta jobs and watched them in System Monitor. They appeared to load one by one. The first CPU ramped up to 100% and then a minute or so later the second ramped up... I wonder if they have a solution they'd share.

I'm thinking of a script that could replicate that behavior.
Once the QC WUs have downloaded, the app_config.xml file could be iteratively modified:

<app>
<name>QC</name>
<max_concurrent>1</max_concurrent>
</app>

Wait a minute and change 1 to 2 and tell BOINC to Read Config files.
<app>
<name>QC</name>
<max_concurrent>2</max_concurrent>
</app>

Wait a minute and change 2 to 3 and tell BOINC to Read Config files.
Repeat until max desired WUs is reached.
Finally, reset N to 1 so that, in case of a reboot, the WUs restart one at a time.

E.g., I have a Xeon E5-2699 v4 with 22c/44t. I could conceivably run ten 4-CPU WUs and leave 4 threads for the GPUs.

When the next batch of QC WUs posts I'm going to babysit a computer and do this manually. I've never written a script, so I'm reading a book and learning it now. If someone knows how to code this I'd be glad to test drive their script for them.
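The ramp-up described above could be sketched in Python roughly as follows. This is an untested illustration under several assumptions, not a finished tool: the function names are invented, `boinccmd` is assumed to be on the PATH and allowed to talk to the local client, and `boinccmd --read_cc_config` is assumed to make the client reread app_config.xml the same way the Manager's "Read config files" menu entry does. Note that the sketch overwrites the whole app_config.xml, so any other settings in that file would be lost.

```python
import subprocess
import time
from pathlib import Path

# Minimal app_config.xml limiting one app to N simultaneously running tasks.
APP_CONFIG_TEMPLATE = """<app_config>
  <app>
    <name>{app_name}</name>
    <max_concurrent>{max_concurrent}</max_concurrent>
  </app>
</app_config>
"""


def render_app_config(app_name, max_concurrent):
    """Return the text of an app_config.xml with the given task limit."""
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    return APP_CONFIG_TEMPLATE.format(
        app_name=app_name, max_concurrent=max_concurrent
    )


def ramp_up(project_dir, app_name="QC", target=10, delay_seconds=60,
            reread=None, sleep=time.sleep):
    """Raise max_concurrent from 1 to `target`, one step per delay.

    `project_dir` is the project's directory inside the BOINC data
    directory. `reread` and `sleep` can be replaced for dry runs.
    """
    config_path = Path(project_dir) / "app_config.xml"
    if reread is None:
        # Assumption: this asks the running client to reread its config files.
        def reread():
            subprocess.run(["boinccmd", "--read_cc_config"], check=True)
    for n in range(1, target + 1):
        config_path.write_text(render_app_config(app_name, n))
        reread()
        if n < target:
            sleep(delay_seconds)
    # Leave N at 1 on disk so tasks start one at a time after a reboot.
    config_path.write_text(render_app_config(app_name, 1))
```

On the Linux install mentioned earlier in the thread, the project directory would be `/var/lib/boinc-client/projects/www.gpugrid.net`, so a call might look like `ramp_up("/var/lib/boinc-client/projects/www.gpugrid.net", target=10)`.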

captainjack
Send message
Joined: 9 May 13
Posts: 171
Credit: 3,570,519,747
RAC: 18,202,645
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 51228 - Posted: 9 Jan 2019 | 22:32:55 UTC

Aurum asked:

Does QC still have a problem when multiple WUs start at the same time???

Short answer is No. This was fixed several weeks ago.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 401
Credit: 16,755,010,632
RAC: 220,113
Level
Trp
Scientific publications
watwatwat
Message 51229 - Posted: 9 Jan 2019 | 23:39:06 UTC

Great. How much RAM do I need per QC WU???
Rosetta needs between 750 MB and 1 GB.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 401
Credit: 16,755,010,632
RAC: 220,113
Level
Trp
Scientific publications
watwatwat
Message 51231 - Posted: 10 Jan 2019 | 6:09:53 UTC

Is there an L3 Cache requirement for QC WUs???

E.g. at WCG the MIP project requires 4-5 MB per WU. It runs with less, but speed drops fast if the cache is overloaded.

Stefan
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 51232 - Posted: 10 Jan 2019 | 7:38:08 UTC

Today I'll upload a new batch. I have no real clue on the technical requirements though. It will be try-and-see because these are some last large molecules I need to run. After these large ones I'll send some last small and fast ones again.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 401
Credit: 16,755,010,632
RAC: 220,113
Level
Trp
Scientific publications
watwatwat
Message 51234 - Posted: 10 Jan 2019 | 12:45:49 UTC
Last modified: 10 Jan 2019 | 12:53:48 UTC

Wow! This is a major RAM hog. I just watched 6 4C WUs consume 16 GB and crash a computer. Another computer had 3 4C WUs running; it stopped one of them as "Suspended: Waiting for memory", then a second one, while the third was still using 12 of 16 GB. Then all 3 came back on and used all the memory, and then they started failing with computation errors.

We may need to reserve 3 GB per C, i.e. run only one 4C WU per 16 GB RAM.
So I'm adding this to my app_config:

<app>
<name>QC_beta</name>
<max_concurrent>1</max_concurrent>
</app>
<app_version>
<app_name>QC_beta</app_name>
<plan_class>mt</plan_class>
<avg_ncpus>4</avg_ncpus>
<cmdline>--fetch_minimal_work</cmdline>
</app_version>

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 51235 - Posted: 10 Jan 2019 | 13:23:19 UTC - in response to Message 51232.
Last modified: 10 Jan 2019 | 13:23:56 UTC

I got two of them. One took 2.5 GB, and the other 3.5 GB. Send more. I need to use the 32 GB of memory on my i7-4770.
And longer ones are OK; these took only 12.75 minutes. I have no idea about disk usage though.

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Level
Ser
Scientific publications
watwatwatwat
Message 51236 - Posted: 10 Jan 2019 | 13:36:30 UTC - in response to Message 51235.

Still beta WU for now. Disk usage may be large (and we can't tune it). Memory should be <4 GB per WU. Threads, up to 4, controlled by boinc. There will be more. Thanks!
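Taking the figures above at face value (under 4 GB of memory and up to 4 threads per WU), a rough upper bound on how many WUs a host can run at once is whichever of memory or threads runs out first. The helper below is a back-of-the-envelope sketch only; its name and parameters are illustrative, and it ignores disk space and whatever else is running on the machine.

```python
def max_concurrent_tasks(ram_gb, threads, ram_per_task_gb=4.0,
                         threads_per_task=4, reserved_threads=0):
    """Estimate how many multithreaded tasks fit on a host.

    The result is limited by memory (ram_gb / ram_per_task_gb) and by
    CPU threads left after reserving some for other work, e.g. GPU tasks.
    """
    by_memory = int(ram_gb // ram_per_task_gb)
    by_threads = (threads - reserved_threads) // threads_per_task
    return max(0, min(by_memory, by_threads))
```

For example, `max_concurrent_tasks(32, 8)` gives 2 (threads are the limit), while `max_concurrent_tasks(16, 44, reserved_threads=4)` gives 4 (memory is the limit). The result could then be used as the `<max_concurrent>` value in app_config.xml.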

PappaLitto
Send message
Joined: 21 Mar 16
Posts: 511
Credit: 4,672,242,755
RAC: 0
Level
Arg
Scientific publications
watwatwatwatwatwatwatwat
Message 51237 - Posted: 10 Jan 2019 | 13:56:14 UTC

I have some TONI WUs that have very low CPU usage and very low RAM usage.

PappaLitto
Send message
Joined: 21 Mar 16
Posts: 511
Credit: 4,672,242,755
RAC: 0
Level
Arg
Scientific publications
watwatwatwatwatwatwatwat
Message 51238 - Posted: 10 Jan 2019 | 15:11:19 UTC
Last modified: 10 Jan 2019 | 15:15:01 UTC

Another thing I noticed is that my GPU usage, normally at 100% when running GPUGrid, is now 5-15%, and the CPU is nowhere near 100% usage.

Update:

I just suspended and resumed the GPU WU and now it's back at 100% usage

Zalster
Avatar
Send message
Joined: 26 Feb 14
Posts: 211
Credit: 4,496,324,562
RAC: 0
Level
Arg
Scientific publications
watwatwatwatwatwatwatwat
Message 51239 - Posted: 10 Jan 2019 | 15:49:44 UTC - in response to Message 51232.

Today I'll upload a new batch. I have no real clue on the technical requirements though. It will be try-and-see because these are some last large molecules I need to run. After these large ones I'll send some last small and fast ones again.


No joy with my computers in getting any work. Since they are remote I can't access them to request new work units. Shouldn't be an issue with 1 machine as it has 64GB ram and 4 TB hard drive. The other has 32 GB but also a 4 TB HDD. Just waiting to see if they get any work units to crunch.
____________

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 51242 - Posted: 10 Jan 2019 | 16:20:57 UTC - in response to Message 51239.

No joy with my computers in getting any work.

Do you have "Run test applications?" selected in your preferences? I am getting them fairly regularly now.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 401
Credit: 16,755,010,632
RAC: 220,113
Level
Trp
Scientific publications
watwatwat
Message 51243 - Posted: 10 Jan 2019 | 16:49:12 UTC

Can you set it so that each Host gets one beta before a single Host gets 87 ???

Zalster
Avatar
Send message
Joined: 26 Feb 14
Posts: 211
Credit: 4,496,324,562
RAC: 0
Level
Arg
Scientific publications
watwatwatwatwatwatwatwat
Message 51246 - Posted: 10 Jan 2019 | 17:30:38 UTC - in response to Message 51242.
Last modified: 10 Jan 2019 | 17:31:01 UTC

No joy with my computers in getting any work.

Do you have "Run test applications?" selected in your preferences? I am getting them fairly regularly now.



Had the beta but not the "Run test applications" selected. Thanks for that. Made the change, will see if that works.
____________

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 51247 - Posted: 10 Jan 2019 | 18:23:22 UTC
Last modified: 10 Jan 2019 | 18:25:44 UTC

Each BOINC slot with a QC job is about 29 GB, so plan accordingly. I am running two now, which uses all eight cores and still leaves 29 GB free, so I think I am safe for now.

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Level
Ser
Scientific publications
watwatwatwat
Message 51249 - Posted: 10 Jan 2019 | 18:40:24 UTC - in response to Message 51247.

Current test WUs are limited to 30GB, but many are failing because the limit is hit. I will need to raise the limit (the production QC app is 60 GB).

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 51251 - Posted: 10 Jan 2019 | 19:09:13 UTC - in response to Message 51249.

I will need to raise the limit (the production QC app is 60 GB).

I will put in a 256 GB SSD, but I have a 500 GB sitting on the shelf too. Let us know what you need.

Ola
Send message
Joined: 8 Apr 18
Posts: 21
Credit: 1,309,700
RAC: 0
Level
Ala
Scientific publications
wat
Message 51252 - Posted: 10 Jan 2019 | 19:14:02 UTC

Two tasks "restarted" themselves and started running from the beginning when they had reached about 40%. Just before that, my CPU suddenly heated up to over 80 degrees Celsius, fortunately only for a moment.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 401
Credit: 16,755,010,632
RAC: 220,113
Level
Trp
Scientific publications
watwatwat
Message 51254 - Posted: 10 Jan 2019 | 21:26:37 UTC
Last modified: 10 Jan 2019 | 21:36:53 UTC

The default BOINC settings for disk memory are far too low for your experiment. The RAM requirements are all over the map. You're sending these jobs to Xeons with 24 to 44 logical CPUs that can only run one or two at a time.

Why don't these WUs respect the client's resources??? E.g., if I have 16 GB RAM you'll start 5 or 6 WUs and consume all RAM and the computer freezes. If I have an app_config.xml set limiting QC or QC_beta to one WU (4C needs 16 GB RAM) then you want 60 GB of disk memory which is fine if I've set BOINC high enough.

Did you think this through before you launched this mayhem??? You should be able to tell us what these requirements are before launching. 4 GB per CPU thread per WU = 16 GB RAM is 4 times higher than Rosetta and by far the highest I've seen.

This is the first time in 5 hours I've been able to access this website.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 51255 - Posted: 10 Jan 2019 | 21:34:43 UTC - in response to Message 51254.
Last modified: 10 Jan 2019 | 21:39:01 UTC

The default BOINC settings for disk memory are far too low for your experiment.

Good point. I don't remember what the defaults are, but I always increase them anyway. They should probably create a sticky on how to set up BOINC for this project. Also, a sample app_config.xml file on how to limit the number of work units operating at a time would be helpful.

Some projects allow you to limit the number of work units downloaded at a time (and even the number of cores to be used per work unit) on their preferences page. That would be even easier.

I have trouble connecting here too. It seems to be a problem for everyone outside of Spain, but especially in the U.S.

PappaLitto
Send message
Joined: 21 Mar 16
Posts: 511
Credit: 4,672,242,755
RAC: 0
Level
Arg
Scientific publications
watwatwatwatwatwatwatwat
Message 51256 - Posted: 10 Jan 2019 | 21:39:44 UTC

My windows machine is doing QC WUs, is this supposed to happen?

captainjack
Send message
Joined: 9 May 13
Posts: 171
Credit: 3,570,519,747
RAC: 18,202,645
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 51258 - Posted: 10 Jan 2019 | 22:47:08 UTC

PappaLitto said:

My windows machine is doing QC WUs, is this supposed to happen?

Could be, I also have 2 running on a Windows machine.

Zalster
Avatar
Send message
Joined: 26 Feb 14
Posts: 211
Credit: 4,496,324,562
RAC: 0
Level
Arg
Scientific publications
watwatwatwatwatwatwatwat
Message 51261 - Posted: 11 Jan 2019 | 0:02:24 UTC

So about half are failing the other half are finishing. They are using all of the threads on my CPUs. I had a max concurrent in my app_config but I think these are a different name so that limitation is being ignored. I should have tried a <project_max_concurrent> but didn't have time to play with it to see if it would restrict the number running. Guess that will have to wait until Saturday when I can re-access those machines.
____________

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 51263 - Posted: 11 Jan 2019 | 1:46:23 UTC - in response to Message 51261.
Last modified: 11 Jan 2019 | 2:23:54 UTC

So about half are failing the other half are finishing. They are using all of the threads on my CPUs.

Even with a limit on the number of work units running, some of them will fail anyway with a "196 (0xc4) EXIT_DISK_LIMIT_EXCEEDED" error. They need more disk memory than they are allowed at the moment, as Toni mentioned below.
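To see how close a running task is to that limit, the size of each BOINC slot directory can be summed up and compared with the per-task disk bound. The sketch below is illustrative only: the helper names are made up, the 30 GB default is the test-WU figure quoted in this thread, 1 GB is taken as 10^9 bytes (the server may count differently), and the slots path (for example `/var/lib/boinc-client/slots` on Linux) is an assumption about the install.

```python
import os

GB = 10 ** 9  # assumption: decimal gigabytes


def directory_size_bytes(root):
    """Return the total size in bytes of regular files under `root`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # do not count link targets
            try:
                total += os.path.getsize(path)
            except OSError:
                pass  # the file vanished while we were scanning
    return total


def slots_near_limit(slots_dir, limit_gb=30.0, warn_fraction=0.8):
    """Return [(slot_name, size_gb)] for slots above warn_fraction of the limit."""
    threshold = limit_gb * GB * warn_fraction
    flagged = []
    for entry in sorted(os.listdir(slots_dir)):
        slot_path = os.path.join(slots_dir, entry)
        if not os.path.isdir(slot_path):
            continue
        size = directory_size_bytes(slot_path)
        if size >= threshold:
            flagged.append((entry, size / GB))
    return flagged
```

Running `slots_near_limit` periodically while a WU is in progress would show whether a slot is creeping towards the bound before the task is aborted.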

Stefan
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 51272 - Posted: 11 Jan 2019 | 7:35:38 UTC
Last modified: 11 Jan 2019 | 7:35:59 UTC

Sorry to hijack this thread again. I might make a new one, but since I made the original post here: the current WUs I sent to the QC app (not the beta) should run fine. They are the same molecules as before. I changed my mind about submitting the other monsters, I'll try to run them on our local cluster first.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 51273 - Posted: 11 Jan 2019 | 9:48:32 UTC - in response to Message 51272.

I changed my mind about submitting the other monsters, I'll try to run them on our local cluster first.

Well I was just about to put in a 500 GB SSD, but I will wait until you give us the heads up.

[VENETO] boboviz
Send message
Joined: 10 Sep 10
Posts: 158
Credit: 388,132
RAC: 0
Level

Scientific publications
wat
Message 51278 - Posted: 11 Jan 2019 | 14:26:06 UTC

After some time I restarted my Linux box and... 226 MB of upload from a "Toni" quantum chemistry WU!!
Ouch

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 401
Credit: 16,755,010,632
RAC: 220,113
Level
Trp
Scientific publications
watwatwat
Message 51279 - Posted: 11 Jan 2019 | 14:31:00 UTC

The QC per core RAM load has dropped. What's today's requirement???

STE\/E
Send message
Joined: 18 Sep 08
Posts: 368
Credit: 3,225,564,275
RAC: 50,539,994
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwat
Message 51287 - Posted: 12 Jan 2019 | 9:32:19 UTC
Last modified: 12 Jan 2019 | 9:38:32 UTC

What are the requirements to receive Regular Non BETA QC Wu's ??? I have never received 1 even though I have been calling for them all along. I can get the BETA QC Wu's, so one would think I should be able to get the Regular QC Wu's. Running Windows 10 Pro with 64 GB of memory on an Intel i7 7770k CPU.

When my Box calls for Wu's I get this:

PBOYZTOYNP9873 13027 GPUGRID 1/12/2019 4:35:13 AM No tasks are available for Quantum Chemistry


Server Status shows Quantum Chemistry 125,158 Unsent ... Thanks
____________
STE\/E

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 401
Credit: 16,755,010,632
RAC: 220,113
Level
Trp
Scientific publications
watwatwat
Message 51292 - Posted: 12 Jan 2019 | 14:25:33 UTC

STEVE, Have you gone to BOINCmgr/Options/Computing Preferences/Disk and Memory??? The combination of those 3 checkboxes under Disk must yield over 60 GB of free disk space. May be 60 GB per QC WU.

In my case I was getting QCs but they seem to have stopped over night. No idea why.

How many months will it take to clear 125,000 WUs???

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 51293 - Posted: 12 Jan 2019 | 14:44:26 UTC - in response to Message 51292.
Last modified: 12 Jan 2019 | 14:47:07 UTC

The combination of those 3 checkboxes under Disk must yield over 60 GB of free disk space. May be 60 GB per QC WU.

As I read Stefan's post, they have backed off from the large ones, and have gone back to the earlier ones.
So 30 GB (per work unit) should do; that is what I have been getting.
http://www.gpugrid.net/forum_thread.php?id=4785&nowrap=true#51272

It is entirely possible that they will go to the big ones (or even bigger) in the future, unless they can figure out how to break up the work into smaller pieces.

STE\/E
Send message
Joined: 18 Sep 08
Posts: 368
Credit: 3,225,564,275
RAC: 50,539,994
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwat
Message 51294 - Posted: 12 Jan 2019 | 15:00:59 UTC - in response to Message 51292.
Last modified: 12 Jan 2019 | 15:06:56 UTC

STEVE, Have you gone to BOINCmgr/Options/Computing Preferences/Disk and Memory??? The combination of those 3 checkboxes under Disk must yield over 60 GB of free disk space. May be 60 GB per QC WU.

In my case I was getting QCs but they seem to have stopped over night. No idea why.

How many months will it take to clear 125,000 WUs???


I have a 1 Terabyte HD that I run Boinc on, there is 878gb of free space on it, there should be more than enough space to run the Wu's.

Disk: use at most 750 GB
Disk: leave free at least 1 GB (values smaller than 0.001 are ignored)
Disk: use at most 100 % of total
Tasks checkpoint to disk at most every 60 seconds
Swap space: use at most 95% of total
Memory: when computer is in use, use at most 95% of total
Memory: when computer is not in use, use at most 98% of total
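As a sanity check on these settings: as far as I understand it, the client takes the most restrictive of the three disk preferences. The sketch below encodes that reading; it is my interpretation rather than code from BOINC, and the function and parameter names are invented for illustration.

```python
def allowed_disk_gb(total_gb, free_gb, boinc_used_gb,
                    max_used_gb, min_free_gb, max_used_pct):
    """Estimate the disk space BOINC may use under the three disk preferences.

    Assumption: the effective allowance is the smallest of
      - "use at most X GB",
      - "use at most P % of total",
      - what BOINC already uses plus free space minus "leave free at least".
    """
    limit_absolute = max_used_gb
    limit_percent = total_gb * max_used_pct / 100
    limit_free = boinc_used_gb + free_gb - min_free_gb
    return max(0.0, min(limit_absolute, limit_percent, limit_free))
```

With the values above, taking the 1 TB drive as roughly 1000 GB and ignoring what BOINC already uses, `allowed_disk_gb(1000, 878, 0, 750, 1, 100)` comes out at 750 GB, well above the 30 to 60 GB per WU discussed in this thread, so on this reading disk space is unlikely to be what is blocking the tasks.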
____________
STE\/E

captainjack
Send message
Joined: 9 May 13
Posts: 171
Credit: 3,570,519,747
RAC: 18,202,645
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 51296 - Posted: 12 Jan 2019 | 15:52:39 UTC

STEVE asked,

What are the requirements to receive Regular Non BETA QC Wu's ???

As far as I know, regular non BETA QC WU's are still Linux only. I am getting plenty on my Linux box, none on a Windows box.

STE\/E
Send message
Joined: 18 Sep 08
Posts: 368
Credit: 3,225,564,275
RAC: 50,539,994
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwat
Message 51297 - Posted: 12 Jan 2019 | 16:05:06 UTC - in response to Message 51296.

STEVE asked,

What are the requirements to receive Regular Non BETA QC Wu's ???

As far as I know, regular non BETA QC WU's are still Linux only. I am getting plenty on my Linux box, none on a Windows box.


Yeah that's what I was afraid of, guess they don't need the Wu's run very fast if it's Linux only. Why would they test the BETA Wu's on Windows Machines but not run the Regular Wu's on them, weird ...

____________
STE\/E

captainjack
Send message
Joined: 9 May 13
Posts: 171
Credit: 3,570,519,747
RAC: 18,202,645
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 51298 - Posted: 12 Jan 2019 | 16:44:13 UTC

STEVE,

I prefer this line of thinking: general business users usually run Windows, graphic artists and publishers usually run Macs, and scientists usually run Linux. Almost all of the top 500 supercomputers in the world run on Linux. I'm guessing that most scientific research applications run natively on Linux. It was easier for the GPUGRID scientists to get the QC CPU app running on Linux PCs, so that was the initial release.

They tried to get the QC CPU to run on Windows a few months ago using the Windows Subsystem for Linux (WSL) but ran into problems with inconsistency between Windows releases. This is their next attempt to get the QC app to run on Windows. Looks like this one might work when they get the parameters set up for the size of tasks involved.

Zalster
Avatar
Send message
Joined: 26 Feb 14
Posts: 211
Credit: 4,496,324,562
RAC: 0
Level
Arg
Scientific publications
watwatwatwatwatwatwatwat
Message 51301 - Posted: 12 Jan 2019 | 22:10:05 UTC

Well I tried to install a <project_max_concurrent> into the app_config but the beta work units just ignore that restriction. They were utilizing all 20 threads on the computer and starving the GPUs for resources. So I've stopped the beta for now on the machines. I have to rebuild a computer anyway this week so I think I will strip out the GPUs and leave the 16 thread CPU and install a 4 TB HDD. That should be enough I think for the beta work units and I will leave that machine only for those work units.
____________

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 51302 - Posted: 12 Jan 2019 | 22:45:39 UTC - in response to Message 51301.

Well I tried to install a <project_max_concurrent> into the app_config but the beta work units just ignore that restriction.

You could try:
<app_config>
<app>
<name>QC_beta</name>
<max_concurrent>4</max_concurrent>
</app>
</app_config>


That has worked for me in the past.

[CSF] Aleksey Belkov
Send message
Joined: 26 Dec 13
Posts: 86
Credit: 1,271,845,786
RAC: 510,000
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 51303 - Posted: 12 Jan 2019 | 23:07:37 UTC - in response to Message 51302.

Jim1348 wrote:
That has worked for me in the past.

Yesterday I tested Quantum Chemistry Beta for Windows, and the QC_beta app name works for limiting the number of WUs computing concurrently.

Aurum
Joined: 12 Jul 17
Posts: 401
Credit: 16,755,010,632
RAC: 220,113
Level
Trp
Message 51306 - Posted: 13 Jan 2019 | 13:20:42 UTC

Using <project_max_concurrent> is a problem since it counts all GG WUs. Also, they ran out of GPU WUs yesterday. The control of CPU & GPU projects should be completely separate.

<app_config>
  <app>
    <name>acemdlong</name>
    <gpu_versions>
      <cpu_usage>1.0</cpu_usage>
      <gpu_usage>1.0</gpu_usage>
    </gpu_versions>
    <fraction_done_exact>1</fraction_done_exact>
  </app>
  <app>
    <name>acemdshort</name>
    <gpu_versions>
      <cpu_usage>1.0</cpu_usage>
      <gpu_usage>1.0</gpu_usage>
    </gpu_versions>
    <fraction_done_exact>1</fraction_done_exact>
  </app>
  <app>
    <name>QC</name>
    <max_concurrent>8</max_concurrent>
    <fraction_done_exact>1</fraction_done_exact>
  </app>
  <app_version>
    <app_name>QC</app_name>
    <plan_class>mt</plan_class>
    <avg_ncpus>4</avg_ncpus>
  </app_version>
  <app>
    <name>QC_beta</name>
    <max_concurrent>1</max_concurrent>
    <fraction_done_exact>1</fraction_done_exact>
  </app>
  <app_version>
    <app_name>QC_beta</app_name>
    <plan_class>mt</plan_class>
    <avg_ncpus>4</avg_ncpus>
  </app_version>
</app_config>
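
For anyone sizing a config like this against their core count, here is a rough Python sketch (not part of BOINC) that multiplies each app's max_concurrent by its avg_ncpus to get a worst-case thread demand. It assumes every task really uses avg_ncpus threads, which the beta WUs reportedly did not always respect. The helper name thread_budget and the trimmed-down config string are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed-down copy of the CPU part of the config above (illustrative only).
APP_CONFIG = """
<app_config>
  <app><name>QC</name><max_concurrent>8</max_concurrent></app>
  <app_version>
    <app_name>QC</app_name><plan_class>mt</plan_class><avg_ncpus>4</avg_ncpus>
  </app_version>
  <app><name>QC_beta</name><max_concurrent>1</max_concurrent></app>
  <app_version>
    <app_name>QC_beta</app_name><plan_class>mt</plan_class><avg_ncpus>4</avg_ncpus>
  </app_version>
</app_config>
"""

def thread_budget(xml_text):
    """Hypothetical helper: worst-case threads per app = max_concurrent * avg_ncpus."""
    root = ET.fromstring(xml_text)
    # Per-app task limits from <app> blocks that set <max_concurrent>.
    limits = {
        app.findtext("name"): int(app.findtext("max_concurrent"))
        for app in root.findall("app")
        if app.findtext("max_concurrent") is not None
    }
    # Threads per task from <app_version> blocks that set <avg_ncpus>.
    ncpus = {
        ver.findtext("app_name"): float(ver.findtext("avg_ncpus"))
        for ver in root.findall("app_version")
        if ver.findtext("avg_ncpus") is not None
    }
    return {name: limits[name] * ncpus[name] for name in limits if name in ncpus}

budget = thread_budget(APP_CONFIG)
print(budget)  # {'QC': 32.0, 'QC_beta': 4.0}
```

With these numbers QC alone could ask for 32 threads, so on a 16- or 20-thread box a lower max_concurrent would be needed to leave room for the GPU tasks.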

Zalster
Joined: 26 Feb 14
Posts: 211
Credit: 4,496,324,562
RAC: 0
Level
Arg
Message 51309 - Posted: 13 Jan 2019 | 14:32:39 UTC - in response to Message 51302.

Well I tried to install a <project_max_concurrent> into the app_config but the beta work units just ignore that restriction.

You could try:
<app_config>
  <app>
    <name>QC_beta</name>
    <max_concurrent>4</max_concurrent>
  </app>
</app_config>


That has worked for me in the past.


Thanks Jim, I'll give that a shot later this week when I get back to those computers. Since there doesn't appear to be any more beta work at the moment, I'll put that in the app_config for when and if the betas return.

Ola
Joined: 8 Apr 18
Posts: 21
Credit: 1,309,700
RAC: 0
Level
Ala
Message 51312 - Posted: 13 Jan 2019 | 22:16:34 UTC

Two tasks have been frozen at 1,098%, even though they reached that point very quickly. I don't think the app is good yet...

STE\/E
Joined: 18 Sep 08
Posts: 368
Credit: 3,225,564,275
RAC: 50,539,994
Level
Arg
Message 51316 - Posted: 14 Jan 2019 | 9:58:27 UTC - in response to Message 51312.

Two tasks have been frozen at 1,098%, even though they reached that point very quickly. I don't think the app is good yet...


None of the beta WUs freeze up on my box, but it's 50/50 whether they get a computation error or not. They all seem to run for 30 minutes even if they get the computation error. Some finish okay but show up as invalid here at the site ...

____________
STE\/E

[VENETO] boboviz
Joined: 10 Sep 10
Posts: 158
Credit: 388,132
RAC: 0
Level

Message 51324 - Posted: 15 Jan 2019 | 15:40:08 UTC - in response to Message 51312.

Two tasks have been frozen at 1,098%, even though they reached that point very quickly. I don't think the app is good yet...


My "_TONI_" WUs for Linux are stuck at 10% and the time remaining keeps growing.

Zalster
Joined: 26 Feb 14
Posts: 211
Credit: 4,496,324,562
RAC: 0
Level
Arg
Message 51382 - Posted: 27 Jan 2019 | 2:39:20 UTC

Off to a rough start, but I think I got the kinks worked out. Got my 16-thread i7 on an Alphacool 360 mm radiator. Running 3 work units at a time with 4 threads each, holding around 61-62C. Will see how it runs over the next 24 hours.

Zalster
Joined: 26 Feb 14
Posts: 211
Credit: 4,496,324,562
RAC: 0
Level
Arg
Message 51464 - Posted: 11 Feb 2019 | 16:26:57 UTC - in response to Message 51382.

So this is the biggest one that I've crunched so far.



Zalster
Joined: 26 Feb 14
Posts: 211
Credit: 4,496,324,562
RAC: 0
Level
Arg
Message 51481 - Posted: 13 Feb 2019 | 16:28:51 UTC - in response to Message 51464.
Last modified: 13 Feb 2019 | 16:29:51 UTC

Starting to see a lot of big ones lately, but still none of those giant ones that we had a few months ago.

Task: 20239542
Work unit: 15931887
Sent: 13 Feb 2019 | 4:52:27 UTC
Time reported: 13 Feb 2019 | 10:28:09 UTC
Status: Completed and validated
Run time (sec): 5,123.56
CPU time (sec): 19,627.41
Credit: 1,235.32
Application: Quantum Chemistry v3.31 (mt)

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Message 51482 - Posted: 13 Feb 2019 | 17:08:41 UTC - in response to Message 51481.

Starting to see a lot of big ones lately, but still none of those giant ones that we had a few months ago.

It is not as much fun working on the small ones. Maybe when they get the Windows app developed, they will have enough users that they can offer "small" and "large" work units, as defined by disk requirements (or even main memory).

rbpeake
Joined: 30 Jul 08
Posts: 17
Credit: 80,343,188
RAC: 0
Level
Thr
Message 51483 - Posted: 13 Feb 2019 | 20:40:10 UTC - in response to Message 51482.

Starting to see a lot of big ones lately, but still none of those giant ones that we had a few months ago.

It is not as much fun working on the small ones. Maybe when they get the Windows app developed, they will have enough users that they can offer "small" and "large" work units, as defined by disk requirements (or even main memory).


Are they working on a Windows app?

Thanks!

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Message 51484 - Posted: 13 Feb 2019 | 23:01:28 UTC - in response to Message 51483.

Are they working on a Windows app?

Very much so. I think they are making progress.
http://www.gpugrid.net/forum_thread.php?id=4790

Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 51489 - Posted: 14 Feb 2019 | 15:49:18 UTC - in response to Message 51249.

Current test WUs are limited to 30GB, but many are failing because the limit is hit. I will need to raise the limit (the production QC app is 60 GB).

Wow, these things are going to be rough on SSD drive life. How large are the uploads and downloads?

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Message 51492 - Posted: 14 Feb 2019 | 17:05:40 UTC - in response to Message 51489.

Wow, these things are going to be rough on SSD drive life. How large are the uploads and downloads?

They have gone back to small ones. Running 11 of them on my i7-8770 (single core), my project folder is only 2.1 GB. It is no fun at all.

Zalster
Joined: 26 Feb 14
Posts: 211
Credit: 4,496,324,562
RAC: 0
Level
Arg
Message 51493 - Posted: 14 Feb 2019 | 17:51:12 UTC - in response to Message 51489.

Current test WUs are limited to 30GB, but many are failing because the limit is hit. I will need to raise the limit (the production QC app is 60 GB).

Wow, these things are going to be rough on SSD drive life. How large are the uploads and downloads?



You could do like I did and install an HDD. Those are fairly reasonable in price. I still have the OS on the SSD but have BOINC installed on the HDD. Or if it's a new system, just install an HDD. I think I got a 4 TB HDD for like $110 USD.

biodoc
Joined: 26 Aug 08
Posts: 183
Credit: 10,085,929,375
RAC: 2,083,841
Level
Trp
Message 51495 - Posted: 15 Feb 2019 | 18:23:54 UTC

I've had a few "error while computing" failures for QC tasks on Linux recently. I had 4 on Feb 14th and 3 on Feb 13th, but none yet today.

<core_client_version>7.9.3</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
17:04:03 (7591): wrapper (7.7.26016): starting
17:04:03 (7591): wrapper (7.7.26016): starting
17:04:03 (7591): wrapper: running /usr/bin/flock (/var/lib/boinc-client/projects/www.gpugrid.net/miniconda.lock -c "/bin/bash ./miniconda-installer.sh -b -u -p /var/lib/boinc-client/projects/www.gpugrid.net/miniconda &&
/var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/conda install -m -y -p qmml3 --override-channels -c defaults -c gpugrid --file requirements.txt ")
Python 3.6.5 :: Anaconda, Inc.

# >>>>>>>>>>>>>>>>>>>>>> ERROR REPORT <<<<<<<<<<<<<<<<<<<<<<

`$ /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/conda install -m -y -p qmml3 --override-channels -c defaults -c gpugrid --file requirements.txt`

environment variables:
CIO_TEST=<not set>
CONDA_ROOT=/var/lib/boinc-client/projects/www.gpugrid.net/miniconda
PATH=/usr/bin:/bin
REQUESTS_CA_BUNDLE=<not set>
SSL_CERT_FILE=<not set>

active environment : None
user config file : /var/lib/boinc-client/.condarc
populated config files :
conda version : 4.5.4
conda-build version : not installed
python version : 3.6.5.final.0
base environment : /var/lib/boinc-client/projects/www.gpugrid.net/miniconda (writable)
channel URLs : https://repo.anaconda.com/pkgs/main/linux-64
https://repo.anaconda.com/pkgs/main/noarch
https://repo.anaconda.com/pkgs/free/linux-64
https://repo.anaconda.com/pkgs/free/noarch
https://repo.anaconda.com/pkgs/r/linux-64
https://repo.anaconda.com/pkgs/r/noarch
https://repo.anaconda.com/pkgs/pro/linux-64
https://repo.anaconda.com/pkgs/pro/noarch
https://conda.anaconda.org/gpugrid/linux-64
https://conda.anaconda.org/gpugrid/noarch
package cache : /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/pkgs
/var/lib/boinc-client/.conda/pkgs
envs directories : /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/envs
/var/lib/boinc-client/.conda/envs
platform : linux-64
user-agent : conda/4.5.4 requests/2.18.4 CPython/3.6.5 Linux/4.15.0-45-generic linuxmint/19.1 glibc/2.27
UID:GID : 122:129
netrc file : None
offline mode : False


V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V

CondaHTTPError: HTTP 504 GATEWAY_TIMEOUT for url <https://conda.anaconda.org/gpugrid/linux-64/repodata.json>
Elapsed: 00:59.680897
CF-RAY: 4a92d5abb8af46f2-EWR

A remote server error occurred when trying to retrieve this URL.

A 500-type error (e.g. 500, 501, 502, 503, etc.) indicates the server failed to
fulfill a valid request. The problem may be spurious, and will resolve itself if you
try your request again. If the problem persists, consider notifying the maintainer
of the remote server.

A reportable application error has occurred. Conda has prepared the above report.
Upload successful.
17:06:11 (7591): /usr/bin/flock exited; CPU time 14.574405
17:06:11 (7591): app exit status: 0x1
17:06:11 (7591): called boinc_finish(195)

</stderr_txt>
]]>

Zalster
Joined: 26 Feb 14
Posts: 211
Credit: 4,496,324,562
RAC: 0
Level
Arg
Message 51497 - Posted: 15 Feb 2019 | 21:41:43 UTC - in response to Message 51495.

Yeah, I got a bunch of those as well. I think it's having an issue either downloading or connecting to get the required data. It doesn't happen often, but when it does, it usually results in numerous errors.

tullio
Joined: 8 May 18
Posts: 190
Credit: 104,426,808
RAC: 0
Level
Cys
Message 51574 - Posted: 25 Feb 2019 | 17:52:44 UTC

I am running QC on an HP laptop with an E-450 CPU, 8 GB RAM, and a 1 TB hard disk which has an SSD partition of 8 GB. No problems so far, after two SSD disks on the same laptop failed one after the other running BOINC tasks.
Tullio

Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Message 51575 - Posted: 25 Feb 2019 | 18:34:20 UTC - in response to Message 51495.

A remote server error occurred when trying to retrieve this URL.

A 500-type error (e.g. 500, 501, 502, 503, etc.) indicates the server failed to
fulfill a valid request. The problem may be spurious, and will resolve itself if you
try your request again. If the problem persists, consider notifying the maintainer
of the remote server.

I get around a 1% to 2% error rate on those, so it happens rarely. But they usually complete on other machines, so it is not the work units but the communications somewhere.
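
Since conda's own message says a 5xx error may clear up on a retry, here is a minimal Python sketch of retry with exponential backoff. It is only an illustration of the general idea, not how the GPUGRID wrapper or conda behaves. TransientServerError, retry, and flaky_fetch are hypothetical names, and the fake fetch stands in for the repodata download so the example runs without a network.

```python
import time

class TransientServerError(Exception):
    """Hypothetical stand-in for a 5xx response such as HTTP 504."""

def retry(fetch, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on a transient error wait 1x, 2x, 4x... base_delay and try again."""
    for attempt in range(attempts):
        try:
            return fetch()
        except TransientServerError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the error surface
            sleep(base_delay * 2 ** attempt)

# Demo: a fake download that times out twice, then succeeds.
calls = {"n": 0}

def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientServerError("504 GATEWAY_TIMEOUT")
    return "repodata"

delays = []  # record the waits instead of actually sleeping
result = retry(flaky_fetch, sleep=delays.append)
print(result, delays)  # repodata [1.0, 2.0]
```

Passing sleep as a parameter is just a convenience so the waits can be recorded in a test; in real use the default time.sleep would apply.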

Zalster
Joined: 26 Feb 14
Posts: 211
Credit: 4,496,324,562
RAC: 0
Level
Arg
Message 51698 - Posted: 15 Apr 2019 | 21:01:22 UTC

Upgraded my main CPU cruncher to a 420 mm rad with push-pull and a 4 TB HDD. Need to tweak it just a bit, but it seems stable at 5 work units with 4 threads apiece. Would include a pic but it would take up a lot of space.
