
Message boards : Multicore CPUs : New QC app

Toni
Volunteer moderator
Project administrator
Project developer
Project scientist
Send message
Joined: 9 Dec 08
Posts: 732
Credit: 4,285,282
RAC: 0
Level
Ala
Scientific publications
watwatwatwat
Message 49759 - Posted: 2 Jul 2018 | 10:18:54 UTC
Last modified: 2 Jul 2018 | 10:52:36 UTC

Dears, after a hot weekend during which I accidentally cancelled QC WUs, we are ready to start again with a new app. As soon as we get it right, we should be able to run on more machines (gcc no longer a requirement).

There will be a largish download the first time you run app 329. If you want to free up some disk space, please reset the project (recommended, but not urgent).

kain
Send message
Joined: 3 Sep 14
Posts: 139
Credit: 217,157,977
RAC: 236,869
Level
Leu
Scientific publications
watwatwatwatwat
Message 49760 - Posted: 2 Jul 2018 | 11:29:25 UTC

Ready for action ;)

tullio
Send message
Joined: 8 May 18
Posts: 102
Credit: 8,462,438
RAC: 88,166
Level
Ser
Scientific publications
wat
Message 49766 - Posted: 2 Jul 2018 | 16:30:41 UTC

First 330 task completed and validated. The second one is waiting for memory alongside a GPU task, even though that task is using only 4% of the total 8 GB of RAM.
Tullio

Stefan
Volunteer moderator
Project developer
Project scientist
Send message
Joined: 5 Mar 13
Posts: 311
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 49778 - Posted: 4 Jul 2018 | 8:47:07 UTC - in response to Message 49766.

I will let the WUs run out for a day because I want to see if something weird is happening on my side (the WUs are calculating fine, don't worry). I'll submit more once they are completed tomorrow.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 883
Credit: 1,726,038,170
RAC: 1,104,572
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 50349 - Posted: 30 Aug 2018 | 13:43:42 UTC

@ Toni, @ Stefan

There's an error report in Number Crunching (This computer has finished a daily quota of 31 tasks) which suggests that the maximum upload size for a batch of QC tasks has been set too low.

Task name is 6955_1_15_16_18_dd130713_n00001-SDOERR_SELE2-0-1-RND2528

Toni
Volunteer moderator
Project administrator
Project developer
Project scientist
Send message
Joined: 9 Dec 08
Posts: 732
Credit: 4,285,282
RAC: 0
Level
Ala
Scientific publications
watwatwatwat
Message 50404 - Posted: 5 Sep 2018 | 14:34:10 UTC - in response to Message 50349.
Last modified: 5 Sep 2018 | 14:34:49 UTC

I think the actual error is a segmentation fault, which leaves large temporary files behind, and their transfer is attempted. Let's see if the situation improves with the new version.

If in doubt, please reset the project.

Zalster
Avatar
Send message
Joined: 26 Feb 14
Posts: 57
Credit: 2,019,205,574
RAC: 4,624,706
Level
Phe
Scientific publications
watwatwat
Message 50412 - Posted: 6 Sep 2018 | 5:32:45 UTC - in response to Message 50404.
Last modified: 6 Sep 2018 | 5:33:01 UTC

Just curious: is there someplace that shows how well the QC apps are running overall, like the server status page does for the GPU apps? I can look at my own machines and see the errors there, but is there an overall view somewhere?

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 883
Credit: 1,726,038,170
RAC: 1,104,572
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 50415 - Posted: 6 Sep 2018 | 8:25:58 UTC - in response to Message 50404.

I think the actual error is a segmentation fault, which leaves large temporary files behind, and their transfer is attempted. Let's see if the situation improves with the new version.

BOINC won't attempt to upload a temporary file unless its name is specified with an upload URL in the workunit template.

I'll be able to advise better when you release the Windows app, and I can see any problems happening on my own machines.
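For the record, that can be checked on any Linux host: the <file_info> entries the scheduler sends (generated from those templates) end up in client_state.xml, so the entry for an upload file should show both its upload URL and its <max_nbytes> limit. A minimal sketch, assuming the default data directory /var/lib/boinc-client; the file name is a placeholder:

FILE=name_from_the_upload_error     # placeholder: paste the exact upload file name here
grep -B 2 -A 10 "<name>$FILE</name>" /var/lib/boinc-client/client_state.xml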

Toni
Volunteer moderator
Project administrator
Project developer
Project scientist
Send message
Joined: 9 Dec 08
Posts: 732
Credit: 4,285,282
RAC: 0
Level
Ala
Scientific publications
watwatwatwat
Message 50417 - Posted: 6 Sep 2018 | 8:46:57 UTC - in response to Message 50415.

May well be restricted to a few machines.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 883
Credit: 1,726,038,170
RAC: 1,104,572
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 50418 - Posted: 6 Sep 2018 | 9:42:25 UTC - in response to Message 50417.

May well be restricted to a few machines.

If you want to count me in, I'll do my best to report on any issues that may arise (in beta mode if necessary). I'm primarily Windows 7, so the WSL approach would be difficult except for one dual-boot test machine with Windows 10 ready to run.

tullio
Send message
Joined: 8 May 18
Posts: 102
Credit: 8,462,438
RAC: 88,166
Level
Ser
Scientific publications
wat
Message 50419 - Posted: 6 Sep 2018 | 10:21:20 UTC

My main Linux box is crunching SELE6 on its Opteron 1210. If necessary, I have a Windows 10 PC with an AMD A10-6700 and 22 GB RAM.
Tullio
____________

[VENETO] boboviz
Send message
Joined: 10 Sep 10
Posts: 96
Credit: 252,222
RAC: 512
Level

Scientific publications
wat
Message 50420 - Posted: 6 Sep 2018 | 12:24:27 UTC

again

<message>
upload failure: <file_xfer_error>
<file_name>5516_14_15_18_19_8125b500_n00001-SDOERR_SELE6-0-1-RND8343_0_1</file_name>
<error_code>-131 (file size too big)</error_code>
</file_xfer_error>


Do I have to reset the project?

Profile Chilean
Avatar
Send message
Joined: 8 Oct 12
Posts: 86
Credit: 151,726,480
RAC: 165,117
Level
Ile
Scientific publications
watwatwatwatwatwatwatwatwat
Message 50421 - Posted: 6 Sep 2018 | 12:26:27 UTC

The CPU apps are not working very well: lots of idle time. It may be because of the 48 threads... some of the python processes use 4 cores, the rest just 1, and not all the time. It's got to be an I/O issue (?).
____________

Toni
Volunteer moderator
Project administrator
Project developer
Project scientist
Send message
Joined: 9 Dec 08
Posts: 732
Credit: 4,285,282
RAC: 0
Level
Ala
Scientific publications
watwatwatwat
Message 50422 - Posted: 6 Sep 2018 | 13:18:05 UTC - in response to Message 50421.

Each task should use 4 threads max.

[VENETO] boboviz
Send message
Joined: 10 Sep 10
Posts: 96
Credit: 252,222
RAC: 512
Level

Scientific publications
wat
Message 50423 - Posted: 6 Sep 2018 | 15:16:03 UTC - in response to Message 50420.

again
<message>
upload failure: <file_xfer_error>
<file_name>5516_14_15_18_19_8125b500_n00001-SDOERR_SELE6-0-1-RND8343_0_1</file_name>
<error_code>-131 (file size too big)</error_code>
</file_xfer_error>


Do I have to reset the project?


Another clue: this happens when I reboot the virtual machine (and the WU restarts).

Jim1348
Send message
Joined: 28 Jul 12
Posts: 614
Credit: 1,199,451,727
RAC: 134,958
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 50424 - Posted: 6 Sep 2018 | 17:11:33 UTC
Last modified: 6 Sep 2018 | 17:13:33 UTC

I have had good luck for the past day running QC on my i7-4770 (Ubuntu 16.04). That doesn't prove much, except that there is no fatal flaw affecting all the work units.

And I am limiting them to two cores per work unit, and three work units at a time. That gives me essentially the same output as four cores on two work units at a time, but leaves a little more CPU support for my GTX 1070 on Folding. All in all, it seems to be working fine.
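For anyone who wants to try a similar split, one way to do it is with an app_config.xml in the project directory. This is a sketch only: the app name below is a placeholder (copy the real one from the <app> entries in client_state.xml), the plan class is an assumption, and the project directory name may differ on your install.

cat > /var/lib/boinc-client/projects/www.gpugrid.net/app_config.xml <<'EOF'
<app_config>
    <app>
        <name>QC_APP_NAME</name>            <!-- placeholder: copy the real name from client_state.xml -->
        <max_concurrent>3</max_concurrent>  <!-- at most three QC work units at a time -->
    </app>
    <app_version>
        <app_name>QC_APP_NAME</app_name>
        <plan_class>mt</plan_class>         <!-- assumption: the QC app uses an mt plan class -->
        <avg_ncpus>2</avg_ncpus>            <!-- schedule two CPUs per work unit -->
    </app_version>
</app_config>
EOF
# then use "Options > Read config files" in the BOINC Manager (or restart the client);
# whether the app really drops to two threads is up to the app itself.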

(boboviz - I wouldn't draw conclusions from virtual machines. Even LHC has a hard time with their own stuff.)

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 883
Credit: 1,726,038,170
RAC: 1,104,572
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 50425 - Posted: 6 Sep 2018 | 17:32:31 UTC - in response to Message 50423.

(and the WU restarts)

I think that is indeed a clue. One of the mechanisms I considered was a WU re-starting, and appending a second result to an existing file, doubling its size.

Checking the 'headroom' between the typical result file size and <max_nbytes> was one of the tests I had in mind for the Windows version. Can anyone comment?
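For anyone on Linux who wants to look at that headroom on their own host now, a minimal sketch (assuming the standard data directory /var/lib/boinc-client and the usual www.gpugrid.net project directory; the file name is a placeholder taken from the upload error):

cd /var/lib/boinc-client/projects/www.gpugrid.net
FILE=name_from_the_file_xfer_error      # placeholder: paste the <file_name> from the error message
ls -l "$FILE"                           # actual size of the output file waiting for upload
grep -A 6 "<name>$FILE</name>" ../../client_state.xml | grep max_nbytes   # the limit it was given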

tullio
Send message
Joined: 8 May 18
Posts: 102
Credit: 8,462,438
RAC: 88,166
Level
Ser
Scientific publications
wat
Message 50426 - Posted: 6 Sep 2018 | 18:15:16 UTC

Most LHC users are Windows users and they run Scientific Linux research programs from CERN (not BOINC programs) using Virtual Machines.
Tullio

Jim1348
Send message
Joined: 28 Jul 12
Posts: 614
Credit: 1,199,451,727
RAC: 134,958
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 50427 - Posted: 7 Sep 2018 | 11:41:06 UTC
Last modified: 7 Sep 2018 | 11:42:54 UTC

This is interesting. Twice overnight my PC crashed. Shut down. Didn't run.
But each time, I saw no errors in the BoincTasks log, or in the Folding log either. The Folding work unit just continued from where it left off.

But each time, a new set of (three) QC tasks started after starting up the PC.

And now I see the same old error message in the stderr.txt file:

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>2360_16_18_20_21_e0c95459_n00001-SDOERR_SELE6-0-1-RND3935_0_1</file_name>
<error_code>-131 (file size too big)</error_code>
</file_xfer_error>
</message>

http://www.gpugrid.net/result.php?resultid=18686175

The messages don't appear immediately after the crashes, but a few hours later.
And I just attached this machine to GPUGrid a couple of days ago. So it didn't take long.

[VENETO] boboviz
Send message
Joined: 10 Sep 10
Posts: 96
Credit: 252,222
RAC: 512
Level

Scientific publications
wat
Message 50430 - Posted: 7 Sep 2018 | 13:45:25 UTC - in response to Message 50424.

(boboviz - I wouldn't draw conclusions from virtual machines. Even LHC has a hard time with their own stuff.)


Do you think I have to use a physical Linux machine?

Jim1348
Send message
Joined: 28 Jul 12
Posts: 614
Credit: 1,199,451,727
RAC: 134,958
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 50431 - Posted: 7 Sep 2018 | 13:59:07 UTC - in response to Message 50430.

(boboviz - I wouldn't draw conclusions from virtual machines. Even LHC has a hard time with their own stuff.)


Do you think I have to use a physical Linux machine?

Don't know. Real Linux hasn't fixed the problem for me. But the VMs just add another layer of complexity and, from what I understand (I am not an expert), hide various other problems. At least that is what they say on the LHC forum, where they would like to get away from VirtualBox if possible. That is why they developed native ATLAS, and they would like to do the same for the other projects if it were possible.

tullio
Send message
Joined: 8 May 18
Posts: 102
Credit: 8,462,438
RAC: 88,166
Level
Ser
Scientific publications
wat
Message 50432 - Posted: 7 Sep 2018 | 14:42:04 UTC - in response to Message 50431.
Last modified: 7 Sep 2018 | 14:48:30 UTC

I am running GPUGRID, both CPU and GPU, on a SuSE Linux box with a GTX 750 Ti board. I am running Atlas@Home from LHC on a Windows 10 PC with 4 cores (though Task Manager reports two cores and 4 logical processors on an AMD A10-6700 CPU). It has a GTX 1050 Ti board, but GPUGRID overheats it to 80 °C and it crashes, so on that PC I am running Atlas (no GPU, but VirtualBox 5.2.18), Einstein@home and SETI@home, both CPU and GPU.
Tullio
Atlas native runs only on Ubuntu Linux; it does not run on my SuSE Linux nor on Windows.
____________

Toni
Volunteer moderator
Project administrator
Project developer
Project scientist
Send message
Joined: 9 Dec 08
Posts: 732
Credit: 4,285,282
RAC: 0
Level
Ala
Scientific publications
watwatwatwat
Message 50433 - Posted: 7 Sep 2018 | 17:25:03 UTC - in response to Message 50432.

Restarts may be a problem. It's not really the output file size (which should be small), but the fact that temporary files are not deleted as a consequence of some other error.

Can someone reliably reproduce the problem (i.e., by stopping and restarting a WU)? If so, is the problem solved by enabling the "leave applications in memory" option?
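For anyone who wants to try that in a controlled way, boinccmd can stop and restart a single task. A sketch only: the task name is a placeholder, and the project URL should be whatever boinccmd --get_project_status reports for GPUGRID.

# run in the BOINC data directory so boinccmd can find the GUI RPC password
boinccmd --get_tasks | grep -E 'name:|active_task_state'    # find a running QC task
TASK=name_of_a_running_QC_task                              # placeholder
boinccmd --task http://www.gpugrid.net/ "$TASK" suspend
sleep 60    # long enough for the app to be removed from memory if the option is off
boinccmd --task http://www.gpugrid.net/ "$TASK" resume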

Jim1348
Send message
Joined: 28 Jul 12
Posts: 614
Credit: 1,199,451,727
RAC: 134,958
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 50434 - Posted: 7 Sep 2018 | 17:59:00 UTC - in response to Message 50433.
Last modified: 7 Sep 2018 | 18:32:00 UTC

If so, is the problem solved by enabling the "leave applications in memory" option?

I don't think I can reliably reproduce it, but I have "Leave application in memory" enabled (as is my usual practice), and that does not prevent it.

EDIT: Also, I should point out that there were no other BOINC applications running, and my machine runs 24/7, so the QC work units were never being suspended anyway.

Zalster
Avatar
Send message
Joined: 26 Feb 14
Posts: 57
Credit: 2,019,205,574
RAC: 4,624,706
Level
Phe
Scientific publications
watwatwat
Message 50448 - Posted: 8 Sep 2018 | 16:46:20 UTC
Last modified: 8 Sep 2018 | 16:47:25 UTC

So I'm curious. When we see the results for a QC work unit, they list both a run time and a CPU time. Since we are used to run time and CPU time being a linear measurement, I was wondering if CPU time for QC is actually a combined total for all the CPU threads being used.

I've worked out how long the CPU time is and it's much higher than the actual run time I'm seeing on the machine. What would make sense is if it's the sum of 4 threads all running at the same time for a set amount of time.

CPU time = N threads x actual run time

Is this correct? Mostly for my own curiosity.
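A quick way to check it from the two numbers themselves (example values only, nothing project-specific):

run_time=1800     # seconds, as shown on the task's result page (example value)
cpu_time=6900     # seconds, from the same page (example value)
awk -v r="$run_time" -v c="$cpu_time" 'BEGIN { printf "average busy threads: %.2f\n", c / r }'
# a result close to 4 means the reported CPU time is indeed summed over the 4 threads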
____________

PappaLitto
Send message
Joined: 21 Mar 16
Posts: 399
Credit: 2,736,116,243
RAC: 1,021,958
Level
Phe
Scientific publications
watwat
Message 50449 - Posted: 8 Sep 2018 | 16:58:35 UTC

Since CPU utilization is so low on these WUs, I presume most of the run time does not count as CPU time because the CPU is waiting on the hard drive.
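One generic way to check whether the tasks really are I/O-bound (iostat comes with the sysstat package on most distros; nothing here is project-specific):

iostat -x 5 3    # three reports at 5-second intervals; high %iowait and a busy %util on the BOINC disk point to I/O waits
# the 'wa' column in top, or the 'b' column in vmstat 5, tells a similar story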

Jim1348
Send message
Joined: 28 Jul 12
Posts: 614
Credit: 1,199,451,727
RAC: 134,958
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 50451 - Posted: 9 Sep 2018 | 14:22:48 UTC
Last modified: 9 Sep 2018 | 14:58:04 UTC

I thought I would try my Ryzen 1700 on QC (Ubuntu 18.04), to see if it would behave differently than my Intel machines (i7-4770, i7-8700).
The first work unit errored:

CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://repo.anaconda.com/pkgs/r/noarch/repodata.json.bz2>
Elapsed: -

An HTTP error occurred when trying to retrieve this URL.
HTTP errors are often intermittent, and a simple retry will get you on your way.

If your current network has https://www.anaconda.com blocked, please file
a support request with your network engineering team.

ConnectionError(MaxRetryError("HTTPSConnectionPool(host='repo.anaconda.com', port=443): Max retries exceeded with url: /pkgs/r/noarch/repodata.json.bz2 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f60af4b15c0>: Failed to establish a new connection: [Errno -2] Name or service not known',))",),)

A reportable application error has occurred. Conda has prepared the above report.
Upload did not complete.
10:07:33 (1830): /usr/bin/flock exited; CPU time 11.796829
10:07:33 (1830): app exit status: 0x1
10:07:33 (1830): called boinc_finish(195)


This may be a somewhat different error message than the others, but it seems to me that they are all communications-related. I suspect it has to do with the intermittent connections I have been getting to GPUGrid for the past several weeks/months, as previously discussed.
http://www.gpugrid.net/forum_thread.php?id=4806

EDIT: The next two are running OK, and it looks like they will complete normally. It is a very intermittent problem.
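For what it's worth, the failing URL can be probed outside BOINC to tell a local DNS problem from an upstream hiccup (plain curl and getent; the URL is the one from the error above):

curl -sSI https://repo.anaconda.com/pkgs/r/noarch/repodata.json.bz2 | head -n 1   # expect an HTTP 200 status line
getent hosts repo.anaconda.com    # empty output would point to name resolution, as in the error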

STARBASEn
Avatar
Send message
Joined: 17 Feb 09
Posts: 48
Credit: 641,934,862
RAC: 1,285,707
Level
Lys
Scientific publications
watwatwatwatwat
Message 50453 - Posted: 10 Sep 2018 | 4:20:16 UTC - in response to Message 50448.

So I'm curious. When we see the results for a QC work unit, they list both a run time and a CPU time. Since we are used to run time and CPU time being a linear measurement, I was wondering if CPU time for QC is actually a combined total for all the CPU threads being used.

I've worked out how long the CPU time is and it's much higher than the actual run time I'm seeing on the machine. What would make sense is if it's the sum of 4 threads all running at the same time for a set amount of time.

CPU time = N threads x actual run time

Is this correct? Mostly for my own curiosity.


I suspect that CPU time = (0.95 × N CPUs × run time) + any time the computer is in use, plus background overhead.

These darn newer WUs are so memory-hungry that the latency on all my machines running them is so long the computers are becoming useless to me. I may have to quit running these until the memory issue is addressed. Both my 8-core FX machines have 16 GB of RAM, and with one QC job running on just 4 cores, my swap usage is as high as 7% and it takes forever to get the machine to do what I need. Plus I had to repartition the root dirs on 6 machines to accommodate the increased disk-space demands.

Not sure how much longer I can hold out.

Stefan
Volunteer moderator
Project developer
Project scientist
Send message
Joined: 5 Mar 13
Posts: 311
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 50454 - Posted: 10 Sep 2018 | 7:35:24 UTC - in response to Message 50453.

The QC jobs should use 4 GB of RAM each. If you are swapping, just don't run as many in parallel. You will never finish them anyway if you end up swapping.

STARBASEn
Avatar
Send message
Joined: 17 Feb 09
Posts: 48
Credit: 641,934,862
RAC: 1,285,707
Level
Lys
Scientific publications
watwatwatwatwat
Message 50464 - Posted: 10 Sep 2018 | 18:55:35 UTC - in response to Message 50454.

The QC jobs should use 4 GB of RAM each. If you are swapping, just don't run as many in parallel. You will never finish them anyway if you end up swapping.


Yes, understood and confirmed. However:

These last two posts of mine are provided simply as FYI, hopefully for the benefit of the project; they are a summary of my experience crunching the QC WUs, not a complaint.

Before these newer QC WUs, I was able to run two mt jobs of 4 threads each (after the simultaneous-start bug was fixed), plus acemd or E@H concurrently, on my two FX machines without any memory or latency issues. With these newer jobs, I can only run a single WU on 4 threads (WCG and acemd on the remaining cores). After about 12 hours or so of run time, the swap file begins to be used, and of course latency (the delay when I try to use the computer) increases to several seconds before it responds. I have not found out which application(s) actually use the extra RAM that causes the swap to be invoked (probably not the QC app, because the tasks finish quickly, usually in around 15-30 minutes), but swap usage gradually increases with time, with 7% being the highest observed to date. The swap does not appear to be used consistently, but rather in short bursts, even when I am not using the machine. Is it possible that the QC app is returning most, but not all, of the memory it uses back to the memory pool as calculations are completed?

I have 6 computers running QC, 4 of which are 4-core headless crunchers; as long as they provide valid results I leave them alone. But on the two FX machines with consoles, the latency issue leaves me little choice but to consider either stopping the QC apps on the FXs or cutting them down to 2 or 3 threads to see if that works. I will try the latter before I stop QC on the FXs, but that is going to increase turnaround time and undo the benefit of using the extra RAM and multiple cores to speed things up, a catch-22 of sorts (2 concurrent WUs of 4 threads each take longer but return 2 WUs per unit of real time, versus a single 2-thread WU finishing quicker but returning only 1 WU per unit of real time).

Jim1348
Send message
Joined: 28 Jul 12
Posts: 614
Credit: 1,199,451,727
RAC: 134,958
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 50465 - Posted: 10 Sep 2018 | 20:17:36 UTC - in response to Message 50464.

Is it possible that the QC app is returning most but not all memory it uses back to the memory pool as calc's are completed?

That is an interesting question, and could explain some of the random errors I have been getting. But QC is running OK now on an i7-3770, running four work units at a time with 2 cores per WU. I see memory usage up around 4 GB per work unit though, so it is fortunate I have 32 GB. That has not prevented the errors in the past on comparable machines, but I have found that when it works, don't touch it.

STARBASEn
Avatar
Send message
Joined: 17 Feb 09
Posts: 48
Credit: 641,934,862
RAC: 1,285,707
Level
Lys
Scientific publications
watwatwatwatwat
Message 50478 - Posted: 12 Sep 2018 | 2:43:32 UTC - in response to Message 50465.


That is an interesting question, and could explain some of the random errors I have been getting. But QC is running OK now on an i7-3770, running four work units at a time with 2 cores per WU. I see memory usage up around 4 GB per work unit though, so it is fortunate I have 32 GB. That has not prevented the errors in the past on comparable machines, but I have found that when it works, don't touch it.


Same here, don't mess with a working situation. I should have gone to the full 32 GiB when I last upgraded RAM. Darn, I went 4 x 2 GB initially and later added 4 x 2 GB more, so now I have to buy a full 32 GB (8 x 4 GB) rather than just add 16 GB more, to the tune of around 250 USD per FX box. My rule has been 2 GB per thread/core, but in this situation 4 GB per thread appears to be the minimum; in fact, it is all my ATX boards can take.

While running 4 WUs at 2 threads each, have you noticed any swap file usage with the 32 GB of RAM?

Jim1348
Send message
Joined: 28 Jul 12
Posts: 614
Credit: 1,199,451,727
RAC: 134,958
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 50480 - Posted: 12 Sep 2018 | 10:29:26 UTC - in response to Message 50478.
Last modified: 12 Sep 2018 | 10:33:36 UTC

While running 4 WUs at 2 threads each, have you noticed any swap file usage with the 32 GB of RAM?

I have set swappiness to never use swap: sudo sysctl vm.swappiness=0

But I don't think I would notice it anyway, since it is a dedicated machine and I don't have a way to check it. But whenever I run "free", I always see plenty of free/available memory.

Currently, it is 3 GB free, and 22 GB available, but it varies a lot. However, I haven't seen less than 18 GB available.
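For reference, the standard checks look like this (nothing project-specific; the sysctl.conf line is only needed to make the setting survive a reboot, and note that swappiness=0 discourages swapping but does not forbid it):

free -h                         # memory and swap usage, human-readable
cat /proc/sys/vm/swappiness     # current swappiness value
echo 'vm.swappiness=0' | sudo tee -a /etc/sysctl.conf    # make the setting persistent across reboots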

Jim1348
Send message
Joined: 28 Jul 12
Posts: 614
Credit: 1,199,451,727
RAC: 134,958
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 50481 - Posted: 12 Sep 2018 | 10:54:04 UTC - in response to Message 50478.

I should have gone to the full 32 GiB when I last upgraded RAM. Darn, I went 4 x 2 GB initially and later added 4 x 2 GB more, so now I have to buy a full 32 GB (8 x 4 GB) rather than just add 16 GB more, to the tune of around 250 USD per FX box. My rule has been 2 GB per thread/core, but in this situation 4 GB per thread appears to be the minimum; in fact, it is all my ATX boards can take.

Just leave it at the default 4 cores per work unit. I need the extra memory only because I am using 2 cores per work unit, and then running 4 work units at a time.

STARBASEn
Avatar
Send message
Joined: 17 Feb 09
Posts: 48
Credit: 641,934,862
RAC: 1,285,707
Level
Lys
Scientific publications
watwatwatwatwat
Message 50482 - Posted: 12 Sep 2018 | 18:28:50 UTC - in response to Message 50481.

Just leave it at the default 4 cores per work unit.


Experimenting, I went to 3 cores with only 1 QC job at a time, and that just about eliminated the user-latency issues. I can live with it now, but as expected the real times increased. Now, after a fresh boot, I have plenty of free memory, but over time it starts pushing toward the limit. It would be interesting to find out whether the QC app is faithfully returning all the memory it uses back to the system after each of the calculations.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 614
Credit: 1,199,451,727
RAC: 134,958
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 50483 - Posted: 12 Sep 2018 | 21:43:24 UTC - in response to Message 50482.

It would be interesting to find out whether the QC app is faithfully returning all the memory it uses back to the system after each of the calculations.

Even though I show 3 GB of memory free and 21 GB available at the moment, it still shows 361 MB of the swap file in use (out of 2 GB total), despite swappiness being set to 0. I don't know what that means; the machine has not been rebooted for three days and still has plenty of memory left, but it could be using it up.

(It is getting hard to post again, due to all the browser timeouts. It is a wonder I am able to get work at all.)

STARBASEn
Avatar
Send message
Joined: 17 Feb 09
Posts: 48
Credit: 641,934,862
RAC: 1,285,707
Level
Lys
Scientific publications
watwatwatwatwat
Message 50484 - Posted: 12 Sep 2018 | 22:03:29 UTC - in response to Message 50483.
Last modified: 12 Sep 2018 | 22:06:22 UTC

(It is getting hard to post again, due to all the browser timeouts. It is a wonder I am able to get work at all.)


I hear you re the website. I have been having the same issues for several days, with browser timeouts and having to resend data just to complete a post.

On the swap file issue on your machine, try swapoff -a as root or sudo. That for sure disables swapping. I use it to flush the swap. Use swapon -a to re-enable swap.

Edit: I use an RPM distro, so I am not sure whether swapoff is available on a DEB-based Linux.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 614
Credit: 1,199,451,727
RAC: 134,958
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 50485 - Posted: 12 Sep 2018 | 22:13:33 UTC - in response to Message 50484.

On the swap file issue on your machine, try swapoff -a as root or sudo. That for sure disables swapping. I use it to flush the swap. Use swapon -a to re-enable swap.

Very good. I did sudo swapoff -a, and now swap shows as zero. We will see how it goes.

tullio
Send message
Joined: 8 May 18
Posts: 102
Credit: 8,462,438
RAC: 88,166
Level
Ser
Scientific publications
wat
Message 50486 - Posted: 13 Sep 2018 | 3:53:03 UTC
Last modified: 13 Sep 2018 | 3:54:16 UTC

I have a GPU task and a CPU task running on my two-core Opteron 1210 with 8 GB of DDR2 RAM and a GTX 750 Ti at a 1202 MHz clock, at 61 °C. The OS is SuSE Leap 42.3. Swap space is 37% used out of 2 GB. My HP Linux laptop, also on SuSE (Leap 15.0), has a 7 GB swap space, unused. It is not running GPUGRID tasks because its BOINC space is only 30 GB, versus the 760 GB on my main Linux host, a 2008-vintage SUN workstation running 24/7 since January 2008. Hats off to SUN!
Tullio

STARBASEn
Avatar
Send message
Joined: 17 Feb 09
Posts: 48
Credit: 641,934,862
RAC: 1,285,707
Level
Lys
Scientific publications
watwatwatwatwat
Message 50490 - Posted: 13 Sep 2018 | 15:19:34 UTC - in response to Message 50485.

Very good. I did sudo swapoff -a, and now swap shows as zero. We will see how it goes.


Remember that when you reboot you will have to use the command again, unless you write a script and set it up to auto-run at boot.
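If anyone does want to automate that, one way is a small systemd unit instead of a script. A sketch only: swapoff's path may be /usr/sbin on some distros, and, as noted further down the thread, running with no swap at all has its own risks.

sudo tee /etc/systemd/system/swapoff.service >/dev/null <<'EOF'
[Unit]
Description=Turn swap off after boot
After=swap.target

[Service]
Type=oneshot
ExecStart=/sbin/swapoff -a

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now swapoff.service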

I have a GPU task and a CPU task running on my two-core Opteron 1210 with 8 GB of DDR2 RAM and a GTX 750 Ti at a 1202 MHz clock, at 61 °C. The OS is SuSE Leap 42.3. Swap space is 37% used out of 2 GB. My HP Linux laptop, also on SuSE (Leap 15.0), has a 7 GB swap space, unused.


How is your user latency (which I define as the delay between a user request and the computer's response)? One of my FX computers has been running QC jobs for about 20 hours since a fresh reboot, and swap usage is already at 4% (333 of 8047 MiB), even though I am only using 3 of the 8 available threads. When I open boincmgr it takes up to 10 seconds to communicate with localhost. I really do suspect that this newer QC app is leaving some small percentage of dirty RAM pages behind after completing its calculations, and that they add up over time to swap usage and user latency.
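One way to check whether it is the QC python processes themselves that grow, rather than something else eating the RAM (generic ps usage; use python3 if that is the process name on your system):

ps -C python -o pid,rss,etime,args --sort=-rss | head    # resident set size (KiB) and age of each python process
free -h                                                   # system-wide picture for comparison
# run both now and again a few hours later; a steadily growing RSS on long-lived processes would support the theory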

tullio
Send message
Joined: 8 May 18
Posts: 102
Credit: 8,462,438
RAC: 88,166
Level
Ser
Scientific publications
wat
Message 50491 - Posted: 13 Sep 2018 | 16:03:35 UTC - in response to Message 50490.
Last modified: 13 Sep 2018 | 16:24:42 UTC

I am using this computer to read my mail, browse the web and read the newspapers, including the NYTimes (which gives me ten free articles a month), and I feel no delay. I have a 30 Mbit/s mixed fiber/copper connection from Telecom Italy: the fiber reaches a cabinet not far from my home, and from there it is copper. My router is also connected by WiFi to a Windows 10 PC, a printer and a decoder that gives me SKY TV on the TV set, which doubles as the monitor of the Windows PC. I just had a Microsoft update on the Windows 10 PC and it restarted with its two Einstein@home tasks.
Tullio
I also have a smartphone running Android 7.1.1 on its 8-core 64-bit ARM CPU, connected by WiFi to the router. It is running SETI@home and Einstein@home CPU tasks.
____________

Toni
Volunteer moderator
Project administrator
Project developer
Project scientist
Send message
Joined: 9 Dec 08
Posts: 732
Credit: 4,285,282
RAC: 0
Level
Ala
Scientific publications
watwatwatwat
Message 50492 - Posted: 13 Sep 2018 | 17:24:36 UTC - in response to Message 50491.

Concerning returning the memory when the WU is over: that is guaranteed; the OS enforces it.

Concerning swapoff: I don't recommend it. A small amount of swap use is normal. If you remove this "safety valve", the only thing that will happen is that processes will simply fail (often in confusing ways).

My suggestion is to pay attention to your system's performance during heavy BOINC use. If it becomes sluggish/irritating/unusable, swap use is probably going up and you will have to run fewer tasks simultaneously. Removing the swap will just make the system kill them.

This said, QC tasks come in various sizes, so you may hit an "unfortunate" combination of large ones.
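A handy way to watch for that while crunching is vmstat (part of procps on most distros); non-zero si/so columns mean pages are actively moving in and out of swap rather than just sitting there:

vmstat 5    # one report every 5 seconds; watch the si (swap in) and so (swap out) columns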

T

Zalster
Avatar
Send message
Joined: 26 Feb 14
Posts: 57
Credit: 2,019,205,574
RAC: 4,624,706
Level
Phe
Scientific publications
watwatwat
Message 50493 - Posted: 13 Sep 2018 | 17:43:26 UTC

It doesn't happen a lot, but I am still getting these occasionally, and they cause errors.


</stderr_txt>
<message>
finish file present too long
</message>
]]>

Toni
Volunteer moderator
Project administrator
Project developer
Project scientist
Send message
Joined: 9 Dec 08
Posts: 732
Credit: 4,285,282
RAC: 0
Level
Ala
Scientific publications
watwatwatwat
Message 50494 - Posted: 13 Sep 2018 | 17:45:07 UTC - in response to Message 50493.

General question to Windows users: do you see "black windows", like a command prompt, coming up when running QC apps?

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 108
Credit: 67,695,413
RAC: 558,048
Level
Thr
Scientific publications
wat
Message 50500 - Posted: 13 Sep 2018 | 21:32:53 UTC - in response to Message 50493.

It doesn't happen a lot, but I am still getting these occasionally, and they cause errors.


</stderr_txt>
<message>
finish file present too long
</message>
]]>

Those are my bane too. Nothing can be done about them. The usual advice is not to quit BOINC just as a task is finishing up, but it happens even when you never quit BOINC. For whatever reason, one task can finish up just as another project's task starts, and BOINC takes too long to report the task; hence the error.

captainjack
Send message
Joined: 9 May 13
Posts: 138
Credit: 951,578,780
RAC: 248,214
Level
Glu
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 50502 - Posted: 13 Sep 2018 | 23:26:43 UTC
Last modified: 13 Sep 2018 | 23:31:24 UTC

Toni asked:

General question to Windows users: do you see "black windows", like a command prompt, coming up when running QC apps?


I just started my first QC app on Windows. About 1 minute after the task started, a "black window" flashed up on the display and then immediately disappeared.


About 15 minutes after the task started, there is a "python" app listed in Windows Task Manager that I assume is the QC app. It is using all available threads on a 16-thread system. In BOINC Manager, the task shows that it should be using "4 CPUs".


Let me know if you need more info.

Toni
Volunteer moderator
Project administrator
Project developer
Project scientist
Send message
Joined: 9 Dec 08
Posts: 732
Credit: 4,285,282
RAC: 0
Level
Ala
Scientific publications
watwatwatwat
Message 50505 - Posted: 14 Sep 2018 | 10:42:17 UTC - in response to Message 50502.
Last modified: 14 Sep 2018 | 10:43:18 UTC

Toni asked:

General question to Windows users: do you see "black windows", like a command prompt, coming up when running QC apps?


I just started my first QC app on Windows. About 1 minute after the task started, a "black window" flashed up on the display and then immediately disappeared.


About 15 minutes after the task started, there is a "python" app listed in Windows Task Manager that I assume is the QC app. It is using all available threads on a 16-thread system. In BOINC Manager, the task shows that it should be using "4 CPUs".


Let me know if you need more info.


OK, thanks. If the flashing is not annoying, I'd leave it as it is.

Regarding threads: the python app is indeed QC. Are you running multiple WUs simultaneously, or did just one WU occupy all 16 threads?

Thanks a lot

captainjack
Send message
Joined: 9 May 13
Posts: 138
Credit: 951,578,780
RAC: 248,214
Level
Glu
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 50508 - Posted: 14 Sep 2018 | 12:48:36 UTC

Toni asked:

Regarding threads: the python app is indeed QC. Are you running multiple WUs simultaneously, or did just one WU occupy all 16 threads?

I was running one task, and it took all the threads. I will try to get another task and run it when one becomes available.


Message boards : Multicore CPUs : New QC app