Message boards : Multicore CPUs : New QC app
Author | Message |
---|---|
Dear all, after a hot weekend during which I accidentally cancelled the QC WUs, we are ready to start again with a new app. Once we get it right, we should be able to run on more machines (gcc is no longer a requirement). | |
ID: 49759 | Rating: 0 | rate: / Reply Quote | |
Ready for action ;) | |
ID: 49760 | Rating: 0 | rate: / Reply Quote | |
First 330 task completed and validated. The second one is waiting for memory alongside a GPU task, which is, however, using only 4% of the total 8 GB of RAM. | |
ID: 49766 | Rating: 0 | rate: / Reply Quote | |
I will let the WUs run out for a day because I want to see if something weird is happening on my side (the WUs are calculating fine, don't worry). I'll submit more once they are completed tomorrow. | |
ID: 49778 | Rating: 0 | rate: / Reply Quote | |
@ Toni, @ Stefan | |
ID: 50348 | Rating: 0 | rate: / Reply Quote | |
@ Toni, @ Stefan | |
ID: 50349 | Rating: 0 | rate: / Reply Quote | |
I think the actual error is a segmentation fault, which leaves large temporary files behind, and their transfer is attempted. Let's see if the situation improves with the new version. | |
ID: 50404 | Rating: 0 | rate: / Reply Quote | |
Just curious: is there someplace that shows how well the QC apps are running, like the server status page does? I can look at my own machines and see the errors there, but overall, is there someplace like there is for the GPU apps? | |
ID: 50412 | Rating: 0 | rate: / Reply Quote | |
I think the actual error is a segmentation fault, which leaves large temporary files behind, and their transfer is attempted. Let's see if the situation improves with the new version. BOINC won't attempt to upload a temporary file unless its name is specified with an upload URL in the workunit template. I'll be able to advise better when you release the Windows app, and I can see any problems happening on my own machines. | |
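For anyone unfamiliar with what is being referred to here: the upload declaration lives in the project's server-side output (result) template. A rough sketch of such a template is below; the file name, size limit and open_name are placeholders for illustration, not GPUGRID's actual values.

```xml
<!-- Sketch of a BOINC output (result) template; values are illustrative only -->
<file_info>
    <name><OUTFILE_0/></name>          <!-- macro expanded by the server -->
    <generated_locally/>
    <upload_when_present/>
    <max_nbytes>5000000</max_nbytes>   <!-- hard cap on the uploaded file size -->
    <url><UPLOAD_URL/></url>           <!-- only files declared like this are uploaded -->
</file_info>
<result>
    <file_ref>
        <file_name><OUTFILE_0/></file_name>
        <open_name>output.dat</open_name>  <!-- hypothetical logical name -->
    </file_ref>
</result>
```

A file the app leaves behind without such an entry is not uploaded, which is consistent with the point made above.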
ID: 50415 | Rating: 0 | rate: / Reply Quote | |
May well be restricted to a few machines. | |
ID: 50417 | Rating: 0 | rate: / Reply Quote | |
May well be restricted to a few machines. If you want to count me in, I'll do my best to report on any issues that may arise (in beta mode if necessary). I'm primarily Windows 7, so the WSL approach would be difficult except for one dual-boot test machine with Windows 10 ready to run. | |
ID: 50418 | Rating: 0 | rate: / Reply Quote | |
My main Linux box is crunching SELE6 on its Opteron 1210. If necessary, I have a Windows 10 PC with an AMD A10-6700 and 22 GB RAM. | |
ID: 50419 | Rating: 0 | rate: / Reply Quote | |
Again: <message> Do I have to reset the project? | |
ID: 50420 | Rating: 0 | rate: / Reply Quote | |
CPU apps are not working very well; lots of idle time. It may be because of the 48 threads... some Python processes use 4 cores, the rest just 1, and not all the time. It's got to be an I/O issue (?). | |
ID: 50421 | Rating: 0 | rate: / Reply Quote | |
Each task should use 4 threads max. | |
ID: 50422 | Rating: 0 | rate: / Reply Quote | |
Again. Another clue: this happens when I reboot the virtual machine (and the WU restarts). | |
ID: 50423 | Rating: 0 | rate: / Reply Quote | |
I have had good luck for the past day running QC on my i7-4770 (Ubuntu 16.04). That doesn't prove much, except that there is no fatal flaw in all the work units. | |
ID: 50424 | Rating: 0 | rate: / Reply Quote | |
(and the wu re-starts) I think that is indeed a clue. One of the mechanisms I considered was a WU re-starting, and appending a second result to an existing file, doubling its size. Checking the 'headroom' between the typical result file size and <max_nbytes> was one of the tests I had in mind for the Windows version. Can anyone comment? | |
ID: 50425 | Rating: 0 | rate: / Reply Quote | |
Most LHC users are Windows users and they run Scientific Linux research programs from CERN (not BOINC programs) using Virtual Machines. | |
ID: 50426 | Rating: 0 | rate: / Reply Quote | |
This is interesting. Twice overnight my PC crashed, shut down, and didn't run. http://www.gpugrid.net/result.php?resultid=18686175 The messages don't appear immediately after the crashes, but a few hours later. And I just attached this machine to GPUGrid a couple of days ago, so it didn't take long. | |
ID: 50427 | Rating: 0 | rate: / Reply Quote | |
(boboviz - I wouldn't draw conclusions from virtual machines. Even LHC has a hard time with their own stuff.) Do you think I have to use a physical Linux machine? | |
ID: 50430 | Rating: 0 | rate: / Reply Quote | |
(boboviz - I wouldn't draw conclusions from virtual machines. Even LHC has a hard time with their own stuff.) Don't know. Real Linux hasn't fixed the problem for me. But the VM's just add another layer of complexity, and from what I understand (I am not an expert), hide the various other problems. At least that is what they say on the LHC forum, where they would like to get away from VirtualBox if that were possible. It is why they developed native ATLAS, and would like to do that for the other projects if it were possible. | |
ID: 50431 | Rating: 0 | rate: / Reply Quote | |
I am running GPUGRID, both CPU and GPU, on a SuSE Linux box with a GTX 750 Ti GPU board. I am running Atlas@home of LHC on a Windows 10 PC with 4 cores (though Task Manager says two cores and 4 logical processors on an AMD A10-6700 CPU). It has a GTX 1050 Ti GPU board, but GPUGRID overheats it to 80 C and it crashes, so I am running Atlas (no GPU, but VirtualBox 5.2.18), Einstein@home and SETI@home, both CPU and GPU. | |
ID: 50432 | Rating: 0 | rate: / Reply Quote | |
Restarts may be a problem. It's not really the output file size (which should be small), but the fact that temporary files are not deleted as the consequence of some other error. | |
ID: 50433 | Rating: 0 | rate: / Reply Quote | |
If it does, is the problem solved by enabling the "keep application in memory" option? I don't think I can reliably reproduce it, but I have "Leave application in memory" enabled (as is my usual practice), and that does not prevent it. EDIT: Also, I should point out that there were no other BOINC applications running, and my machine runs 24/7, so the QC work units were never being suspended anyway. | |
ID: 50434 | Rating: 0 | rate: / Reply Quote | |
So I'm curious. When we see the results for a QC work unit, it lists both a run time and a CPU time. Since we are so used to run time and CPU time being a linear measurement, I was wondering if the CPU time for QC is actually a combined sum total for all the CPU threads being used. | |
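For what it's worth, my understanding of BOINC's accounting for multi-threaded (mt) apps is that run time is wall-clock time while CPU time is summed over all busy threads, so it can be larger or smaller than run time. The numbers below are made up purely to illustrate:

```
run_time = wall-clock seconds the task was in the Running state
cpu_time ≈ sum over all threads of the seconds they actually kept a core busy

Example: a 4-thread task fully busy for 1 hour of wall clock
    run_time ≈ 3,600 s,  cpu_time ≈ 4 × 3,600 = 14,400 s

Same task mostly waiting on disk I/O
    cpu_time falls far below 4 × run_time (matching the low utilization reported here)
```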
ID: 50448 | Rating: 0 | rate: / Reply Quote | |
Since the utilization of the CPU is so low on these WUs I presume it doesn't count most of the run time as CPU time because the CPU is waiting for the hard drive. | |
ID: 50449 | Rating: 0 | rate: / Reply Quote | |
I thought I would try my Ryzen 1700 on QC (Ubuntu 18.04), to see if it would behave differently than my Intel machines (i7-4770, i7-8700). CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://repo.anaconda.com/pkgs/r/noarch/repodata.json.bz2> This may be a somewhat different error message than the others, but it seems to me that they are all communications-related. I suspect it has to do with the intermittent connections I have been getting to GPUGrid for the past several weeks/months, as previously discussed. http://www.gpugrid.net/forum_thread.php?id=4806 EDIT: The next two are running OK, and it looks like they will complete normally. It is a very intermittent problem. | |
ID: 50451 | Rating: 0 | rate: / Reply Quote | |
So I'm curious. When we see the results for a QC work unit, it lists both a run time and a CPU time. Since we are so used to run time and CPU time being a linear measurement, I was wondering if the CPU time for QC is actually a combined sum total for all the CPU threads being used. I suspect that cpu time = (0.95 × N CPUs × run time + any time the computer is in use, plus background overhead). These darn newer WU's are so memory hungry that the latency on all my machines running them is so long the computers are becoming useless to me. I may have to quit running these until the memory issue is addressed. Both my FX 8-cores have 16 GB RAM, and with one QC job running just 4 cores, my swap usage is as high as 7% and it takes forever to get the machine to do what I need. Plus, I had to repartition the root dirs on 6 machines to accommodate the increased HD demands. Not sure how much longer I can hold out. | |
ID: 50453 | Rating: 0 | rate: / Reply Quote | |
The QC jobs should use 4GB of RAM each. If you are swapping just don't run as many in parallel. You will never finish them anyway if you end up swapping. | |
ID: 50454 | Rating: 0 | rate: / Reply Quote | |
The QC jobs should use 4GB of RAM each. If you are swapping just don't run as many in parallel. You will never finish them anyway if you end up swapping. Yes, understood and confirmed. However, these last two posts of mine are provided simply as FYI, hopefully for the benefit of the project; they are a summary of my experiences crunching the QC WU's, not a complaint. Before these newer QC WU's, I was able to run two mt jobs with 4 threads each (after the simultaneous-start bug was fixed) plus acemd or E@H concurrently on my two FX machines without any memory or latency issues. With these newer jobs, I can only run a single WU with 4 threads (WCG and acemd on the remaining cores). After about 12 hours or so of run time, the swap file begins to be utilized, and of course latency (the delay before the computer responds when I try to use it) starts increasing, up to several seconds. I have not found out which application(s) actually use the extra RAM that causes the swap to be invoked (probably not the QC app, because the calculations finish quickly, usually around 15 - 30 mins), but the swap usage gradually increases with time, with 7% being the highest observed to date. The swap does not appear to be utilized consistently but rather in short increments of time, even when I am not using the machine. Is it possible that the QC app is returning most but not all memory it uses back to the memory pool as calc's are completed? I have 6 computers running QC, 4 of them being 4-core headless crunchers; as long as they provide valid results I leave them alone. But on the two FX machines with consoles, the latency issue leaves me with little choice but to consider stopping the QC apps on the FX's, or perhaps cutting them down to 2 or 3 threads to see if that works. I will try the latter before I stop QC on the FX's, but that is going to increase turnaround time and undo the benefit of using the extra RAM and multiple cores to speed turnaround, seemingly a catch-22 (2 concurrent WU's with 4 threads each taking longer but returning 2 WU's per unit of real time, vs. running a single WU with 2 threads quicker but returning 1 WU per unit of real time). | |
ID: 50464 | Rating: 0 | rate: / Reply Quote | |
Is it possible that the QC app is returning most but not all memory it uses back to the memory pool as calc's are completed? That is an interesting question, and could explain some of the random errors I have been getting. But QC is running OK now on an i7-3770, running four work units at a time with 2 cores per WU. I see memory usage up around 4 GB per work unit though, so it is fortunate I have 32 GB. That has not prevented the errors in the past on comparable machines, but I have found that when it works, don't touch it. | |
ID: 50465 | Rating: 0 | rate: / Reply Quote | |
Same here, don't mess with a working situation. I should have gone full 32 GiB when I last upgraded RAM. Darn, I went 4 x 2 initially and later added 4 x 2 more so now I have to buy all 32 G (8 x 4) rather than just add 16 G more, to the tune of around 250 USD each FX box. My rule has been 2 G per thread/core but in this situation 4 G /thread appears minimum, in fact, all my ATX can take. While running 4 WU's 2 threads each, have you noticed any swap file usage with the 32 G RAM? | |
ID: 50478 | Rating: 0 | rate: / Reply Quote | |
While running 4 WU's 2 threads each, have you noticed any swap file usage with the 32 G RAM? I have set swappiness to never use swap: sudo sysctl vm.swappiness=0 But I don't think I would notice it anyway, since it is a dedicated machine and I don't have a way to check it. But whenever I run "free", I always see plenty of free/available memory. Currently, it is 3 GB free, and 22 GB available, but it varies a lot. However, I haven't seen less than 18 GB available. | |
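A side note on that setting (standard sysctl behaviour, nothing GPUGRID-specific): `sysctl vm.swappiness=0` only lasts until the next reboot, and swappiness 0 strongly discourages rather than completely forbids swapping. To make it persistent on a typical Ubuntu/Debian box, something like the following should work; the file name 99-swappiness.conf is just a convention:

```bash
# Apply now (as in the post above)
sudo sysctl vm.swappiness=0

# Persist across reboots
echo 'vm.swappiness=0' | sudo tee /etc/sysctl.d/99-swappiness.conf
sudo sysctl --system   # re-read all sysctl configuration files
```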
ID: 50480 | Rating: 0 | rate: / Reply Quote | |
I should have gone full 32 GiB when I last upgraded RAM. Darn, I went 4 x 2 initially and later added 4 x 2 more so now I have to buy all 32 G (8 x 4) rather than just add 16 G more, to the tune of around 250 USD each FX box. My rule has been 2 G per thread/core but in this situation 4 G /thread appears minimum, in fact, all my ATX can take. Just leave it at the default 4 cores per work unit. I need the extra memory only because I am using 2 cores per work unit, and then running 4 work units at a time. | |
ID: 50481 | Rating: 0 | rate: / Reply Quote | |
Just leave it at the default 4 cores per work unit. Experimenting, I went to 3 cores with only 1 QC job at a time and that helped just about eliminate the user latency issues. I can live with it now but as expected, the real times increased. Now, after a fresh boot, I have plenty of free memory but over time it starts pushing toward the limit. Would be interesting to find out if the QC app is faithfully returning all memory used back to the system after each of the calc's. | |
ID: 50482 | Rating: 0 | rate: / Reply Quote | |
Would be interesting to find out if the QC app is faithfully returning all memory used back to the system after each of the calc's. Even though I show 3 GB memory free, and 21 GB available at the moment, it still shows that 361 MB of the swap file is used (out of 2 GB total), even though I have swappiness set to 0. I don't know what that means, but the machine has not been rebooted for three days, and still has plenty of memory left. But it could be using it up. (It is getting hard to post again, due to all the browser timeouts. It is a wonder I am able to get work at all.) | |
ID: 50483 | Rating: 0 | rate: / Reply Quote | |
(It is getting hard to post again, due to all the browser timeouts. It is a wonder I am able to get work at all.) I hear you re the website. I have been having the same issues for several days with the browser timeouts and having to resend data just to complete a post. On the swap file issue on your machine, try swapoff -a as root or sudo. That for sure disables swapping. I use it to flush the swap. Use swapon -a to re-enable swap. Edit: I use an RPM distro, so I am not sure if swapoff is available on a DEB Linux. | |
ID: 50484 | Rating: 0 | rate: / Reply Quote | |
On the swap file issue on your machine, try swapoff -a as root or sudo. That for sure disables swapping. I use it to flush the swap. Use swapon -a to re-enable swap. Very good. I did sudo swapoff -a, and now swap shows as zero. We will see how it goes. | |
ID: 50485 | Rating: 0 | rate: / Reply Quote | |
I have a GPU task and a CPU task running on my two-core Opteron 1210, 8 GB DDR2 RAM, GTX 750 Ti at 1202 MHz clock, 61 C. OS is SuSE Leap 42.3. Swap space is used at 37% of 2 GB. My HP Linux laptop, also running SuSE Leap 15.0, has a 7 GB swap space, not used. It is not running GPUGRID tasks because its BOINC space is only 30 GB instead of the 760 GB on my main Linux host, a 2008-vintage SUN workstation, running 24/7 since January 2008. Hats off to SUN! | |
ID: 50486 | Rating: 0 | rate: / Reply Quote | |
Very good. I did sudo swapoff -a, and now swap shows as zero. We will see how it goes. Remember, when you reboot you will have to use the command again, unless you write a script and set it up to auto run at boot. I have a GPU task and a CPU task running on my two-core Opteron 1210, 8 GB DDR2 RAM, GTX 750 Ti at 1202 MHz clock, 61 C. OS is SuSE Leap 42.3. Swap space is used at 37% of 2 GB. My HP Linux laptop, also running SuSE Leap 15.0, has a 7 GB swap space, not used. How is your user latency (which I define as the delay between a user request and the computer's response)? One of my FX computers has been running QC jobs for about 20 hours since a fresh reboot and swap usage is already at 4% (333 of 8047 MiB), and I am only using 3 threads of the 8 available. I open boincmgr and it is taking up to 10 seconds to communicate with localhost. I really suspect this newer QC app is leaving some small percentage of dirty RAM pages behind after completing the calc's, and they add up over time to swap usage and user latency. | |
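One way to do the "script and set it up to auto run at boot" idea, assuming a systemd-based distro; the unit name no-swap.service is made up for this sketch. (Commenting out the swap line in /etc/fstab is the simpler permanent alternative.)

```bash
# Create a one-shot service that turns swap off at every boot
sudo tee /etc/systemd/system/no-swap.service > /dev/null <<'EOF'
[Unit]
Description=Disable swap at boot

[Service]
Type=oneshot
ExecStart=/sbin/swapoff -a

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable no-swap.service   # re-enable swap any time with: sudo swapon -a
```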
ID: 50490 | Rating: 0 | rate: / Reply Quote | |
I am using this computer to read my mail, to navigate the WWW, and to read the newspapers (including the NYTimes, which gives me ten free articles/month), and I feel no delay. I have a 30 Mbit/s mixed fiber/copper connection from Telecom Italy, which means that the fiber reaches a cabinet not far from my home, and then I have a copper connection. My router is also connected by WiFi to a Windows 10 PC, a printer and a decoder which gives me SKY TV on the TV set, which is also the monitor of the Windows PC. I just had a Microsoft update on the Windows 10 PC and it restarted with its two Einstein@home tasks. | |
ID: 50491 | Rating: 0 | rate: / Reply Quote | |
Concerning returning the memory when the WU is over: that is guaranteed. The OS enforces it. | |
ID: 50492 | Rating: 0 | rate: / Reply Quote | |
Doesn't happen a lot, but I am still getting these occasionally, and they cause errors.
| |
ID: 50493 | Rating: 0 | rate: / Reply Quote | |
General question to windows users: do you see "black windows" like a command prompt coming up when running QC apps? | |
ID: 50494 | Rating: 0 | rate: / Reply Quote | |
Doesn't happen a lot, but I am still getting these occasionally, and they cause errors. Those are my bane too. Nothing can be done about them. The usual advice is not to quit BOINC just as a task is finishing up. But it happens even when you never quit BOINC. It can happen, for whatever reason, that one task finishes up just as another project's task starts, and BOINC takes too long to report the task. Hence the error. | |
ID: 50500 | Rating: 0 | rate: / Reply Quote | |
Toni asked: General question to windows users: do you see "black windows" like a command prompt coming up when running QC apps? Just started my first QC app on windows. About 1 minute after the task started, a "black window" flashed up on the display then immediately disappeared. About 15 minutes after the task started, there is a "python" app listed in windows task manager that I assume is the QC app. It is using all available threads on a 16 thread system. In BOINC Manager, the task shows that it should be using "4 CPU's". Let me know if you need more info. | |
ID: 50502 | Rating: 0 | rate: / Reply Quote | |
Toni asked: Ok thanks. If the flashing is not annoying, I'd leave it as it is. Regarding threads - the python app is indeed QC. Are you running multiple WUs simultaneously, or did just one WU occupy all 16 threads? Thanks a lot | |
ID: 50505 | Rating: 0 | rate: / Reply Quote | |
Toni asked: Regarding threads - the python app is indeed QC. Are you running multiple WUs simultaneously, or did just one WU occupy all 16 threads? I was running 1 task that took all threads. I will try to get another task and run it when one becomes available. | |
ID: 50508 | Rating: 0 | rate: / Reply Quote | |
Just had 8 tasks error out with the following error code. SafetyError: The package for hdf5 located at /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/pkgs/hdf5-1.10.2-hba1933b_1 I tried a reset on the project but that computer is now locked out for the rest of the day. Anything else I need to do? | |
ID: 50588 | Rating: 0 | rate: / Reply Quote | |
Looks like we have made headway on these work units. Just over a thousand left. Will be interesting to see if they bring back the other set. | |
ID: 51019 | Rating: 0 | rate: / Reply Quote | |
Will be interesting to see if they bring back the other set. I could use some longer ones, as long as the disk space problem is solved. (I have 170 GB free, so it should not be a problem, though that remains to be seen.) | |
ID: 51020 | Rating: 0 | rate: / Reply Quote | |
Sorry for running out. Give me some time. I'm at a conference so it's not super easy to spawn new work units right now but I'll give it a try. | |
ID: 51021 | Rating: 0 | rate: / Reply Quote | |
Actually I think I will let them run out to not resubmit the same workunits again. I will send new ones out tomorrow or the day after. | |
ID: 51022 | Rating: 0 | rate: / Reply Quote | |
No worries, it was more of a comment than a complaint. Enjoy your conference, we aren't going anyplace hahaha... | |
ID: 51023 | Rating: 0 | rate: / Reply Quote | |
Can you give us any comment about the results of this batch? Was it useful? | |
ID: 51024 | Rating: 0 | rate: / Reply Quote | |
@kain The easy answer is "yes". This is a machine learning project so the more data you throw at it the better the predictor becomes. If you want to know if we solved the problem we set out to solve then the answer is "not totally yet" although we will publish something on it very soon. | |
ID: 51040 | Rating: 0 | rate: / Reply Quote | |
I'm going to let the workunits run out again to make sure I don't recalculate them. In a few days I'll send out new ones. | |
ID: 51170 | Rating: 0 | rate: / Reply Quote | |
Ok, Thanks Stefan. Happy New Years | |
ID: 51171 | Rating: 0 | rate: / Reply Quote | |
If everything goes smoothly, I ought to have the last workunits out by tomorrow, or the day after at the latest. | |
ID: 51200 | Rating: 0 | rate: / Reply Quote | |
If everything goes smoothly, I ought to have the last workunits out by tomorrow, or the day after at the latest. No problem. We are ready when you are. But the term "the last" is ominous. Is it the end of the road for QC, or just for this experiment? | |
ID: 51203 | Rating: 0 | rate: / Reply Quote | |
No, I think there will be more in the future. But we will have to consider well what to run since the large molecules were a bit of a pain. Some molecule fragmentation might solve the issue. But there might be some downtime on QM jobs, hard to tell. | |
ID: 51204 | Rating: 0 | rate: / Reply Quote | |
That is perfectly OK. It is much better for projects to tell us what is going on than to leave us hanging. You never know if they are dead or alive. At least you are alive. | |
ID: 51209 | Rating: 0 | rate: / Reply Quote | |
So, it seems like it will take 2 more days or so, since other people are running stuff on the machine that calculates the workunits. It's a bit of a massive amount of data, so bear with it a few more days. | |
ID: 51214 | Rating: 0 | rate: / Reply Quote | |
Which machine do you use for calculating the WUs? Just in case some of us are willing to donate some money for hardware upgrades at your institute. | |
ID: 51215 | Rating: 0 | rate: / Reply Quote | |
Although I love the support, I feel like it's an exercise in futility. Eventually all machines we have become overrun by one or more of our lab members. No machine is safe :D Even if I had one dedicated to myself I might use it for something else while not calculating WUs (would be a shame to have it idle) and then when we run out of WUs I would need to deal with the same issue. I guess we'll just have to wait one or two more days. It's roughly 80% done. Worst case if it's still not done in two days I'll talk with the lab member currently squeezing the living juices out of it to stop his jobs for a moment if possible. | |
ID: 51219 | Rating: 0 | rate: / Reply Quote | |
Does QC still have a problem when multiple WUs start at the same time??? If so, start with:

<app>
    <name>QC</name>
    <max_concurrent>1</max_concurrent>
</app>

Wait a minute, change 1 to 2, and tell BOINC to Read config files:

<app>
    <name>QC</name>
    <max_concurrent>2</max_concurrent>
</app>

Wait a minute, change 2 to 3, and tell BOINC to Read config files again. Repeat until the maximum desired number of WUs is reached. Reset N to 1 so that, in case of a reboot, you restart one at a time. E.g., I have a Xeon E5-2699 v4 with 22c/44t; I could conceivably run ten 4-CPU WUs and leave 4 threads for the GPUs. When the next batch of QC WUs posts I'm going to babysit a computer and do this manually. I've never written a script, so I'm book-learning it now. If someone knows how to code this I'd be glad to test drive their script for them. | |
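Since a script was asked for, here is a rough bash sketch of the manual procedure above. It is untested here; it assumes a standard Linux package install of BOINC, that the only <max_concurrent> entry in the file is the QC one, and that boinccmd can talk to the local client (it may need --host/--passwd on some setups).

```bash
#!/bin/bash
# Stagger QC work unit starts by raising max_concurrent one step at a time.
CFG=/var/lib/boinc-client/projects/www.gpugrid.net/app_config.xml  # adjust to your install
MAX=10      # target number of concurrent QC tasks
DELAY=60    # seconds to wait between increments

for n in $(seq 1 "$MAX"); do
    # Rewrite the (single) max_concurrent value in place
    sudo sed -i "s|<max_concurrent>[0-9]*</max_concurrent>|<max_concurrent>${n}</max_concurrent>|" "$CFG"
    boinccmd --read_cc_config    # same effect as "Read config files" in the Manager
    sleep "$DELAY"
done

# The post suggests resetting to 1 before a reboot so tasks again start one at a time:
# sudo sed -i "s|<max_concurrent>[0-9]*</max_concurrent>|<max_concurrent>1</max_concurrent>|" "$CFG"
# boinccmd --read_cc_config
```

As noted a couple of posts down, the simultaneous-start problem was reportedly fixed several weeks ago, so this should no longer be necessary.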
ID: 51220 | Rating: 0 | rate: / Reply Quote | |
Aurum asked: Does QC still have a problem when multiple WUs start at the same time??? Short answer is No. This was fixed several weeks ago. | |
ID: 51228 | Rating: 0 | rate: / Reply Quote | |
Great. How much RAM do I need per QC WU??? | |
ID: 51229 | Rating: 0 | rate: / Reply Quote | |
Is there an L3 Cache requirement for QC WUs??? | |
ID: 51231 | Rating: 0 | rate: / Reply Quote | |
Today I'll upload a new batch. I have no real clue on the technical requirements though. It will be try-and-see because these are some last large molecules I need to run. After these large ones I'll send some last small and fast ones again. | |
ID: 51232 | Rating: 0 | rate: / Reply Quote | |
Wow! This is a major RAM hog. I just watched while 6 4C WUs consumed 16 GB and crashed a computer. Then another machine had 3 4C WUs running; it stopped one of them as "Suspended: Waiting for memory", then a second one, and the third is still using 12 of 16 GB. Then all 3 come back on and use all the memory, then start failing with computation errors.

<app>
    <name>QC_beta</name>
    <max_concurrent>1</max_concurrent>
</app>
<app_version>
    <app_name>QC_beta</app_name>
    <plan_class>mt</plan_class>
    <avg_ncpus>4</avg_ncpus>
    <cmdline>--fetch_minimal_work</cmdline>
</app_version> | |
ID: 51234 | Rating: 0 | rate: / Reply Quote | |
I got two of them. One took 2.5 GB, and the other 3.5 GB. Send more. I need to use the 32 GB of memory on my i7-4770. | |
ID: 51235 | Rating: 0 | rate: / Reply Quote | |
Still beta WU for now. Disk usage may be large (and we can't tune it). Memory should be <4 GB per WU. Threads, up to 4, controlled by boinc. There will be more. Thanks! | |
ID: 51236 | Rating: 0 | rate: / Reply Quote | |
I have some TONI WUs that have very low CPU usage and very low RAM usage. | |
ID: 51237 | Rating: 0 | rate: / Reply Quote | |
Another thing I noticed is my GPU usage, normally at 100% running GPUGrid is now 5-15% and the CPU is not anywhere near 100% usage. | |
ID: 51238 | Rating: 0 | rate: / Reply Quote | |
Today I'll upload a new batch. I have no real clue on the technical requirements though. It will be try-and-see because these are some last large molecules I need to run. After these large ones I'll send some last small and fast ones again. No joy with my computers in getting any work. Since they are remote, I can't access them to request new work units. It shouldn't be an issue with 1 machine, as it has 64 GB RAM and a 4 TB hard drive. The other has 32 GB but also a 4 TB HDD. Just waiting to see if they get any work units to crunch. ____________ | |
ID: 51239 | Rating: 0 | rate: / Reply Quote | |
No joy with my computers in getting any work. Do you have "Run test applications?" selected in your preferences? I am getting them fairly regularly now. | |
ID: 51242 | Rating: 0 | rate: / Reply Quote | |
Can you set it so that each Host gets one beta before a single Host gets 87 ??? | |
ID: 51243 | Rating: 0 | rate: / Reply Quote | |
No joy with my computers in getting any work. Had the beta but not the "Run test applications" selected. Thanks for that. Made the change, will see if that works. ____________ | |
ID: 51246 | Rating: 0 | rate: / Reply Quote | |
Each BOINC slot with a QC job is about 29 GB, so plan accordingly. I am running two now, which uses all eight cores and still leaves 29 GB free, so I think I am safe for now. | |
ID: 51247 | Rating: 0 | rate: / Reply Quote | |
Current test WUs are limited to 30GB, but many are failing because the limit is hit. I will need to raise the limit (the production QC app is 60 GB). | |
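For context, the per-WU limit mentioned here is BOINC's standard per-workunit disk bound; on the server side it is set roughly like the sketch below (the 30 GB figure just mirrors the number quoted above, and the snippet is illustrative only).

```xml
<!-- Sketch of the disk bound in a workunit (input) template -->
<workunit>
    <rsc_disk_bound>30000000000</rsc_disk_bound>  <!-- bytes; roughly 30 GB -->
</workunit>
```

When a task's files exceed this bound, the client aborts it with EXIT_DISK_LIMIT_EXCEEDED, which is the error reported further down in the thread.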
ID: 51249 | Rating: 0 | rate: / Reply Quote | |
I will need to raise the limit (the production QC app is 60 GB). I will put in a 256 GB SSD, but I have a 500 GB sitting on the shelf too. Let us know what you need. | |
ID: 51251 | Rating: 0 | rate: / Reply Quote | |
Two tasks have "restarted" themself and started run from th beginning when they achieved about 40%. Just before it my CPU has heated suddenly to over 80 Celsius degrees, fortunately only for a moment. | |
ID: 51252 | Rating: 0 | rate: / Reply Quote | |
The default BOINC settings for disk space are far too low for your experiment. The RAM requirements are all over the map. You're sending these jobs to Xeons with 24 to 44 logical CPUs that can only run one or two at a time. | |
ID: 51254 | Rating: 0 | rate: / Reply Quote | |
The default BOINC settings for disk space are far too low for your experiment. Good point. I don't remember what the defaults are, but I always increase them anyway. They should probably create a sticky on how to set up BOINC for this project. Also, a sample app_config.xml file showing how to limit the number of work units running at a time would be helpful. Some projects allow you to limit the number of work units downloaded at a time (and even the number of cores to be used per work unit) on their preferences page. That would be even easier. I have trouble connecting here too. It seems to be a problem for everyone outside of Spain, but especially in the U.S. | |
ID: 51255 | Rating: 0 | rate: / Reply Quote | |
My windows machine is doing QC WUs, is this supposed to happen? | |
ID: 51256 | Rating: 0 | rate: / Reply Quote | |
PappaLito said: My windows machine is doing QC WUs, is this supposed to happen? Could be, I also have 2 running on a Windows machine. | |
ID: 51258 | Rating: 0 | rate: / Reply Quote | |
So about half are failing and the other half are finishing. They are using all of the threads on my CPUs. I had a max_concurrent in my app_config, but I think these run under a different app name, so that limitation is being ignored. I should have tried a <project_max_concurrent> but didn't have time to play with it to see if it would restrict the number running. Guess that will have to wait until Saturday when I can re-access those machines. | |
ID: 51261 | Rating: 0 | rate: / Reply Quote | |
So about half are failing and the other half are finishing. They are using all of the threads on my CPUs. Even with a limit on the number of work units running, some of them will fail anyway with a "196 (0xc4) EXIT_DISK_LIMIT_EXCEEDED" error. They need more disk space than they are allowed at the moment, as Toni mentioned below. | |
ID: 51263 | Rating: 0 | rate: / Reply Quote | |
Sorry to hijack this thread again. I might make a new one, but since I made the original post here: the current WUs I sent to the QC app (not the beta) should run fine. They are the same molecules as before. I changed my mind about submitting the other monsters, I'll try to run them on our local cluster first. | |
ID: 51272 | Rating: 0 | rate: / Reply Quote | |
I changed my mind about submitting the other monsters, I'll try to run them on our local cluster first. Well I was just about to put in a 500 GB SSD, but I will wait until you give us the heads up. | |
ID: 51273 | Rating: 0 | rate: / Reply Quote | |
After some time I restarted my Linux box and.... there was a 226 MB upload from a "Toni" quantum chemistry WU!! | |
ID: 51278 | Rating: 0 | rate: / Reply Quote | |
The QC per core RAM load has dropped. What's today's requirement??? | |
ID: 51279 | Rating: 0 | rate: / Reply Quote | |
What are the requirements to receive Regular Non BETA QC Wu's ??? I have never received one, even though I have been calling for them all along. I can get the BETA QC WUs, so one would think I should be able to get the regular QC WUs. Running Windows 10 Pro with 64 GB of memory on an Intel i7 7770k CPU. | |
ID: 51287 | Rating: 0 | rate: / Reply Quote | |
STEVE, Have you gone to BOINCmgr/Options/Computing Preferences/Disk and Memory??? The combination of those 3 checkboxes under Disk must yield over 60 GB of free disk space. Maybe 60 GB per QC WU. | |
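As far as I understand the BOINC client, the most restrictive of the three Disk settings wins. A hypothetical illustration (numbers made up):

```
Disk: use at most 100 GB                  -> limit A = 100 GB
Disk: leave at least 10 GB free           -> limit B = current free space - 10 GB
Disk: use at most 50% of total            -> limit C = 0.5 x drive size

Allowed BOINC disk usage ≈ min(A, B, C)

So on a 1 TB drive with 900 GB free, the example above allows about 100 GB,
which covers a 60 GB QC work unit; "use at most 50 GB" would not.
```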
ID: 51292 | Rating: 0 | rate: / Reply Quote | |
The combination of those 3 checkboxes under Disk must yield over 60 GB of free disk space. Maybe 60 GB per QC WU. As I read Stefan's post, they have backed off from the large ones, and have gone back to the earlier ones. So 30 GB (per work unit) should do; that is what I have been getting. http://www.gpugrid.net/forum_thread.php?id=4785&nowrap=true#51272 It is entirely possible that they will go to the big ones (or even bigger) in the future, unless they can figure out how to break up the work into smaller pieces. | |
ID: 51293 | Rating: 0 | rate: / Reply Quote | |
STEVE, Have you gone to BOINCmgr/Options/Computing Preferences/Disk and Memory??? The combination of those 3 checkboxes under Disk must yield over 60 GB of free disk space. Maybe 60 GB per QC WU. I have a 1 terabyte HD that I run BOINC on, with 878 GB of free space on it, so there should be more than enough space to run the WUs. My settings are: Disk: use at most 750 GB; leave free at least 1 GB; use at most 100% of total. Tasks checkpoint to disk at most every 60 seconds. Swap space: use at most 95% of total. Memory: when computer is in use, use at most 95% of total; when computer is not in use, use at most 98% of total. ____________ STE\/E | |
ID: 51294 | Rating: 0 | rate: / Reply Quote | |
STEVE asked, What are the requirements to receive Regular Non BETA QC Wu's ??? As far as I know, regular non BETA QC WU's are still Linux only. I am getting plenty on my Linux box, none on a Windows box. | |
ID: 51296 | Rating: 0 | rate: / Reply Quote | |
STEVE asked, Yeah, that's what I was afraid of. Guess they don't need the WUs run very fast if it's Linux only. Why would they test the BETA WUs on Windows machines but not run the regular WUs on them? Weird... ____________ STE\/E | |
ID: 51297 | Rating: 0 | rate: / Reply Quote | |
STEVE, | |
ID: 51298 | Rating: 0 | rate: / Reply Quote | |
Well I tried to install a <project_max_concurrent> into the app_config but the beta work units just ignore that restriction. They were utilizing all 20 threads on the computer and starving the GPUs for resources. So I've stopped the beta for now on the machines. I have to rebuild a computer anyway this week so I think I will strip out the GPUs and leave the 16 thread CPU and install a 4 TB HDD. That should be enough I think for the beta work units and I will leave that machine only for those work units. | |
ID: 51301 | Rating: 0 | rate: / Reply Quote | |
Well I tried to install a <project_max_concurrent> into the app_config but the beta work units just ignore that restriction. You could try:

<app_config>
    <app>
        <name>QC_beta</name>
        <max_concurrent>4</max_concurrent>
    </app>
</app_config>

That has worked for me in the past. | |
ID: 51302 | Rating: 0 | rate: / Reply Quote | |
Jim1348 wrote: That has worked for me in the past. Yesterday I tested Quantum Chemistry Beta for Windows, and the QC_beta app name works for limiting the number of concurrently computing WUs. | |
ID: 51303 | Rating: 0 | rate: / Reply Quote | |
Using <project_max_concurrent> is a problem since it counts all GG WUs. Also, they ran out of GPU WUs yesterday. The control of CPU & GPU projects should be completely separate.

<app_config>
    <app>
        <name>acemdlong</name>
        <gpu_versions>
            <cpu_usage>1.0</cpu_usage>
            <gpu_usage>1.0</gpu_usage>
        </gpu_versions>
        <fraction_done_exact>1</fraction_done_exact>
    </app>
    <app>
        <name>acemdshort</name>
        <gpu_versions>
            <cpu_usage>1.0</cpu_usage>
            <gpu_usage>1.0</gpu_usage>
        </gpu_versions>
        <fraction_done_exact>1</fraction_done_exact>
    </app>
    <app>
        <name>QC</name>
        <max_concurrent>8</max_concurrent>
        <fraction_done_exact>1</fraction_done_exact>
    </app>
    <app_version>
        <app_name>QC</app_name>
        <plan_class>mt</plan_class>
        <avg_ncpus>4</avg_ncpus>
    </app_version>
    <app>
        <name>QC_beta</name>
        <max_concurrent>1</max_concurrent>
        <fraction_done_exact>1</fraction_done_exact>
    </app>
    <app_version>
        <app_name>QC_beta</app_name>
        <plan_class>mt</plan_class>
        <avg_ncpus>4</avg_ncpus>
    </app_version>
</app_config> | |
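A small usage note on these app_config.xml examples (standard BOINC behaviour; the path below assumes the Linux package install and may differ on your setup): the file goes in the project's own directory and can be reloaded without restarting the client.

```bash
# Typical location with the Linux package install of BOINC:
#   /var/lib/boinc-client/projects/www.gpugrid.net/app_config.xml
# After editing, reload it via the Manager (Options > Read config files) or:
boinccmd --read_cc_config
```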
ID: 51306 | Rating: 0 | rate: / Reply Quote | |
Well I tried to install a <project_max_concurrent> into the app_config but the beta work units just ignore that restriction. Thanks Jim, I'll give that a shot later this week when I get back to those computers. Since there doesn't appear to be any more beta at the moment, I'll put that in the app_config for when and if those betas return. ____________ | |
ID: 51309 | Rating: 0 | rate: / Reply Quote | |
Two tasks have been frozen at 1,098% even though they reached that point very quickly. I don't think the app is good yet... | |
ID: 51312 | Rating: 0 | rate: / Reply Quote | |
Two tasks have been frozen at 1,098% even though they reached that point very quickly. I don't think the app is good yet... None of the BETA WUs freeze up on my box, but it's 50/50 whether they get a computation error or not. They all seem to run 30 minutes even if they get the computation error; some finish okay but show up as Invalid here at the site... ____________ STE\/E | |
ID: 51316 | Rating: 0 | rate: / Reply Quote | |
Two tasks have been frozen at 1,098% even though they reached that point very quickly. I don't think the app is good yet... My "_TONI_" WUs for Linux are stuck at 10% and the time remaining keeps growing. | |
ID: 51324 | Rating: 0 | rate: / Reply Quote | |
Off to a rough start, but I think I got the kinks worked out. Got my 16-thread i7 on an Alphacool 360 mm radiator. 3 work units at a time with 4 threads each, holding around 61-62 C. Will see how it runs over the next 24 hours. | |
ID: 51382 | Rating: 0 | rate: / Reply Quote | |
So this is the biggest one that I've crunched so far. | |
ID: 51464 | Rating: 0 | rate: / Reply Quote | |
Starting to see a lot of big ones lately, but still none of those giant ones that we had a few months ago. | |
ID: 51481 | Rating: 0 | rate: / Reply Quote | |
Starting to see a lot of big ones lately, but still none of those giant ones that we had a few months ago. It is not as much fun working on the small ones. Maybe when they get the Windows app developed, they will have enough users that they can offer "small" and "large" work units, as defined by disk requirements (or even main memory). | |
ID: 51482 | Rating: 0 | rate: / Reply Quote | |
Starting to see a lot of big ones lately, but still none of those giant ones that we had a few months ago. Are they working on a Windows app? Thanks! | |
ID: 51483 | Rating: 0 | rate: / Reply Quote | |
Are they working on a Windows app? Very much so. I think they are making progress. http://www.gpugrid.net/forum_thread.php?id=4790 | |
ID: 51484 | Rating: 0 | rate: / Reply Quote | |
Current test WUs are limited to 30GB, but many are failing because the limit is hit. I will need to raise the limit (the production QC app is 60 GB). Wow, these things are going to be rough on SSD drive life. How large are the uploads and downloads? | |
ID: 51489 | Rating: 0 | rate: / Reply Quote | |
Wow, these things are going to be rough on SSD drive life. How large are the uploads and downloads? They have gone back to small ones. Running 11 of them on my i7-8700 (single core), my project folder is only 2.1 GB. It is no fun at all. | |
ID: 51492 | Rating: 0 | rate: / Reply Quote | |
Current test WUs are limited to 30GB, but many are failing because the limit is hit. I will need to raise the limit (the production QC app is 60 GB). You could do like I did and install an HDD. Those are fairly reasonable in price. I still have the OS on the SSD, but have BOINC installed on the HDD. Or if it's a new system, just install an HDD. I think I got a 4 TB HDD for like $110 USD. ____________ | |
ID: 51493 | Rating: 0 | rate: / Reply Quote | |
I've had a few "error while computing" for QC tasks on linux recently. I had 4 on Feb 14th and 3 on Feb 13th but none today yet.

<core_client_version>7.9.3</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)
</message>
<stderr_txt>
17:04:03 (7591): wrapper (7.7.26016): starting
17:04:03 (7591): wrapper (7.7.26016): starting
17:04:03 (7591): wrapper: running /usr/bin/flock (/var/lib/boinc-client/projects/www.gpugrid.net/miniconda.lock -c "/bin/bash ./miniconda-installer.sh -b -u -p /var/lib/boinc-client/projects/www.gpugrid.net/miniconda && /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/conda install -m -y -p qmml3 --override-channels -c defaults -c gpugrid --file requirements.txt ")
Python 3.6.5 :: Anaconda, Inc.

# >>>>>>>>>>>>>>>>>>>>>> ERROR REPORT <<<<<<<<<<<<<<<<<<<<<<

    `$ /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/conda install -m -y -p qmml3 --override-channels -c defaults -c gpugrid --file requirements.txt`

    environment variables:
        CIO_TEST=<not set>
        CONDA_ROOT=/var/lib/boinc-client/projects/www.gpugrid.net/miniconda
        PATH=/usr/bin:/bin
        REQUESTS_CA_BUNDLE=<not set>
        SSL_CERT_FILE=<not set>

    active environment : None
    user config file : /var/lib/boinc-client/.condarc
    populated config files :
    conda version : 4.5.4
    conda-build version : not installed
    python version : 3.6.5.final.0
    base environment : /var/lib/boinc-client/projects/www.gpugrid.net/miniconda (writable)
    channel URLs : https://repo.anaconda.com/pkgs/main/linux-64
                   https://repo.anaconda.com/pkgs/main/noarch
                   https://repo.anaconda.com/pkgs/free/linux-64
                   https://repo.anaconda.com/pkgs/free/noarch
                   https://repo.anaconda.com/pkgs/r/linux-64
                   https://repo.anaconda.com/pkgs/r/noarch
                   https://repo.anaconda.com/pkgs/pro/linux-64
                   https://repo.anaconda.com/pkgs/pro/noarch
                   https://conda.anaconda.org/gpugrid/linux-64
                   https://conda.anaconda.org/gpugrid/noarch
    package cache : /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/pkgs
                    /var/lib/boinc-client/.conda/pkgs
    envs directories : /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/envs
                       /var/lib/boinc-client/.conda/envs
    platform : linux-64
    user-agent : conda/4.5.4 requests/2.18.4 CPython/3.6.5 Linux/4.15.0-45-generic linuxmint/19.1 glibc/2.27
    UID:GID : 122:129
    netrc file : None
    offline mode : False

V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V V

CondaHTTPError: HTTP 504 GATEWAY_TIMEOUT for url <https://conda.anaconda.org/gpugrid/linux-64/repodata.json>
Elapsed: 00:59.680897
CF-RAY: 4a92d5abb8af46f2-EWR

A remote server error occurred when trying to retrieve this URL. A 500-type error (e.g. 500, 501, 502, 503, etc.) indicates the server failed to fulfill a valid request. The problem may be spurious, and will resolve itself if you try your request again. If the problem persists, consider notifying the maintainer of the remote server.

A reportable application error has occurred. Conda has prepared the above report. Upload successful.

17:06:11 (7591): /usr/bin/flock exited; CPU time 14.574405
17:06:11 (7591): app exit status: 0x1
17:06:11 (7591): called boinc_finish(195)
</stderr_txt>
]]> | |
ID: 51495 | Rating: 0 | rate: / Reply Quote | |
Yeah, I got a bunch of those as well. I think it's having an issue either downloading or connecting to get the required data. Doesn't happen often, but when it does, it usually results in numerous errors. | |
ID: 51497 | Rating: 0 | rate: / Reply Quote | |
I am running QC on an HP laptop with an E-450 CPU, 8 GB RAM, and a 1 TB hard disk which has an SSD partition of 8 GB. No problem so far, after two SSDs on the same laptop failed one after the other running BOINC tasks. | |
ID: 51574 | Rating: 0 | rate: / Reply Quote | |
A remote server error occurred when trying to retrieve this URL. I get around a 1% to 2% error rate on those, so it happens rarely. But they usually complete on other machines, so it is not the work units but the communications somewhere. | |
ID: 51575 | Rating: 0 | rate: / Reply Quote | |
Upgraded my main CPU cruncher to a 420 mm radiator with push-pull fans and a 4 TB HDD. Need to tweak it just a bit, but it seems stable at 5 work units with 4 threads apiece. Would include a pic, but it would take up a lot of space. | |
ID: 51698 | Rating: 0 | rate: / Reply Quote | |
Message boards : Multicore CPUs : New QC app