Message boards : Multicore CPUs : New batch of QC tasks (QMML)
These are called QMML, and rather experimental (more dependencies). Let's see how they work. | |
ID: 48356 | Rating: 0 | rate: / Reply Quote | |
Toni, I'm seeing this warning in the task output:

> /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/envs/qmml/lib/python3.6/site-packages/tables/path.py:112: NaturalNameWarning: object name is not a valid Python identifier: '122'; it does not match the pattern ``^[a-zA-Z_][a-zA-Z0-9_]*$``; you will not be able to use natural naming to access this object; using ``getattr()`` will still work, though

You should be able to see all of them when the task uploads. Let me know if you need more info.
ID: 48358 | Rating: 0 | rate: / Reply Quote | |
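For anyone curious where that NaturalNameWarning comes from: PyTables only allows attribute-style ("natural") access for node names that are valid Python identifiers. A quick sketch of the check quoted in the warning (my own illustration, not GPUGRID code):

```python
import re

# The pattern quoted in the PyTables NaturalNameWarning above.
IDENT = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]*$")

def is_natural_name(name: str) -> bool:
    """True if `name` can be used for attribute-style (natural) access."""
    return IDENT.match(name) is not None

print(is_natural_name("122"))      # False: starts with a digit
print(is_natural_name("mol_122"))  # True: valid identifier
```

Names like '122' are still reachable via `getattr()` or dictionary-style access, which is why the warning is harmless.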
Thanks, the warnings are expected and harmless. | |
ID: 48359 | Rating: 0 | rate: / Reply Quote | |
Subj: Observation of QC CPU WU's Core Utilization | |
ID: 48362 | Rating: 0 | rate: / Reply Quote | |
Mine is also using just a bit less than 1 thread, with 7 available for CPU usage. Good thing I have another client available to keep the CPU busy. If the progress bar is correct, it will take about 9.5 hours on a 1950X at 3.75 GHz.
ID: 48363 | Rating: 0 | rate: / Reply Quote | |
The next batch (QMML313a) should respect the number of threads requested by your client. | |
ID: 48364 | Rating: 0 | rate: / Reply Quote | |
The Multiple threaded work units that were sent out last month worked fine for me with no issues. | |
ID: 48365 | Rating: 0 | rate: / Reply Quote | |
Hmm 1st one completed for me and the 2nd one is at around 86%. | |
ID: 48366 | Rating: 0 | rate: / Reply Quote | |
Toni said, The next batch (QMML313a) should respect the number of threads requested by your client. It looks like this one does respect the number of threads requested by my client. My app_config specifies 4 threads and it looks to be using 4 threads. Let me know if you need more info. | |
ID: 48370 | Rating: 0 | rate: / Reply Quote | |
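For reference, a minimal app_config.xml along those lines might look like this (a sketch only: the `QC` app name, `mt` plan class, and `--nthreads` flag are taken from a config posted later in this thread; adjust the thread count to taste):

```xml
<!-- Sketch: pin Quantum Chemistry (mt) tasks to 4 threads.
     Place in projects/www.gpugrid.net/ and reread config files. -->
<app_config>
  <app_version>
    <app_name>QC</app_name>
    <plan_class>mt</plan_class>
    <avg_ncpus>4.000000</avg_ncpus>
    <cmdline>--nthreads 4</cmdline>
  </app_version>
</app_config>
```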
I am not able to get QC on my Ryzen 1700 machine running Ubuntu 17.10. I just get a "No tasks are available for Quantum Chemistry" message when I request them. However, I am able to get QC on my i7 3770 machine running Ubuntu 16.04 (both machines have BOINC 7.8.3). Both machines are set to the same profile (work), so they should be treated identically. | |
ID: 48371 | Rating: 0 | rate: / Reply Quote | |
The reason why otherwise similar machines do or don't get work completely baffles me. I don't think it's related to the maker of the CPU. Perhaps it's tied to the history of tasks/host reliability or some such. In this respect, BOINC is of no help.
ID: 48373 | Rating: 0 | rate: / Reply Quote | |
> These are called QMML, and rather experimental (more dependencies). Let's see how they work.

I would really like to get some tasks, but it seems they are not being given out ATM: http://www.gpugrid.net/show_host_detail.php?hostid=457056 I've been trying for a while now. Intel(R) Core(TM) i7-3970X
____________
Crunching@EVGA The Number One Team in the BOINC Community. Folding@EVGA The Number One Team in the Folding@Home Community.
ID: 48374 | Rating: 0 | rate: / Reply Quote | |
The reason why otherwise similar machines get/do not get work completely baffles me. I have seen many instances of it myself over the years, but had hoped the latest BOINC clients were past that. Unfortunately not. | |
ID: 48375 | Rating: 0 | rate: / Reply Quote | |
My 1950x also on 17.10 is getting tasks. | |
ID: 48376 | Rating: 0 | rate: / Reply Quote | |
Most of the Quantum Chemistry v3.14 (mt) tasks fail on my AMD 1700X; v3.13 (mt) worked more or less.
ID: 48378 | Rating: 0 | rate: / Reply Quote | |
I understand that seeing cancelled WUs is not nice, but it saves future crunching and network bandwidth (both server- and client-side) that would otherwise be lost. Also, I thought that the function we use only cancelled unsent or not-yet-started WUs.
ID: 48379 | Rating: 0 | rate: / Reply Quote | |
Hey friends, in case you need some more machines for testing, I can set up another one that is Linux-based. As I have seen in the conversation below, there might be some issues with the CPU type. So I have both brands available for testing. Which one would help you most, Intel or AMD?
ID: 48380 | Rating: 0 | rate: / Reply Quote | |
Quite a few errors on 3.14 so I stopped running QC. Seg faults on AMD and Intel machines. | |
ID: 48381 | Rating: 0 | rate: / Reply Quote | |
So I did get some Tasks but it seems that only AMD Processors can run them. | |
ID: 48383 | Rating: 0 | rate: / Reply Quote | |
I received ~60 new WUs yesterday, but I didn't see what happened with them. I was surprised when I went back to the computer an hour or so later and they had all disappeared. | |
ID: 48384 | Rating: 0 | rate: / Reply Quote | |
I'm not getting any WUs, neither on my AMD, nor on my Intel CPUs. | |
ID: 48386 | Rating: 0 | rate: / Reply Quote | |
Let me summarize the current status. We are making tests in view of a large production run. | |
ID: 48388 | Rating: 0 | rate: / Reply Quote | |
> The multiple-threaded work units that were sent out last month worked fine for me with no issues.

Just an update, as I am still getting these errors. This is the full error:

    CondaValueError: prefix already exists: /home/xxxxxxxx/BOINC/projects/www.gpugrid.net/miniconda/envs/qmml
    ERROR conda.core.link:_execute_actions(337): An error occurred while installing package 'psi4::gcc-5-5.2.0-1'.
    LinkError: post-link script failed for package psi4::gcc-5-5.2.0-1
    running your command again with `-v` will provide additional information
    location of failed script: /home/xxxxxxxx/BOINC/projects/www.gpugrid.net/miniconda/envs/qmml/bin/.gcc-5-post-link.sh
    ==> script messages <==
    <None>
    Attempting to roll back.
    LinkError: post-link script failed for package psi4::gcc-5-5.2.0-1
    running your command again with `-v` will provide additional information
    location of failed script: /home/xxxxxxxx/BOINC/projects/www.gpugrid.net/miniconda/envs/qmml/bin/.gcc-5-post-link.sh
    ==> script messages <==
    <None>
    Traceback (most recent call last):
      File "pre_script.py", line 20, in <module>
        raise Exception("Error installing psi4 dev")
    Exception: Error installing psi4 dev
    10:18:33 (23979): $PROJECT_DIR/miniconda/bin/python exited; CPU time 69.668408
    10:18:33 (23979): app exit status: 0x1
    10:18:33 (23979): called boinc_finish(195)

Conan
ID: 48390 | Rating: 0 | rate: / Reply Quote | |
@conan: do you have "gcc" installed in your system? If not, can you try to install it? | |
ID: 48392 | Rating: 0 | rate: / Reply Quote | |
> @conan: do you have "gcc" installed in your system? If not, can you try to install it?

It was installed on one computer with Fedora 25, but was not installed on the other two with Fedora 16 and Fedora 21 (all 64-bit). I have installed it now and will wait to see what happens. Versions range from 4.6.3-2 (Fedora 16) and 4.9.2-6 (Fedora 21) to 6.4.1-1 (Fedora 25). Thanks, Conan
ID: 48393 | Rating: 0 | rate: / Reply Quote | |
Another different problem is task distribution by the BOINC scheduler. First of all, as said above, some hosts are ignored for no reason I can fathom. Another is that some hosts are "soaking up" dozens of WUs, which means they are not available to others. I am hoping that both problems will sort out by themselves with a sufficiently large batch. Yes! I just got some QC on my Ryzen 1700. All good things come to those that wait. (The first four errored out after a couple of minutes, but the fifth one is running fine after 50 minutes and I think it will fly, running two cores on each WU.) | |
ID: 48394 | Rating: 0 | rate: / Reply Quote | |
I may have understood the problem of hosts not getting WUs. I was sending tasks at a high priority, which means they crossed the threshold to only go to "reliable hosts" -- a questionable heuristic. | |
ID: 48395 | Rating: 0 | rate: / Reply Quote | |
I got my first WU today. Unfortunately the WU needs 4.7 GB of RAM. Can you optimise that?
ID: 48396 | Rating: 0 | rate: / Reply Quote | |
> @conan: do you have "gcc" installed in your system? If not, can you try to install it?

The Fedora 16 host still has the same error, but the Fedora 21 host has been processing a work unit for the last 8 hours 21 minutes and is 68% done, so it looks good at this point. My Fedora 25 host has not received any work yet, so I can't say about that one. My WU is using 1.5 GB of RAM. Thanks, Conan
ID: 48397 | Rating: 0 | rate: / Reply Quote | |
> I may have understood the problem of hosts not getting WUs. I was sending tasks at a high priority, which means they crossed the threshold to only go to "reliable hosts" -- a questionable heuristic.

This solved it for my second computer. It works on a USB stick with Lubuntu 17.04. Unfortunately, it crashed: http://www.gpugrid.net/result.php?resultid=16776102
ID: 48398 | Rating: 0 | rate: / Reply Quote | |
@conan: do you have "gcc" installed in your system? If not, can you try to install it? This WU on the Fedora 21 host worked and completed successfully, my first of this batch. Conan | |
ID: 48400 | Rating: 0 | rate: / Reply Quote | |
@klepel - can you try installing gcc (if not already there)? | |
ID: 48401 | Rating: 0 | rate: / Reply Quote | |
https://drive.google.com/file/d/1bKmSXT4IAVTR8b-fpiGdC6Gm4szduk0X/view?usp=sharing | |
ID: 48402 | Rating: 0 | rate: / Reply Quote | |
What are the requirements for the running of Psi4? | |
ID: 48403 | Rating: 0 | rate: / Reply Quote | |
@dayle - an oscillating CPU% is expected and due to the parts of the calculation which are not parallelized. Thermal throttling is unlikely IMHO (and I imagine it would manifest as a decrease in CPU clock, not CPU%).
ID: 48405 | Rating: 0 | rate: / Reply Quote | |
> My WU is using 1.5 GB of RAM.

18-Dec-2017 14:07:22 [GPUGRID] Quantum Chemistry needs 4768.37 MB RAM but only 3469.24 MB is available for use.
ID: 48406 | Rating: 0 | rate: / Reply Quote | |
We are talking about 3 different memory use figures: | |
ID: 48407 | Rating: 0 | rate: / Reply Quote | |
@dayle - oscillating CPU% is expected and due to the parts of the calculation which are not parallelized. Thermal throttling is unlikely imho (and I imagine it would manifests as a decrease in CPU clock, not CPU%). That one is a little confusing. The remaining time also increases as a consequence. I think I aborted some unnecessarily when it appeared that they were stuck. I now just let them run. Maybe you should make a big sticky on it to catch people's attention? | |
ID: 48409 | Rating: 0 | rate: / Reply Quote | |
Did tasks just get aborted by the system? | |
ID: 48413 | Rating: 0 | rate: / Reply Quote | |
Did tasks just get aborted by the system? Yes. I just had a bunch aborted at 13:44 UTC. But there are now new ones in the pipeline. | |
ID: 48414 | Rating: 0 | rate: / Reply Quote | |
Can you please confirm that those WUs were cancelled while running and not just while waiting to start? | |
ID: 48415 | Rating: 0 | rate: / Reply Quote | |
All I can tell is that the one task I had running was a couple hours from completion the last I looked. | |
ID: 48416 | Rating: 0 | rate: / Reply Quote | |
Can you please confirm that those WUs were cancelled while running and not just while waiting to start? On my i7-4770 machine, there were 13 aborted at 13:43:51 UTC. Twelve of them show 0 elapsed time, but the other one shows 05:02:06 (19:52:01) elapsed time. They are all listed as "cancelled by server". And on an i7-3770 machine, three of them completed just after that, at 13:45:09 UTC, after running for around 24 hours or more each, and all show "cancelled by server". Finally, on my Ryzen 1700 machine, two of them completed at 13:52:55 UTC and show "cancelled by server" after running about 18 to 19 hours. So it works. EDIT: But BoincTasks shows the i7-4770 and the Ryzen 1700 machines as "Reported: OK+", so it is only on the GPUGrid status page that the true story is told apparently. | |
ID: 48417 | Rating: 0 | rate: / Reply Quote | |
It's true that running tasks are being killed. This is not what I expected. | |
ID: 48418 | Rating: 0 | rate: / Reply Quote | |
By the way: these WUs should not run 10+ hours on modern CPUs. That's strange. The i7-3770 machine and the Ryzen 1700 machine were running only 2 cores per work unit, while the i7-4770 was running 4 cores per work unit. | |
ID: 48419 | Rating: 0 | rate: / Reply Quote | |
It's true that running tasks are being killed. This is not what I expected. So far 76 tasks on my machine have been canceled. Please continue killing any task in progress if you don't want the data. No point squandering precious CPU cycles when the science/programming has moved on to a newer revision. Happy Holidays! | |
ID: 48420 | Rating: 0 | rate: / Reply Quote | |
> Can you please confirm that those WUs were cancelled while running and not just while waiting to start?

Yes, I had one that had been running for 33,977 seconds (CPU time 140,002 seconds) and it was cancelled, as well as 2 that had not started. As an aside to my Fedora 16 host's problems (these work units are all failing on it): that host is running gcc 4.6. I did some reading on Psi4 and found that it seems to need gcc 4.9 or later in order to run. I have since installed that gcc version on the computer and am awaiting a work unit to see whether it works. There may still be something missing; I may just have to update Fedora 16 to something more recent. Conan
ID: 48421 | Rating: 0 | rate: / Reply Quote | |
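If you want to check a host's gcc against that apparent 4.9 minimum without eyeballing version strings, a small Python sketch (note: the 4.9 floor is this thread's observation, not an official Psi4 statement):

```python
# Sketch: compare the local gcc version against a minimum.
import shutil
import subprocess

def gcc_version():
    """Return gcc's version as a tuple like (4, 9, 2), or None if gcc is absent."""
    if shutil.which("gcc") is None:
        return None
    out = subprocess.check_output(["gcc", "-dumpversion"], text=True).strip()
    return tuple(int(p) for p in out.split(".") if p.isdigit())

def meets_requirement(ver, minimum=(4, 9)):
    """Tuple comparison handles (4, 6, 3) < (4, 9) correctly."""
    return ver is not None and ver >= minimum

print(meets_requirement((4, 6, 3)))  # False: Fedora 16's gcc is too old
print(meets_requirement((6, 4, 1)))  # True
```

Tuple comparison is the reason to split the version string into integers: comparing the raw strings would rank "4.10" below "4.9".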
I still have problems with tasks failing with "miniconda-installer reached time limit 360". I tried 4 tasks today with the same result (another 2 tasks I cancelled). I have a standard Fedora 26 install, nothing special.
ID: 48422 | Rating: 0 | rate: / Reply Quote | |
I downloaded 3 wus 3.14 on my vbox linux. | |
ID: 48424 | Rating: 0 | rate: / Reply Quote | |
I have still problems with task miniconda-installer reached time limit 360. Tried 4 tasks today with same result (other 2 task I cancelled). Have standard Fedora 26, nothing special. Check that SELINUX is not blocking any files from running. I had this problem on my Fedora 25 install and had to create an exception for it. Also make sure your 'gcc' packages are up to date dnf install gcc, or dnf install gcc-c++, should help if you haven't already done so. Conan | |
ID: 48425 | Rating: 0 | rate: / Reply Quote | |
@petr - miniconda is downloaded from our servers (~50 MB). After that, at the beginning, psi4 and other packages are downloaded from Anaconda's servers (only the first time). If you suspect a mixup, feel free to "reset" the GPUGRID project and everything should be deleted (and downloaded again at the next WU). Beware that it would kill running tasks! | |
ID: 48426 | Rating: 0 | rate: / Reply Quote | |
@ Toni, when these work units run do they get to a certain point then just idle along on a single core for hours on end? | |
ID: 48428 | Rating: 0 | rate: / Reply Quote | |
Run time is now approaching 14 hours and the % done has not moved, this is on a 16 core computer. I seem to recall problems on LHC/ATLAS when running on more than 8 cores, though I was not involved with the problem myself as I run only 7 cores there anyway. But you could try an app_config.xml to limit it to 8 cores. | |
ID: 48429 | Rating: 0 | rate: / Reply Quote | |
The computation is done looping over several molecules (~60 if i remember correctly). A checkpoint is written after each loop. Inside a loop there is a part which is multithreaded, and a part which is not. The relative sizes are different. So it's not strange that thread occupancy oscillates. Limiting the number of cores to, say, 4, via the client is ok. | |
ID: 48430 | Rating: 0 | rate: / Reply Quote | |
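Toni's description can be sketched roughly like this; every name here is hypothetical (the real app calls into psi4), but it shows why CPU% oscillates (serial part, then parallel part, per molecule) and why a checkpoint should survive a restart:

```python
# Hypothetical sketch of the per-molecule loop described above.
import json
import os

def run_batch(molecules, checkpoint="qc_checkpoint.json"):
    start = 0
    if os.path.exists(checkpoint):
        with open(checkpoint) as f:
            start = json.load(f)["next"]          # resume after a restart
    results = []
    for i in range(start, len(molecules)):
        setup = prepare_integrals(molecules[i])   # serial part: ~1 thread busy
        results.append(scf_energy(setup))         # parallel part: all threads busy
        with open(checkpoint, "w") as f:
            json.dump({"next": i + 1}, f)         # checkpoint after each molecule
    return results

# Stand-ins so the sketch runs; the real work is done by psi4.
def prepare_integrals(mol):
    return mol

def scf_energy(setup):
    return len(setup)
```

Because the checkpoint is only written once per molecule, a restart mid-molecule loses at most one iteration's work, not the whole run.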
> Run time is now approaching 14 hours and the % done has not moved, this is on a 16 core computer.

Someone did tests there; ATLAS runs best around 3-5 threads. More threads are not utilized very well. I don't recall any mt BOINC app utilizing all threads at 8+. It would probably have to be a straight math project that calculates more numbers in parallel to do that.
ID: 48431 | Rating: 0 | rate: / Reply Quote | |
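The poor scaling past a handful of threads is what Amdahl's law predicts when part of each iteration is serial. An illustrative calculation (the parallel fraction of 0.8 is made up for the example, not measured from the QC app):

```python
# Amdahl's law: with a fraction p of the work parallelizable,
# the speedup on n threads is 1 / ((1 - p) + p / n).
def speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# With p = 0.8, going from 4 to 16 threads only raises speedup
# from 2.5x to 4.0x, so the extra 12 threads are mostly idle.
for n in (1, 2, 4, 8, 16):
    print(n, round(speedup(0.8, n), 2))
```

This is consistent with the advice to cap mt tasks at a few cores and let the rest of the CPU run other work.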
@Toni, Conan: Thanks to both of you. It looks like SELinux was blocking it; it's kind of a black box for me. I found this procedure, which I applied on my system. I hope I didn't open Pandora's box instead :).
ID: 48433 | Rating: 0 | rate: / Reply Quote | |
> @Toni, when these work units run, do they get to a certain point then just idle along on a single core for hours on end?

Well, I got sick of waiting for this one to finish (it had now been running for over 22 hours, still on 1 core), so I created an app_config.xml file, put it in the project folder, and restarted the BOINC client and manager. The WU reset itself to 1.098% completed, 5 hours run time, and 18 days 23 hours to completion. So 17 to 18 hours of run time disappeared, and all that processing went with it, so apparently no checkpoints. It is now running on 8 CPUs instead of 16, which had stopped other work for a day. Will now see what happens. EDIT: Just after I posted, the WU jumped to 81.741% done and the time to completion dropped to 1 day 11 minutes. So it appears to be working heaps better. Conan
ID: 48438 | Rating: 0 | rate: / Reply Quote | |
It is now running on 8 cpus instead of 16 which had stopped other work for a day. Your host has 2 CPUs, both have 4 cores hyperthreaded, so the performance scaling will drop rapidly if you run more than 8 threads of Floating Point calculations (most of the science projects are using FP). To all multi-threaded CPU crunchers: Hyperthreaded CPUs have half as many cores as BOINC reports, so you should limit the threads utilized by the app to obtain optimal performance / reliability. | |
ID: 48440 | Rating: 0 | rate: / Reply Quote | |
If these workunits are gonna take an average of Fifty five hours of CPU time, they really shouldn't crash when I reboot the system. | |
ID: 48441 | Rating: 0 | rate: / Reply Quote | |
I finally finished a task so I can post now.

    <app_config>
      <app>
        <name>acemdlong</name>
        <max_concurrent>1</max_concurrent>
        <gpu_versions>
          <gpu_usage>1</gpu_usage>
          <cpu_usage>1</cpu_usage>
        </gpu_versions>
      </app>
      <app>
        <name>acemdshort</name>
        <max_concurrent>1</max_concurrent>
        <gpu_versions>
          <gpu_usage>1.0</gpu_usage>
          <cpu_usage>1</cpu_usage>
        </gpu_versions>
      </app>
      <app>
        <name>QC</name>
        <max_concurrent>1</max_concurrent>
      </app>
      <app_version>
        <app_name>QC</app_name>
        <plan_class>mt</plan_class>
        <avg_ncpus>4</avg_ncpus>
        <cmdline>--nthreads 4</cmdline>
      </app_version>
    </app_config>

Does anyone see anything wrong with the app_config?
ID: 48444 | Rating: 0 | rate: / Reply Quote | |
> I finally finished a task so I can post now.

Try `<avg_ncpus>4.000000</avg_ncpus>` instead of your `<avg_ncpus>4</avg_ncpus>` line; it works for me. Conan
ID: 48445 | Rating: 0 | rate: / Reply Quote | |
Can someone explain what the QC app shows for Status in the BOINC Manager. I had a app_config.xml loaded to limit the number of cpu cores it was supposed to use to 4. However in the Status column it showed 16C for the number of cores allotted. The app_config looks the same as mine. Did you reboot in order to activate it? In some of these multi-core projects, the Status is not updated until the next group of work units comes in after you have set the app_config. But a reboot usually fixes it. | |
ID: 48446 | Rating: 0 | rate: / Reply Quote | |
To clarify run times: all the QMML314rst wus are the same length. Even on a single core, they should not take longer than 20h maximum (on a relatively modern PC). The HTTP messages indicate a connectivity problem of course. I hope they cause a failure soon rather than remaining stuck. Re SElinux... I hope it leaves us in peace. | |
ID: 48449 | Rating: 0 | rate: / Reply Quote | |
After about 10 h, and after reaching 69.568%, the app started to use only one core. What's worse, it stays in that state for another 10 h, and perf indicates that it's in an OMP spinlock:
ID: 48454 | Rating: 0 | rate: / Reply Quote | |
> Even on a single core, they should not take longer than 20h maximum (on a relatively modern PC).

They are not behaving that well at all. I did not have any work units complete yesterday on four machines: two i7-3770s running two cores each, and an i7-4770 and a Ryzen 1700 running four cores each. These machines are all on Ubuntu 16/17 and run 24/7. http://www.gpugrid.net/results.php?userid=90514 They must loop back at some point, but I will let them run for another couple of days. By the way, posting is difficult as the website is often inaccessible for a few minutes at a time. Maybe that is related to some of the problems people are having, but I have not looked into it further.
ID: 48455 | Rating: 0 | rate: / Reply Quote | |
Two have just completed on my Ryzen 1700 (4 cores each). The elapsed time shows as 4 hours 10 minutes, but the CPU time is over two days. | |
ID: 48458 | Rating: 0 | rate: / Reply Quote | |
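A quick way to sanity-check such numbers is the ratio of CPU time to elapsed time, which gives the average number of busy threads:

```python
# Average thread occupancy: CPU seconds consumed per wall-clock second.
def effective_threads(cpu_seconds, elapsed_seconds):
    return cpu_seconds / elapsed_seconds

# 4 h 10 min elapsed but "over two days" of CPU time would imply
# far more than the 4 threads allotted per task, so one of the
# reported figures is almost certainly being mis-accounted.
print(round(effective_threads(2 * 86400, 4 * 3600 + 10 * 60), 1))  # about 11.5
```

On a correctly accounted 4-thread task the ratio should sit at or below 4.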
Can someone explain what the QC app shows for Status in the BOINC Manager. I had a app_config.xml loaded to limit the number of cpu cores it was supposed to use to 4. However in the Status column it showed 16C for the number of cores allotted. I reloaded the app_config via the Manager. I was afraid to reboot the machine because I had read earlier in the thread that the tasks would restart and I would lose the processing up to that point. It is normal for BOINC to identify downloaded tasks with the existing cpu/gpu resource usage at time of download. But I could swear I had the app_config in place before I finally snagged my first two tasks. Will wait and see when I can get my next task. | |
ID: 48460 | Rating: 0 | rate: / Reply Quote | |
But I could swear I had the app_config in place before I finally snagged my first two tasks. Will wait and see when I can get my next task. You have to activate the app_config. If you have BoincTasks, there is a way to read all the cc_config and app_config files for any connected machine. (I don't have it in front of me at the moment). Otherwise, a reboot will be necessary. I am having all sorts of problems with the work units, and a reboot is probably not worse than anything else at the moment. | |
ID: 48464 | Rating: 0 | rate: / Reply Quote | |
The official BOINC Manager has an option to reread config files as well. I use BOINC Tasks and a new/updated file is picked up without a reboot or client restart. | |
ID: 48465 | Rating: 0 | rate: / Reply Quote | |
There is also an option in the BOINC code that allows for the number of cpus that you want to use per host. | |
ID: 48467 | Rating: 0 | rate: / Reply Quote | |
Setting CPU % in BOINC is system and project wide. Not very good for fine tuning per project. The app_config was specifically introduced for specific project tuning and is the preferred method to control gpu and cpu usage per application. | |
ID: 48469 | Rating: 0 | rate: / Reply Quote | |
> There is also an option in the BOINC code that allows for the number of cpus that you want to use per host.

> Setting CPU % in BOINC is system and project wide. Not very good for fine tuning per project. The app_config was specifically introduced for specific project tuning and is the preferred method to control gpu and cpu usage per application.

I am not referring to the BOINC client on your personal computer. My comments were aimed at the BOINC server code and are therefore relevant to fine-tuning per project. The option I am referring to is meant for multithreading, so you can set the number of cores that you want an MT work unit to run on. Over at Amicable Numbers I have the default for my 4-core host, 5 cores for my 6-core host, and 8 cores for my 16-core host (allowing 2 work units to run at the same time), so none of the computers have the same setting, but they could if I wanted them to. I found the same issue there that I see here with the 16-core machine, and that is why I set it to 8 cores. app_config.xml works, and works well; I was offering an option especially for those of us who are not too good at creating these XML files, and to show that BOINC does have an option in its code to cover this situation. Conan
ID: 48471 | Rating: 0 | rate: / Reply Quote | |
I am not referring to the Boinc Client on your personal computer. They have that at LHC too, for the ATLAS project. And they used to do something similar at WCG for the CEP2 project (though that was not mt), in order to limit the high number of writes to the disk drive. I think it would be very valuable here, since it appears that limiting the number of cores will be needed for many people, and not everyone will be willing to use app_config.xml files. | |
ID: 48472 | Rating: 0 | rate: / Reply Quote | |
We'll try to limit the number of cores indeed. It requires server-side changes so may not be soon & may not work. | |
ID: 48473 | Rating: 0 | rate: / Reply Quote | |
I understood what you are saying. The only way that works is if you set up different venues for different projects. I work primarily at SETI and Einstein. The venue mechanism does not work correctly and will likely never be updated. Very low chance that any major rework of the BOINC server code happens in the future with the lack of developers. | |
ID: 48476 | Rating: 0 | rate: / Reply Quote | |
I just had to restart the BOINC client and the QMML work unit started back from 0% "fraction done" even though it had a checkpoint time of ~130000 and was at ~65%. Boo. | |
ID: 48477 | Rating: 0 | rate: / Reply Quote | |
Can anybody please try whether a couple of tasks can run simultaneously?
ID: 48483 | Rating: 0 | rate: / Reply Quote | |
If another one would be made available, I could try. I only ever see one task ready to be snagged. Just got one. Happy to report the change in allowed cores limit was properly applied after I changed 4 to 4.0000. | |
ID: 48500 | Rating: 0 | rate: / Reply Quote | |
Why is there such a big credit difference? | |
ID: 48511 | Rating: 0 | rate: / Reply Quote | |
I just grabbed 2 QC tasks and I will attempt to run them simultaneously tomorrow during the SETI outage. | |
ID: 48512 | Rating: 0 | rate: / Reply Quote | |
These two WUs ran concurrently on one of my FX-8350 machines, using 4 cores each. They were started within about twenty minutes of each other and finished about 4 minutes apart.
ID: 48514 | Rating: 0 | rate: / Reply Quote | |
> I just grabbed 2 QC tasks and I will attempt to run them simultaneously tomorrow during the SETI outage.

I just finished these two QC tasks run concurrently:

Task 16798843 (WU 12932712, host 456812): sent 27 Dec 2017 3:00:54 UTC, reported 28 Dec 2017 7:57:47 UTC, completed and validated, run time 26,162.47 s, CPU time 103,683.40 s, credit 4,277.76, Quantum Chemistry v3.14 (mt)
Task 16798838 (WU 12932759, host 456812): sent 27 Dec 2017 2:58:51 UTC, reported 28 Dec 2017 7:57:47 UTC, completed and validated, run time 26,201.39 s, CPU time 103,805.90 s, credit 4,284.12, Quantum Chemistry v3.14 (mt)

I started them within a minute of each other, using 4 cores each. I also had 3 Einstein GPU tasks running concurrently with them. The system is an AMD Ryzen 1800X (8 cores / 16 threads) with three Nvidia GTX 970s. They didn't appear to have any problems; the tasks ran right through with about 70% CPU utilization.
ID: 48520 | Rating: 0 | rate: / Reply Quote | |
Thanks @keith, thanks @starbase! | |
ID: 48521 | Rating: 0 | rate: / Reply Quote | |
Toni, well there's your answer, if you discount the low population sample. I doubt that the two successful runs were due to the brand of CPU, but I could be wrong. As long as you are using a client later than 7.0.40, you can use an app_config.xml file to tune the number of cores you allow the task to run on.
ID: 48527 | Rating: 0 | rate: / Reply Quote | |
My Ryzen 1700 machine has certainly done better than my two i7-3770 PCs (all machines on Ubuntu, run 24/7 and otherwise set up the same): | |
ID: 48532 | Rating: 0 | rate: / Reply Quote | |
Thanks for the post. Interesting. I could discount the fact that the Ryzens have 8 physical cores, so on paper a good head start over the Intel 4-core CPUs. But the FX-8350 earlier in the thread had a good result too, with only 4 physical modules and much more handicapped shared FP units in those modules compared to Ryzen and Intel. We need a lot more samples to clarify definitively, I think.
ID: 48534 | Rating: 0 | rate: / Reply Quote | |
> Need a lot more samples to definitively clarify I think.

Yes. I am not looking at the output, really only at the error rate. All the machines are now running two cores per work unit, and only one work unit per machine, though earlier I had been running four cores on the AMD machine. And both the Intel and AMD cores run at about the same speed, so the output should be comparable now anyway. I have changed one of the i7-3770 machines (GTX-1070-PC) from Ubuntu 17.10 (and BOINC 7.8.3) back to Ubuntu 16.04 (and BOINC 7.6.31). I doubt that it will make much difference, but I will let it run for a couple of weeks. If I continue to get more errors on Intel, I think I will go with just the Ryzen PC.
ID: 48535 | Rating: 0 | rate: / Reply Quote | |
Good experiment. I was just thinking that spreading the compute load over 4 of the Ryzen's 8 cores is less taxing than a 100% workload over all 4 cores on the Intel. If the test on 2 cores and on 1 core is equally stable, just slower, it might suggest something bothersome in the Intel architecture. The different OS platform could have a big effect too.
ID: 48537 | Rating: 0 | rate: / Reply Quote | |
It appears one cannot start two or more of these type of WU's simultaneously. One of the two errored as shown below. | |
ID: 48550 | Rating: 0 | rate: / Reply Quote | |
These are the last two, one of which errored: | |
ID: 48551 | Rating: 0 | rate: / Reply Quote | |
> It appears one cannot start two or more of these type of WU's simultaneously.

That is curious; thanks for the report. The new DOMINIKs run fine on my two i7-3770s thus far, probably since they are shorter than the TONIs and don't get to the point of hanging up. And the Ryzen 1700 continues to do well, but there must be some selection process going on at the server, since it is getting only the TONIs. They are all reissues now, but it has handled them all thus far, even a _8. That is a good idea, since it makes optimum use of each CPU type. If things continue this way, I will just let all the machines run.
ID: 48552 | Rating: 0 | rate: / Reply Quote | |
It looks like multiple WUs running together are OK, but starting at exactly the same moment is not. I hope it is a relatively rare occurrence. In principle I could add a locking mechanism, but I am not enthusiastic, because that would invite more failure modes (e.g. stale locks) to solve a relatively rare case.
ID: 48553 | Rating: 0 | rate: / Reply Quote | |
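For what it's worth, the locking idea (and the stale-lock failure mode Toni mentions) can be sketched in a few lines; this is my own illustration, not GPUGRID's code, and the file name is made up:

```python
# Sketch: atomic lockfile so only one task initializes the shared
# miniconda environment at a time. O_CREAT | O_EXCL guarantees that
# of two tasks starting at the same instant, exactly one wins.
import os
import time

def try_lock(path, stale_after=3600):
    """Return True if the lock was acquired, False if someone holds it."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.write(fd, str(os.getpid()).encode())
        os.close(fd)
        return True
    except FileExistsError:
        # Stale-lock handling: break locks older than stale_after seconds.
        # This is the fragile part Toni is wary of -- it is itself racy.
        if time.time() - os.path.getmtime(path) > stale_after:
            os.unlink(path)
            return try_lock(path, stale_after)
        return False

def unlock(path):
    os.unlink(path)
```

The losing task would then wait and retry, or fail fast and let the client restart it; either choice adds a failure mode that doesn't exist without the lock, which is Toni's point.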
Looks like that multiple WUs together is ok, but starting exactly at the same time is not. I hope it is a relatively rare occurrence. In principle I could put a locking mechanism, but I am not enthusiastic because that would be inviting more failure modes (e.g. stale locks) to solve a relatively rare case. So every new member crunching these will get an error on their 1st task. Genius. It absolutely should be fixed. | |
ID: 48554 | Rating: 0 | rate: / Reply Quote | |
> Looks like multiple WUs together are OK, but starting at exactly the same time is not. I hope it is a relatively rare occurrence. In principle I could put in a locking mechanism, but I am not enthusiastic, because that would invite more failure modes (e.g. stale locks) to solve a relatively rare case.

I got two errors at first also, but none since, and I did not look at the reason. It is over and done with, and not a problem. I think if you look hard at the logic of lock mechanisms, it is logically impossible to fix simultaneity problems completely: any delay you put in will match some other starting situation and result in an error as well. You can try, but I don't think it is worth the effort either. EDIT: I would look to see if it happens again with the next batch. If so, then I would investigate something, whatever it is. But the errors were very short for me, and no real time was lost.
ID: 48555 | Rating: 0 | rate: / Reply Quote | |
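To make Toni's stale-lock concern concrete, here is a minimal sketch (hypothetical, not GPUGRID's actual code) of the kind of startup lock being discussed: an exclusively-created lock file serializes task startup, but if a task crashes without removing it, every later task spins until timeout and then fails anyway.

```python
import errno
import os
import time

LOCK = "/tmp/qmml_startup.lock"  # hypothetical lock-file path

def acquire_startup_lock(timeout=30.0, poll=0.5):
    """Try to create the lock file exclusively; return True on success.

    If a previous task crashed without removing LOCK, this spins until
    timeout expires -- the "stale lock" failure mode Toni mentions.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # O_EXCL makes creation atomic: only one process can win.
            fd = os.open(LOCK, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.write(fd, str(os.getpid()).encode())
            os.close(fd)
            return True
        except OSError as e:
            if e.errno != errno.EEXIST:
                raise
            time.sleep(poll)  # another task holds the lock; wait and retry
    return False

def release_startup_lock():
    """Remove the lock file; tolerate it already being gone."""
    try:
        os.remove(LOCK)
    except FileNotFoundError:
        pass
```

The trade-off is exactly as described in the post: this prevents simultaneous starts, but swaps a rare race for a new failure mode (a leftover lock after a crash blocks every subsequent task).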
I would speculate that the probability of two or more WUs starting at exactly the same time in an unattended environment would be low. However, clients running multiple projects that compete for the same GPU/CPU resources usually switch between them at a specified time interval to give each project processing time. With that in mind, it is possible for two QC jobs to finish just as the BOINC client nears the end of a switch-app interval; the client changes to the other project, and then, when it comes time to switch back to the GPUGRID CPU WUs with a queue full of QC jobs, it starts two or more simultaneously. | |
ID: 48556 | Rating: 0 | rate: / Reply Quote | |
You are right that BOINC is not really a random environment, and if it happens in a more-or-less predictable manner, it should be possible to prevent it. We will see how often that happens. | |
ID: 48557 | Rating: 0 | rate: / Reply Quote | |
You are right that BOINC is not really a random environment, and if it happens in a more-or-less predictable manner, it should be possible to prevent it. We will see how often that happens. Agreed, predictable is the key. I really need to find out how the projects I crunch for work unattended, because all my little headless crunchers are in the process of being converted to diskless/headless cluster nodes, with one of the FX systems as the master. This should all prove interesting, as this is the first project I've worked with that uses multiple cores for a single WU. | |
ID: 48569 | Rating: 0 | rate: / Reply Quote | |
When there is work, the GPUGRID servers send it out in batches, so two tasks end up starting at once since they have short deadlines. I also said new members, so again multiple tasks starting at once. Not everyone runs the same project all the time, so tasks have a chance to get off sequence. | |
ID: 48571 | Rating: 0 | rate: / Reply Quote | |
I have changed one of the i7-3770 machines (GTX-1070-PC) from Ubuntu 17.10 (and BOINC 7.8.3) back to Ubuntu 16.04 (and BOINC 7.6.31). I just completed two TONI work units, one on each of my i7-3770 PCs (2 cores per work unit): http://www.gpugrid.net/workunit.php?wuid=12932866 http://www.gpugrid.net/workunit.php?wuid=12932333 They had each errored out on other PCs, and I don't know why they worked on mine. But I do know that they each got stuck at 78.698% until I rebooted, and then they completed normally. However, the total Run time shown does not include the time they were stuck, which was about two hours in each case. This is no way to get work done; I can't be rebooting for each work unit. So I will have to just stop on the i7-3770 machines and continue only with the Ryzen 1700, which continues to work fine. Note that the new DOMINIK work units are no problem - if they could send only those to the Intel machines, I think the problem would be solved. | |
ID: 48578 | Rating: 0 | rate: / Reply Quote | |
I just ran a DOMINIK QC task and it ran very fast. | |
ID: 48586 | Rating: 0 | rate: / Reply Quote | |
50,000 WUs... Holy ****. If only those were GPU WUs | |
ID: 48587 | Rating: 0 | rate: / Reply Quote | |
50,000 WUs... Holy ****. If only those were GPU WUs I saw that too and downloaded 4 on a computer I haven't run any of these on. They started at the same time and boom, all went to crap. I even tried pausing some to stop the errors. http://www.gpugrid.net/results.php?hostid=458003 | |
ID: 48588 | Rating: 0 | rate: / Reply Quote | |
So far, what I have learned about starting more than one multi-CPU job at a time in a split-core scenario is that they need not start at exactly the same time for one to error. Given starts that are close but not simultaneous, the one started first always errors, while the second one processes to successful completion; the next WU in the queue then starts and also completes successfully. In four of four tries with near-simultaneous starts, the one started first failed. Second, the danger window between the first WU's start and when the second is allowed to start extends up to 5 seconds, as tested so far. Third, when the BOINC client switches between projects, the QC WUs observed so far complete in pairs, leaving no single job in progress (suspended) to stagger the start times. This unfortunate characteristic means "simultaneous (or nearly so) starts" cause an error whenever the client switches back to the GPUGRID CPU jobs with a queue larger than one WU. I guess the only way to prevent this behavior is to not split cores, especially on unattended clients. | |
ID: 48589 | Rating: 0 | rate: / Reply Quote | |
I didn't have that experience with the two TONI tasks I started simultaneously, or within the 5-second window you described. Both completed successfully. I am limiting core usage to four with an app_config file, and limiting max_concurrent to 1 now, since I also crunch SETI CPU tasks on that computer. I ran the two concurrent jobs when Toni asked users to try that experiment. | |
ID: 48590 | Rating: 0 | rate: / Reply Quote | |
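For reference, the thread/concurrency limits described above are set with an `app_config.xml` in the project directory. This is a sketch only: the `<name>` value below is a placeholder, and you should take the real app name from client_state.xml (or the task properties) on your own host.

```xml
<app_config>
  <app>
    <name>QC</name>                    <!-- placeholder: use the app name from client_state.xml -->
    <max_concurrent>1</max_concurrent> <!-- run at most one QC task at a time -->
  </app>
  <app_version>
    <app_name>QC</app_name>            <!-- same placeholder app name -->
    <plan_class>mt</plan_class>
    <avg_ncpus>4</avg_ncpus>           <!-- limit each multithreaded task to 4 threads -->
  </app_version>
</app_config>
```

After saving the file, use the BOINC Manager's "Options > Read config files" (or restart the client) for the limits to take effect.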
Wow, the credit awarded is all over the place for these DOMINIK tasks. Obviously NOT tied to computation time or resources used for compute. | |
ID: 48591 | Rating: 0 | rate: / Reply Quote | |
I just got a task and it finished on a dual e5-2450l 32g ram server fine. | |
ID: 48592 | Rating: 0 | rate: / Reply Quote | |
resultid=16815470 | |
ID: 48593 | Rating: 0 | rate: / Reply Quote | |
Hello Keith, Wow, the credit awarded is all over the place for these DOMINIK tasks. Obviously NOT tied to computation time or resources used for compute. Did you observe this behavior multiple times? Really strange to be honest. Thanks for helping out everyone! | |
ID: 48597 | Rating: 0 | rate: / Reply Quote | |
All tasks were erroring out in 2 minutes due to the app using gcc 5.5. Fixed by installing it: sudo apt-get install gcc-5 g++-5 | |
ID: 48603 | Rating: 0 | rate: / Reply Quote | |
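Since the gcc-related failures above depend on which compiler version a host actually has, here is a small helper sketch (my own illustration, not part of the app) for checking the major version of a given gcc command before deciding whether `sudo apt-get install gcc-5 g++-5` is needed:

```python
import shutil
import subprocess

def gcc_major(cmd="gcc"):
    """Return the major version of `cmd` (e.g. 5 for gcc 5.5), or None if absent."""
    if shutil.which(cmd) is None:
        return None
    # -dumpversion prints e.g. "5.5.0" on older gcc or "7" on newer ones.
    out = subprocess.check_output([cmd, "-dumpversion"], text=True).strip()
    return int(out.split(".")[0])
```

On a host where the install succeeded, `gcc_major("gcc-5")` should report 5; on one where it failed or was never run, it returns None.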
Yup, it completed. | |
ID: 48608 | Rating: 0 | rate: / Reply Quote | |
Great! Thank you very much | |
ID: 48610 | Rating: 0 | rate: / Reply Quote | |
@klepel - can you try installing gcc (if not already there)? I tried it yesterday. I installed gcc-5 and gcc-6. And it worked on the computer http://www.gpugrid.net/results.php?hostid=452211 | |
ID: 48613 | Rating: 0 | rate: / Reply Quote | |
These are the two WUs that were started about 5 seconds apart on an AMD FX-8350, with the first one started failing. I didn't have that experience with the two TONI tasks I started simultaneously, or within the 5-second window you described. Both completed successfully. I am limiting core usage to four with an app_config file, and limiting max_concurrent to 1 now, since I also crunch SETI CPU tasks on that computer. I ran the two concurrent jobs when Toni asked users to try that experiment. Since this hasn't been the case with your Intels, this could be a CPU-related phenomenon (architecture/scheduling differences). Perhaps the Intels can handle the initial startup processes faster than the FX-series AMDs (I might spring for a Ryzen 7 soon just to check them as well). Regardless, the issue is resolved on my systems by limiting concurrent QC jobs to one and using the other four cores to run WCG, as to date I have not experienced a concurrency issue with the WCG WUs. | |
ID: 48616 | Rating: 0 | rate: / Reply Quote | |
Hello Keith, Yes, the first completed tasks got reasonable credit. Then when I downloaded more, all the credit for them nosedived. Once I saw that they weren't worth running I set NNT. All tasks below are from host 456812, app Quantum Chemistry v3.14 (mt), all "Completed and validated":

Task     | WU       | Sent (UTC)           | Reported (UTC)       | Run time | CPU time | Credit
16862848 | 12962574 | 3 Jan 2018 23:38:10  | 4 Jan 2018 1:20:46   | 1,000.69 | 3,715.60 | 21.12
16815333 | 12963093 | 4 Jan 2018 1:00:00   | 4 Jan 2018 2:57:45   | 1,020.00 | 3,812.63 | 27.38
16815332 | 12963092 | 4 Jan 2018 0:59:23   | 4 Jan 2018 2:40:47   | 990.38   | 3,697.44 | 25.78
16815320 | 12963080 | 4 Jan 2018 1:00:37   | 4 Jan 2018 3:14:28   | 997.66   | 3,731.61 | 27.28
16815307 | 12963067 | 4 Jan 2018 1:07:01   | 4 Jan 2018 4:04:01   | 1,009.28 | 3,642.48 | 26.40
16815275 | 12963035 | 4 Jan 2018 1:07:38   | 4 Jan 2018 4:21:09   | 1,033.63 | 3,668.31 | 26.52
16815264 | 12963024 | 4 Jan 2018 1:06:24   | 4 Jan 2018 3:46:54   | 969.15   | 3,592.69 | 25.72
16815248 | 12963008 | 4 Jan 2018 0:58:46   | 4 Jan 2018 2:24:17   | 935.31   | 3,503.52 | 23.59
16815234 | 12962994 | 4 Jan 2018 1:05:48   | 4 Jan 2018 3:30:44   | 981.30   | 3,616.45 | 26.43
16815171 | 12962931 | 3 Jan 2018 23:37:35  | 4 Jan 2018 1:04:10   | 929.82   | 3,523.44 | 18.45
16815159 | 12962919 | 3 Jan 2018 23:35:43  | 4 Jan 2018 0:16:22   | 986.88   | 3,687.96 | 161.36
 | |
ID: 48617 | Rating: 0 | rate: / Reply Quote | |
I have to report back on the AMD Ryzen 1700x Computer: http://www.gpugrid.net/results.php?hostid=420971 | |
ID: 48618 | Rating: 0 | rate: / Reply Quote | |
However, as BOINC downloads several of these Quantum Chemistry v3.14 (mt) WUs, BOINC thinks my CPU cache is full and refuses to download additional CPU WUs from PrimeGrid. So after a while the CPU is loaded with only one QC WU (4 threads) and the rest of the cores are idle - not very efficient. If you temporarily suspend the QC jobs (except perhaps the one in progress), you should be able to download more work from your other projects; once it is downloaded, resume the QC jobs and let BOINC take over running the various projects as you have them configured. You may need to "update" the projects you want more work from under the "Projects" tab to initiate the downloads right away. Edit: Close quote and add last sentence. | |
ID: 48619 | Rating: 0 | rate: / Reply Quote | |
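The suspend/fetch/resume workaround above can also be scripted for an unattended box with boinccmd (a sketch; it assumes boinccmd is talking to the local client and that these are the project URLs your client is attached under):

```shell
# Suspend GPUGRID so the client will request CPU work elsewhere
boinccmd --project http://www.gpugrid.net/ suspend

# Trigger an immediate scheduler request at the project you want work from
boinccmd --project http://www.primegrid.com/ update

# Once the new work has downloaded, let GPUGRID run again
boinccmd --project http://www.gpugrid.net/ resume
```

In practice you would wait (or sleep) between the update and the resume so the downloads have time to arrive.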
Seti@home has been down all day, so I ran out of work. Decided to give the QC DOMINIK tasks another try, thinking that possibly the last batch were one-offs, or that something was different about the computer from when I first ran them. | |
ID: 48624 | Rating: 0 | rate: / Reply Quote | |
Yeah, credit took a dump on the last two I completed.

Run Time | CPU Time  | Credit
2,389.43 | 26,206.90 | 549.62
3,034.92 | 36,710.33 | 94.32
 | |
ID: 48632 | Rating: 0 | rate: / Reply Quote | |
I got 5.97 points for 32 minutes of calculation on my FX-6100. | |
ID: 48639 | Rating: 0 | rate: / Reply Quote | |
If you care to learn about why the low credit or why certain tasks get hi-middle-low credit, read my post here | |
ID: 48642 | Rating: 0 | rate: / Reply Quote | |
The project is using the "old" BOINC credit algorithm. If I remember how it works, the very first few tasks of a new application that BOINC sees are given very high credit. Then, as more tasks are returned, the algorithm tunes the credit downwards. The APR for the application stabilizes after 11 valid tasks have been returned. This project uses the per-application APR function of BOINC. It is up to each project to decide whether to follow the BOINC standard or write their own. | |
ID: 48643 | Rating: 0 | rate: / Reply Quote | |
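A toy sketch of the warm-up behavior Keith describes (my own simplification, not BOINC's actual CreditNew code): until enough valid results are in, the host's average processing rate (APR) falls back on a server-side estimate, which is why the first tasks can score wildly high before the average settles.

```python
MIN_SAMPLES = 11  # valid tasks needed before the host's own APR is trusted

class AppVersionRate:
    """Toy running average of a host's processing rate for one app version."""

    def __init__(self, project_estimate):
        self.project_estimate = project_estimate  # server-side initial guess
        self.samples = []

    def report(self, measured_rate):
        """Record the rate measured from one validated task."""
        self.samples.append(measured_rate)

    @property
    def apr(self):
        # Before MIN_SAMPLES validated tasks, fall back on the project-wide
        # estimate -- if that guess is inflated, early credit is inflated too.
        if len(self.samples) < MIN_SAMPLES:
            return self.project_estimate
        return sum(self.samples) / len(self.samples)
```

With an inflated initial estimate of 100 and a true rate of 10, the first ten tasks would be scored from the estimate; from the eleventh on, the measured average takes over and credit "nosedives", matching what was observed in the thread.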
Seems to me that credit should be based on the work itself, not CPU time: one WU should be worth a specific amount of credit regardless of how fast or slowly it is processed. It would probably require a large bureaucratic committee to attempt to quantify such a value, but that's my 2 cents, and maybe it would be worth it if we get an answer within a few years :). | |
ID: 48653 | Rating: 0 | rate: / Reply Quote | |
Seems to me that credit should be based on the work itself, not CPU time: one WU should be worth a specific amount of credit regardless of how fast or slowly it is processed. It would probably require a large bureaucratic committee to attempt to quantify such a value, but that's my 2 cents, and maybe it would be worth it if we get an answer within a few years :). That's how Einstein handles credit. They don't use the BOINC algorithm; they decide how much credit each application awards, independent of run time. | |
ID: 48654 | Rating: 0 | rate: / Reply Quote | |
Just posting to say that I have finally gotten my 6 core Phenom CPU working on this project. | |
ID: 48733 | Rating: 0 | rate: / Reply Quote | |