Message boards : News : Experimental Python tasks (beta)
I'm creating some experimental tasks for the Python app (made Beta). They are Linux and CUDA specific and serve in preparation for future batches. | |
ID: 55588 | Rating: 0 | rate: / Reply Quote | |
I'm creating some experimental tasks for the Python app (made Beta). They are Linux and CUDA specific and serve in preparation for future batches. Preference ticked, ready and waiting... EDIT: Received some already https://www.gpugrid.net/result.php?resultid=29466771 https://www.gpugrid.net/result.php?resultid=29466770 Conda warnings reported. Will you push out an update with the app (or is it safe to ignore)? Also warnings about a path not found:
WARNING conda.core.envs_manager:register_env(50): Unable to register environment. Path not writable or missing.
environment location: /var/lib/boinc-client/projects/www.gpugrid.net/miniconda
registry file: /root/.conda/environments.txt
The registry file location ( /root/ ) will not be accessible to the boinc user unless conda is already installed on the host (by the root user) and the conda file is world readable. Otherwise, the task status is Completed and Validated. | |
ID: 55590 | Rating: 0 | rate: / Reply Quote | |
Looks harmless, thanks for reporting. It's because the "boinc" user doesn't have a HOME directory I think. | |
ID: 55591 | Rating: 0 | rate: / Reply Quote | |
Looks harmless, thanks for reporting. It's because the "boinc" user doesn't have a HOME directory I think. Agreed. Perhaps adding a "./envs" switch to the end of the command:
/var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/conda install
may help with setting up the environment. This switch should place the environment files in the current directory from which the command is executed. | |
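Something along these lines, roughly - a sketch only, assuming the wrapper drives conda from a Python script (the package list is a placeholder, not the real one); conda's --prefix (-p) option is what actually points it at a directory-based environment such as ./envs:

# Rough sketch, not the project's actual wrapper code: build the environment
# under the project directory instead of anything beneath /root or $HOME.
import subprocess

CONDA = "/var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/conda"

# --prefix (-p) makes conda use a directory-based environment, e.g. ./envs
subprocess.run([CONDA, "create", "--yes", "--prefix", "./envs", "python=3.7"], check=True)
subprocess.run([CONDA, "install", "--yes", "--prefix", "./envs", "some-package"], check=True)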
ID: 55592 | Rating: 0 | rate: / Reply Quote | |
I got one of these tasks which confused me as I have not set "accept beta applications" in my project preferences. | |
ID: 55724 | Rating: 0 | rate: / Reply Quote | |
What is the difference between these test Python apps and the standard one? Is it just that this application is coded in Python? What language are the default apps coded in? | |
ID: 55920 | Rating: 0 | rate: / Reply Quote | |
Both apps are wrappered. One is the stock acemd3 and I assume is written in some form of C. | |
ID: 55926 | Rating: 0 | rate: / Reply Quote | |
CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://conda.anaconda.org/conda-forge/linux-64/_libgcc_mutex-0.1-conda_forge.tar.bz2> I am receiving this error in STDerr Output for Experimental Python tasks on all my hosts. This is probably due to the fact all my PCs are behind a proxy. Can you please set the Python tasks to use the Proxy defined in the Boinc Client? Work Units here: https://www.gpugrid.net/result.php?resultid=31672354 https://www.gpugrid.net/result.php?resultid=31668427 https://www.gpugrid.net/result.php?resultid=31665961 | |
ID: 55933 | Rating: 0 | rate: / Reply Quote | |
Boy, mixing both regular acemd3 and the python anaconda tasks sure F*s up the APR for both tasks. The insanely low APR for the Python tasks is forcing all GPUGrid tasks into High Priority. | |
ID: 55936 | Rating: 0 | rate: / Reply Quote | |
Boy, mixing both regular acemd3 and the python anaconda tasks sure F*s up the APR for both tasks. The insanely low APR for the Python tasks is forcing all GPUGrid tasks into High Priority. I'm seeing that too lol, but it doesn't seem to be causing too much trouble for me since I don't run more than one GPU project concurrently. Only have Prime and backup. Copying my message from another thread with my observations about these tasks, in case Toni doesn't check the other threads: Looks like I have 11 successful tasks, and 2 failures. ____________ | |
ID: 55945 | Rating: 0 | rate: / Reply Quote | |
Boy, mixing both regular acemd3 and the python anaconda tasks sure F*s up the APR for both tasks. The insanely low APR for the Python tasks is forcing all GPUGrid tasks into High Priority. Actually, that won't be the cause. The APRs are kept separately for each application, and once you have an 'active' APR (11 or more 'completions' - validated tasks for that app), they should keep out of each other's way. What will F* things up is that this project still allows DCF to run free - and that's a single value which is applied to both task types. | |
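A toy illustration of the mechanism, with invented numbers and a deliberately simplified DCF update (the real client logic is more involved), using task sizes like the ones mentioned further down this thread:

# Toy example only: one shared DCF means a badly-sized app drags the other
# app's runtime estimates around with it.
def estimate(rsc_fpops_est, projected_flops, dcf):
    return rsc_fpops_est / projected_flops * dcf

flops = 1e11      # invented effective speed the client projects for this host
dcf = 1.0

# Python beta task: tiny fpops estimate but hours of actual runtime,
# so the client pushes DCF way up (the client clamps it at 100).
raw = estimate(3_000e9, flops, 1.0)              # ~30 s raw estimate
actual = 20_000                                  # ~5.5 h actual
dcf = min(100.0, actual / raw)                   # crude "catch up" step
print(f"python raw estimate {raw:.0f}s, actual {actual}s, DCF now {dcf:.0f}")

# The same DCF is now applied to acemd3, whose estimate was fine before,
# so its estimate explodes and the client panics into high priority.
print(f"acemd3 estimate now {estimate(5_000_000e9, flops, dcf):,.0f}s")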
ID: 55946 | Rating: 0 | rate: / Reply Quote | |
Yeah, after I wrote that I realized I meant the DCF is what is messing up the runtime estimations. | |
ID: 55947 | Rating: 0 | rate: / Reply Quote | |
what's DCF? | |
ID: 55948 | Rating: 0 | rate: / Reply Quote | |
what's DCF? Duration Correction Factor. The older BOINC server versions still use it, like Einstein's. It messes up estimates for GPU tasks of different apps there too. | |
ID: 55949 | Rating: 0 | rate: / Reply Quote | |
You can't talk about 'their DCFs' - there is only one (there could have been more than one, but that's the way David chose to play it) | |
ID: 55951 | Rating: 0 | rate: / Reply Quote | |
What is also messed up is the estimated computation size of the Anaconda Python tasks shown in the task properties. | |
ID: 55953 | Rating: 0 | rate: / Reply Quote | |
You can't talk about 'their DCFs' - there is only one (there could have been more than one, but that's the way David chose to play it) This daily driver's GPUGrid DCF, shown in the project properties, is currently at 85 and change. | |
ID: 55954 | Rating: 0 | rate: / Reply Quote | |
What is also messed up is the estimated computation size of the Anaconda Python tasks shown in the task properties. Can confirm. Could this be why the credit reward is so high too? I wonder what the flop estimate was on this one from Kevvy: https://www.gpugrid.net/result.php?resultid=31679003 He got wrecked on this one: over 5 hrs on a 2080 Ti, and got a mere 20 credits lol. ____________ | |
ID: 55955 | Rating: 0 | rate: / Reply Quote | |
I've got one running now on an RTX 2070S and the only real issue is low GPU utilization (60-70%). The current task is using ~2 GB of VRAM and ~3 GB of system RAM. I have one thread free on a Ryzen 3900X to support the GPU and that thread is running at 100%. This computer has completed 3 of the new python tasks successfully. | |
ID: 55956 | Rating: 0 | rate: / Reply Quote | |
I've got one running now on an RTX 2070S and the only real issue is low GPU utilization (60-70%). The current task is using ~2 GB of VRAM and ~3 GB of system RAM. I have one thread free on a Ryzen 3900X to support the GPU and that thread is running at 100%. This computer has completed 3 of the new python tasks successfully. What kind of BOINC install do you have? Does it run as a service, or is it a standalone install that runs from an executable? What is the clock speed of your 3900X, and the memory speed as well? Try leaving 2 spare free threads (so you have one doing nothing) to avoid maxing out the CPU to 100% utilization on all threads; this is known to slow down GPU work. That might increase your GPU utilization a bit. ____________ | |
ID: 55957 | Rating: 0 | rate: / Reply Quote | |
There's an explanation for 20 credit tasks over at Rosetta. | |
ID: 55958 | Rating: 0 | rate: / Reply Quote | |
What kind of BOINC install do you have? Does it run as a service, or is it a standalone install that runs from an executable? That was one of the questions I wanted to ask Mr. Kevvy, since he seems to be the first cruncher to successfully crunch a ton of them without errors. I wondered if his BOINC was a service install or a standalone. [Edit] OK, so Mr. Kevvy is still using the AIO. I wondered because a lot of our team seem to have dropped the AIO and gone back to the service install. So then, likely the main difference is that Mr. Kevvy is using the older glibc 2.29 instead of the glibc 2.31 that we Ubuntu 20 users are running. | |
ID: 55959 | Rating: 0 | rate: / Reply Quote | |
I'm almost positive he's running a standalone install. | |
ID: 55961 | Rating: 0 | rate: / Reply Quote | |
I've got one running now on an RTX 2070S and the only real issue is low GPU utilization (60-70%). The current task is using ~2 GB of VRAM and ~3 GB of system RAM. I have one thread free on a Ryzen 3900X to support the GPU and that thread is running at 100%. This computer has completed 3 of the new python tasks successfully. BOINC runs as a service and was installed from the Mint repository (version 7.16.6). The CPU clock speed is 3.9 GHz and the RAM is DDR4 3200 CL16. I did free up another thread but I didn't see an obvious difference in GPU utilization. | |
ID: 55962 | Rating: 0 | rate: / Reply Quote | |
So then, likely the main difference is that Mr. Kevvy is using the older glibc 2.29 instead of the glibc 2.31 that we Ubuntu 20 users are running. A difference in what sense? You and I both have glibc 2.31 and we both have a bunch of successful completions. It looks like Kevvy's Ubuntu 20 systems also have 2.31. All of us with these Ubuntu 20.04 systems have successful completions. But of all of his older Linux Mint 19 (based on Ubuntu 18.04) systems, none have completed a single Python task successfully. I'm not sure if it's a problem with Linux Mint or what. I'm not sure it's necessarily anything to do with the GLIBC, since his error messages are varied and none mention GLIBC as being the cause. It could just be that the app has some bugs to work out when running in different environments. I also don't know if he's using service installs on his Mint systems; he's got a lot of different BOINC versions across all his systems. ____________ | |
ID: 55963 | Rating: 0 | rate: / Reply Quote | |
Boinc runs as a service and was installed from the Mint repository (version 17.16.6). The CPU clock speed is 3.9 GHz and the RAM is DDR4 3200 CL16. I did free up another thread but I didn't see an obvious difference in GPU utilization. thanks for the clarification. it was worth a shot on the GPU utilization with the free thread, low hanging fruit. I run my memory at 3600 CL14, but I've never seen memory matter that much even for CPU tasks on other projects, let alone GPU tasks. (I saw no difference when changing from 3200CL16 to 3600CL14), but anything's possible I guess. ____________ | |
ID: 55964 | Rating: 0 | rate: / Reply Quote | |
So, then likely the main difference is that Mr. Kevvy is using the older glibc 2.29 instead of the glibc 2.31 that we Ubuntu 20 users are running. Mint 20 is based on Ubuntu 20.04 and has glibc 2.31. The 2 computers I have running GPUGrid have Mint 20 installed and the RTX cards on those computers are completing the new python tasks successfully. | |
ID: 55965 | Rating: 0 | rate: / Reply Quote | |
Mint 20 is based on Ubuntu 20.04 and has glibc 2.31. The 2 computers I have running GPUGrid have Mint 20 installed and the RTX cards on those computers are completing the new python tasks successfully. Yes, I know. But my point was that there are many differences between Mint 19 and 20, not just the GLIBC version, and usually when GLIBC is an issue it shows up as the reason for the error in the task results, but that hasn't been the case. Conversely, we have several examples of tasks hitting Ubuntu 20.04 systems with GLIBC 2.31 that still fail. I think it's just buggy. ____________ | |
ID: 55966 | Rating: 0 | rate: / Reply Quote | |
Yes, I had over a half dozen failed tasks before the first successful task. | |
ID: 55969 | Rating: 0 | rate: / Reply Quote | |
FYI, these tasks don't checkpoint properly. | |
ID: 55996 | Rating: 0 | rate: / Reply Quote | |
My host switched to another project's task, then resumed, and after a while I had to update the system and restart. It did indeed fail to resume from the last state, so it looks like the checkpoint was far behind, or there was no checkpoint at all. The elapsed time stayed at around 2 hours, which was hours behind, and the estimated percentage was locked at 10% | |
ID: 55997 | Rating: 0 | rate: / Reply Quote | |
The Python app runs ACEMD, but uses additional libraries to compute additional force terms. These libraries are distributed as Conda (Python) packages. | |
ID: 56007 | Rating: 0 | rate: / Reply Quote | |
Two outstanding issues are over-crediting (I am using some default BOINC formula) and, as far as i understand, the flops estimate (?). Over-crediting? I am seeing the opposite problem. https://www.gpugrid.net/result.php?resultid=31902208 20.83 credits for 4.5 hours of run time on an RTX 2080 Ti. That is practically nothing. And this is not a one-off. All my tasks so far are similar. ____________ Reno, NV Team: SETI.USA | |
ID: 56008 | Rating: 0 | rate: / Reply Quote | |
Thanks for the details. The flops estimate Yes, the "size" of the tasks, as expressed by <rsc_fpops_est> in the workunit template. The current value is 3,000 GFLOPS: all other GPUGrid task types are 5,000,000 GFLOPS. An App which installs a self-contained Conda install We are encountering an unfortunate clash with the security of BOINC running as a systemd service under Linux. Useful bits of BOINC (pausing computation when the computer's user is active on the mouse or keyboard) rely on having access to the public /tmp/ folder structure. The conda installer wants to make use of a temporary folder. systemd allows us to have either public tmp folders (read only, for security), or private tmp folders (write access). But not both at the same time. We're exploring how to get the best of both worlds... Discussions in https://www.gpugrid.net/forum_thread.php?id=5204 https://github.com/BOINC/boinc/issues/4125 over-crediting We're enjoying it while it lasts! | |
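One idea we could explore for the /tmp clash (just a sketch of the approach, not something that has been implemented): point the installer at a scratch directory inside the slot, since Python's tempfile module and most installers honour the TMPDIR environment variable:

# Sketch only: give the conda/installer step a writable scratch directory inside
# the BOINC slot, so it never needs the system /tmp that the systemd sandboxing
# options (PrivateTmp and friends) are fighting over.
import os
import tempfile

scratch = os.path.join(os.getcwd(), "tmp")   # a BOINC task's CWD is its slot directory
os.makedirs(scratch, exist_ok=True)
os.environ["TMPDIR"] = scratch               # honoured by tempfile and most installers
print(tempfile.gettempdir())                 # now resolves inside the slot
# ...then launch the miniconda installer / conda as usual from this point.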
ID: 56009 | Rating: 0 | rate: / Reply Quote | |
Over-crediting? OK, make that 'inconsistent crediting'. Mine are all in the 600,000 - 900,000 range, for much the same runtime on a 1660 Ti. Host 508381 | |
ID: 56010 | Rating: 0 | rate: / Reply Quote | |
Over-crediting? The 20 credits thing seems to only happen with restarted tasks, from what I've seen; not sure if anything else triggers it. But I can say with certainty that the credit allocation is "questionable", and only appears to be related to the flops of device 0 in BOINC, as well as runtime. Slow devices masked behind a fast device 0 will earn credit at the rate of the faster device... ____________ | |
ID: 56011 | Rating: 0 | rate: / Reply Quote | |
Two outstanding issues are over-crediting (I am using some default BOINC formula) and, as far as i understand, the flops estimate (?). This happens when the task is interrupted - stopped and resumed. You can't interrupt these tasks at all. ____________ | |
ID: 56012 | Rating: 0 | rate: / Reply Quote | |
We should perhaps mention the lack of effective checkpointing while we have Toni's attention. Even though the tasks claim to checkpoint every 0.9% (after the initial 10% allowed for the setup), the apps are unable to resume from the point previously reached. | |
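For reference, this is roughly the shape of checkpointing we'd expect to see on the Python side - a generic sketch, not GPUGrid's actual code:

# Generic checkpoint/resume sketch: write progress atomically, and on startup
# resume from the last completed step instead of starting over at 0.
import json
import os

CKPT = "checkpoint.json"

def load_step():
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)["step"]
    return 0

def save_step(step):
    tmp = CKPT + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp, CKPT)          # atomic rename, so a crash can't leave a corrupt file

total_steps = 1000
for step in range(load_step(), total_steps):
    # ... one slice of the real work would go here ...
    if step % 9 == 0:              # roughly every 0.9% of the run
        save_step(step + 1)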
ID: 56013 | Rating: 0 | rate: / Reply Quote | |
Over-crediting? I am seeing the opposite problem. I'll check that out. But I have not suspended or otherwise interrupted any tasks. Unless BOINC is doing that without my knowledge. But I don't think so. ____________ Reno, NV Team: SETI.USA | |
ID: 56014 | Rating: 0 | rate: / Reply Quote | |
Over-crediting? I am seeing the opposite problem. You also appear to have your hosts set up to ONLY crunch these beta tasks. Is there a reason for that? Does your system process the normal tasks fine? Maybe it's something going on with your system as a whole. ____________ | |
ID: 56015 | Rating: 0 | rate: / Reply Quote | |
you also appear to have your hosts setup to ONLY crunch these beta tasks. is there a reason for that? I have reached my wuprop goals for the other apps. So I am interested in only this particular app (for now). does your system process the normal tasks fine? maybe it's something going on with your system as a whole. Yep, all the other apps run fine, both here and on other projects. ____________ Reno, NV Team: SETI.USA | |
ID: 56016 | Rating: 0 | rate: / Reply Quote | |
You also appear to have your hosts set up to ONLY crunch these beta tasks. Is there a reason for that? I have a theory, but I'm not sure if it's correct or not. Can you tell me the peak_flops value reported in your coproc_info.xml file for the 2080 Ti? Basically, you are using such an old version of BOINC (7.9.3) that it pre-dates the fixes implemented in 7.14.2 to properly calculate the peak flops of Turing cards. So I'm willing to bet that your version of BOINC is over-estimating your peak flops by a factor of 2. A 2080 Ti should read somewhere between 13.5 and 15 TFlops, and I'm guessing your old version of BOINC thinks it's closer to double that (25-30 TFlops). The second half of the theory is that there is some kind of hard limit (maybe an anti-cheat mechanism?) that prevents a credit reward somewhere around >2,000,000 - maybe 1.8 million, maybe 1.9 million? But I haven't observed ANYONE getting a task earning that much, and all tasks that would reach that level based on runtime seem to get this 20-credit value. That's my theory; I could be wrong. If you try a newer version of BOINC that properly measures the flops on a Turing card, and you start getting real credit, then it might hold water. ____________ | |
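To make the theory concrete, here's the shape of what I think is happening, with completely made-up constants (this is NOT the real server formula, just an illustration of the suspected behaviour):

# Invented numbers throughout - only the shape of the behaviour matters here.
def theorized_credit(peak_flops, runtime_s, cap=2_000_000):
    claimed = peak_flops * runtime_s / 1e9 * 0.005   # some flops-times-runtime award (made-up scaling)
    return claimed if claimed < cap else 20.83       # suspected hard limit collapses the award to ~20

turing_actual  = 14e12    # roughly what BOINC 7.14.2+ reports for a 2080 Ti
turing_doubled = 28e12    # roughly what pre-7.14.2 clients report for the same card
print(theorized_credit(turing_doubled, 16_000))      # blows past the cap -> 20.83
print(theorized_credit(turing_actual, 16_000))       # stays under the cap -> "normal" credit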
ID: 56017 | Rating: 0 | rate: / Reply Quote | |
Two outstanding issues are over-crediting (I am using some default BOINC formula) and, as far as i understand, the flops estimate (?). Toni, one more issue to add to the list. The download from the Anaconda website does not work for hosts behind a proxy. Can you please add a check for the proxy settings in the BOINC client so the external software can be downloaded through it? I have other hosts that are not behind a proxy and they download and run the Experimental tasks fine. Issue here: CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://conda.anaconda.org/conda-forge/linux-64/_libgcc_mutex-0.1-conda_forge.tar.bz2> This error repeats itself until it eventually gives up after 5 minutes and fails the task. It happens on 2 hosts sitting behind a web proxy (Squid) | |
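A possible workaround sketch in the meantime (not project code, and the proxy address below is a placeholder for your own Squid box): conda honours the standard proxy environment variables, so exporting them before the installer runs - or adding a proxy_servers entry to a .condarc - should let the download through:

# Sketch only: hand the proxy settings to conda via the standard environment
# variables before the conda/installer step runs.
import os

proxy = "http://192.168.1.10:3128"      # placeholder for your Squid proxy address
os.environ["HTTP_PROXY"] = proxy
os.environ["HTTPS_PROXY"] = proxy
# ...any conda download step launched after this point inherits the proxy.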
ID: 56018 | Rating: 0 | rate: / Reply Quote | |
A second, identical machine, except it has dual GTX 1660 Ti cards, finally got some work. The tasks reported and were awarded the large credits. So that rules out the question WRT BOINC version. FWIW, that version of BOINC is the latest available from the repository. | |
ID: 56019 | Rating: 0 | rate: / Reply Quote | |
A second, identical machine, except it has dual GTX 1660 Ti cards, finally got some work. The tasks reported and were awarded the large credits. So that rules out the question WRT BOINC version. FWIW, that version of BOINC is the latest available from the repository. It doesn't rule it out, because a 1660 Ti has a much lower flops value - like 5.5 TFlops. So with the old BOINC version it's estimating ~11 TFlops, and that's not high enough to trigger the issue. You're only seeing it on the 2080 Ti because it's a much higher performing card: ~14 TFlops by default, and the old BOINC version is scaling it all the way up to 28+ TFlops. This causes the calculated credit to be MUCH higher than that of the 1660 Ti, and hence triggers the 20-credit issue, according to my theory of course. But your 1660 Ti tasks are well below the 2,000,000 credit threshold that I'm estimating. The highest I've seen is ~1.7 million, so the line can't be much higher. I'm willing to bet that if one of your tasks on that 1660 Ti system runs for ~30,000-40,000 seconds, it gets hit with 20 credits. ¯\_(ツ)_/¯ You really should try to get your hands on a newer version of BOINC. I use a custom-compiled version of BOINC, and have usually used custom-compiled builds from newer versions of the source code. Maybe one of the other guys here can point you to a different repository that has a newer version of BOINC that can properly manage the Turing cards. ____________ | |
ID: 56020 | Rating: 0 | rate: / Reply Quote | |
I also verified that restarting ALONE won't necessarily trigger the 20-credit reward. | |
ID: 56021 | Rating: 0 | rate: / Reply Quote | |
A second, identical machine, except it has dual GTX 1660 Ti cards, finally got some work. The tasks reported and were awarded the large credits. So that rules out the question WRT BOINC version. FWIW, that version of BOINC is the latest available from the repository. I see you changed BOINC to 7.17.0. Another thing I noticed was that the change didn't take effect until new tasks were downloaded after the change, so tasks that were already there and tagged with the over-inflated flops value will probably still get 20 credits. Only the tasks newly downloaded after the change should work better. ____________ | |
ID: 56023 | Rating: 0 | rate: / Reply Quote | |
aaaand your 2080ti just completed a task and got credit with the new BOINC version. called it. | |
ID: 56027 | Rating: 0 | rate: / Reply Quote | |
I'm willing to bet that if one of your tasks on that 1660 Ti system runs for ~30,000-40,000 seconds, it gets hit with 20 credits. ¯\_(ツ)_/¯ Looks like just 25,000 s was enough to trigger it. http://www.gpugrid.net/result.php?resultid=31946707 It'll even out over time, since your other tasks are earning 2x as much credit as they should, because the old version of BOINC is doubling your peak_flops value. ____________ | |
ID: 56028 | Rating: 0 | rate: / Reply Quote | |
After upgrading all the BOINC clients, the tasks are erroring out. Ugh. | |
ID: 56030 | Rating: 0 | rate: / Reply Quote | |
they were working fine on your 2080ti system when you had 7.17.0. why change it? | |
ID: 56031 | Rating: 0 | rate: / Reply Quote | |
they were working fine on your 2080ti system when you had 7.17.0. why change it? That was a kludge. There is no such thing as 7.17.0. =;^) Once I verified that the newer version worked, I updated all my machines with the latest repository version, so it would be clean and updated going forward. ____________ Reno, NV Team: SETI.USA | |
ID: 56033 | Rating: 0 | rate: / Reply Quote | |
There is such a thing. It’s the development branch. All of my systems use a version of BOINC based on 7.17.0 :) | |
ID: 56036 | Rating: 0 | rate: / Reply Quote | |
Well sure. I meant a released version. | |
ID: 56037 | Rating: 0 | rate: / Reply Quote | |
So long start-to-end run times cause the 20-credit issue, not the fact that they were restarted. But tasks that are interrupted restart at 0, and thus end up with a longer start-to-end run time. | |
ID: 56046 | Rating: 0 | rate: / Reply Quote | |
Yup, I confirmed this. I manually restarted a task that didn't run very long and it didn't have the issue. | |
ID: 56049 | Rating: 0 | rate: / Reply Quote | |
Why is the number of tasks in progress dwindling? Are no new tasks being issued? | |
ID: 56148 | Rating: 0 | rate: / Reply Quote | |
Most of the Python tasks I've received in the last 3 days have been "_0", which indicates brand new, plus a few resends here and there. | |
ID: 56149 | Rating: 0 | rate: / Reply Quote | |
... I had the same error message except that mine was trying to go to /opt/boinc/.conda/environments.txt | |
ID: 56151 | Rating: 0 | rate: / Reply Quote | |
... I had the same error message except that mine was trying to go to... /opt/boinc/.conda/environments.txt Looks harmless, thanks for reporting. It's because the "boinc" user doesn't have a HOME directory I think. Gentoo put the home for boinc at /opt/boinc. I updated the user file to change it to /var/lib/boinc. | |
ID: 56152 | Rating: 0 | rate: / Reply Quote | |
I'm creating some experimental tasks for the Python app (made Beta). They are Linux and CUDA specific and serve in preparation for future batches. What type of card is the minimum for this app? My 980 Ti doesn't load any WUs. ____________ | |
ID: 56177 | Rating: 0 | rate: / Reply Quote | |
I'm creating some experimental tasks for the Python app (made Beta). They are Linux and CUDA specific and serve in preparation for future batches. In "GPUGRID Preferences", ensure you select "Python Runtime (beta)" and "Run test applications?" Your GPU, driver and OS should run these tasks fine | |
ID: 56181 | Rating: 0 | rate: / Reply Quote | |
I'm creating some experimental tasks for the Python app (made Beta). They are Linux and CUDA specific and serve in preparation for future batches. Thanks, I just forgot "Run test applications" :) ____________ | |
ID: 56182 | Rating: 0 | rate: / Reply Quote | |
All of these now seem to error out after the computation has finished. On several computers: <message> upload failure: <file_xfer_error> <file_name>2p95312000-RAIMIS_NNPMM-0-1-RND8920_1_0</file_name> <error_code>-131 (file size too big)</error_code> </file_xfer_error> </message> What causes this and how can it be fixed? | |
ID: 56183 | Rating: 0 | rate: / Reply Quote | |
What causes this and how it can be fixed? I've just posted instructions in the Anaconda Python 3 Environment v4.01 failures thread (Number Crunching). Read through the whole post. If you don't understand anything, or you don't know how to do any of the steps I've described - back away. Don't even attempt it until you're sure. You have to edit a very important, protected, file - and that needs care and experience. | |
ID: 56185 | Rating: 0 | rate: / Reply Quote | |
What causes this and how can it be fixed? It really needs to be fixed server side (or it would be nice if it were configurable via cc_config, but that doesn't look to be the case either). Stopping and starting the client is a recipe for instant errors, and where it is successful, the process needs to be repeated every time you download new tasks. Not really a viable option unless you want to babysit the system all day. ____________ | |
ID: 56186 | Rating: 0 | rate: / Reply Quote | |
Stopping and starting the client is a recipe for instant errors, and where it is successful, the process needs to be repeated every time you download new tasks. Not really a viable option unless you want to babysit the system all day. By itself, it's fairly safe - provided you know and understand the software on your own system well enough. But you do need to have that experience and knowledge, which is why I put the caveats in. I agree about having to re-do it for every new task, but I'd like to get my APR back up to something reasonable - and I'm happy to help nudge the admins one more step along the way to a fully-working, 'set and forget', application. | |
ID: 56187 | Rating: 0 | rate: / Reply Quote | |
They're working on something... | |
ID: 56189 | Rating: 0 | rate: / Reply Quote | |
What causes this and how can it be fixed? Exactly so. I don't know about others, but I have no time to sit and watch my hosts working. A host works for 10 hours to get the task done, and then everything turns out to be just a waste of time and energy because of this file size limitation. This is somewhat frustrating. | |
ID: 56208 | Rating: 0 | rate: / Reply Quote | |
Opt out of the Beta test programme if you don't want to encounter those problems. | |
ID: 56209 | Rating: 0 | rate: / Reply Quote | |
Opt out of the Beta test programme if you don't want to encounter those problems. Yes, I agree - something has changed. It looks like the last full-length (successful) computation on my hosts that produced a too-large output file was WU 26900019, ended 29 Dec 2020 | 15:00:52 UTC after 31,056 seconds of run time. | |
ID: 56210 | Rating: 0 | rate: / Reply Quote | |
I see some new Python tasks have gone out. however they seem to be erroring for everyone. Environment ____________ | |
ID: 56864 | Rating: 0 | rate: / Reply Quote | |
now seeing this:
and this: 09:57:32 (340085): wrapper (7.7.26016): starting ____________ | |
ID: 56865 | Rating: 0 | rate: / Reply Quote | |
Just had my first two successful completions. It doesn't look like they ran any GPU work though; the GPU was never loaded. They just unpacked the WU, ran the setup, then exited. Marked as complete with no error. They only ran for about 45 seconds. | |
ID: 56866 | Rating: 0 | rate: / Reply Quote | |
Just had my first two successful completions. It doesn't look like they ran any GPU work though; the GPU was never loaded. They just unpacked the WU, ran the setup, then exited. Marked as complete with no error. They only ran for about 45 seconds. Did you have to update conda for the two successful tasks? I received a few new WUs but they all errored. I will not have access to this computer until tomorrow. | |
ID: 56867 | Rating: 0 | rate: / Reply Quote | |
Just had my first two successful completions. It doesn't look like they ran any GPU work though; the GPU was never loaded. They just unpacked the WU, ran the setup, then exited. Marked as complete with no error. They only ran for about 45 seconds. I didn't make any changes to my system between the failed tasks and the successful tasks. AFAIK the project is sending conda packaged into these WUs, so it doesn't matter what you have installed; the WU contains everything you should need. It looks like testrun93+ ish are OK, but test runs in the 80s and lower all fail with some form of error like the errors I listed above. ____________ | |
ID: 56868 | Rating: 0 | rate: / Reply Quote | |
All of these Python WUs seem to fail. A pair of examples with different problems: | |
ID: 56874 | Rating: 0 | rate: / Reply Quote | |
Some succeed, but very few. Out of the 94 Python tasks I've received recently, only 4 succeeded. | |
ID: 56875 | Rating: 0 | rate: / Reply Quote | |
i see some new tasks going out. 11:06:39 (1387708): /usr/bin/flock exited; CPU time 281.233647 ____________ | |
ID: 56876 | Rating: 0 | rate: / Reply Quote | |
some succeed. but very few. out of the 94 python tasks i've received recently. only 4 of them succeeded. 65 received / 64 errored / 1 successful is my current balance | |
ID: 56878 | Rating: 0 | rate: / Reply Quote | |
204 failed with 5 succeeded. RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)` https://www.gpugrid.net/result.php?resultid=32584418 I read on the BOINC Discord today that MLC@Home is also testing PyTorch and it looks like it causes some issues. PyTorch uses SIGALRM internally, which seems to conflict with the libboinc API's usage of SIGALRM. I hope Toni gets this working soon; it looks to be a complex setup. | |
ID: 56879 | Rating: 0 | rate: / Reply Quote | |
Most of the Anaconda Python 3 tasks worked well today. Some changes have been made. | |
ID: 56880 | Rating: 0 | rate: / Reply Quote | |
It seems that these Python tasks are being used to train some kind of AI/Machine Learning model. | |
ID: 56881 | Rating: 0 | rate: / Reply Quote | |
Most of the Anaconda Python 3 tasks worked well today. Some changes have been made. Side note: you should set No New Tasks or remove GPUGRID from your RTX 30-series hosts. The applications here do not work with RTX 30-series Ampere cards and always produce errors. ____________ | |
ID: 56882 | Rating: 0 | rate: / Reply Quote | |
Looks like he only let one acemd3 task slip through to an Ampere card. | |
ID: 56883 | Rating: 0 | rate: / Reply Quote | |
Looks like he only let one acemd3 task slip through to an Ampere card. They care. They are still CUDA 10.0. And were compiled without the proper configurations for ampere. They will all still fail under an Ampere card. The Python tasks they’ve been pushing out recently never actually run any work on the GPU. They do a little bit of CPU processing and then complete or error. Even the few that succeed never touch the GPU. ____________ | |
ID: 56884 | Rating: 0 | rate: / Reply Quote | |
I see a bunch of Python tasks went out again. | |
ID: 56926 | Rating: 0 | rate: / Reply Quote | |
All junk for me. None have completed, though I'm pretty sure some did before for me. All of them ran around 525-530 seconds. Nice ETA of 646 days, so BOINC freaks out. | |
ID: 56927 | Rating: 0 | rate: / Reply Quote | |
I have 3 of these valid over the past couple of days. None of them used the GPU. Did they complete any work? | |
ID: 56934 | Rating: 0 | rate: / Reply Quote | |
This one worked for me after that same PC failed earlier in the week | |
ID: 56935 | Rating: 0 | rate: / Reply Quote | |
I have 3 of these valid over the past couple of days. None of them used the GPU. Did they complete any work? I agree. It's weird that these tasks are marked as being GPU tasks with CUDA 10.0, and make the GPU otherwise unavailable for other tasks in BOINC, yet they never touch the GPU. According to the stderr.txt, they seem to spend most of their time extracting and installing packages, then do "something" for a few seconds and complete. It's obvious that they are exploring some kind of machine learning approach based on the packages used (pytorch, tensorflow, etc.) and references to model training. Maybe they are still working out how to properly package the WUs so they have the right configuration for future real tasks. Would be cool to hear what they are actually trying to accomplish with these tasks. ____________ | |
ID: 56936 | Rating: 0 | rate: / Reply Quote | |
Would be cool to hear what they are actually trying to accomplish with these tasks. I guess you will never hear any details from them. As we know, the GPUGRID people are very taciturn on everything. | |
ID: 56938 | Rating: 0 | rate: / Reply Quote | |
Would be cool to hear what they are actually trying to accomplish with these tasks. In earlier times, when the GPUGRID project ran smoothly, they used to be better about returning some feedback to contributors. I guess there must be weighty reasons for the current lack of communication. | |
ID: 56939 | Rating: 0 | rate: / Reply Quote | |
For the time being we are perfecting the WU machinery so as to support ML packages + CUDA. All tasks are Linux beta for now. Thanks! | |
ID: 56940 | Rating: 0 | rate: / Reply Quote | |
Thank you for this pearl! | |
ID: 56941 | Rating: 0 | rate: / Reply Quote | |
For the time being we are perfecting the WU machinery so as to support ML packages + CUDA. All tasks are Linux beta for now. Thanks! Thanks, Toni. Can you explain why these tasks are not using the GPU at all? They only run on the CPU; GPU utilization stays at 0% ____________ | |
ID: 56942 | Rating: 0 | rate: / Reply Quote | |
I would like to know whether we are supposed to do the things requested in the output file. Things like updating the various packages that are called out. | |
ID: 56943 | Rating: 0 | rate: / Reply Quote | |
I would like to know whether we are supposed to do the things requested in the output file. Things like updating the various packages that are called out. I'm relatively sure these tasks are sandboxed. The packages being referenced (tensorflow) are part of the whole WU; they are installed by the extraction phase at the beginning of the WU. If you check your system, you will most likely find that you do not have tensorflow installed. The package updates need to happen on the project side before distribution to us. ____________ | |
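Quick way to check that for yourself - on most hosts this prints "not installed", because the WU ships its own copy inside its bundled conda environment:

# Check whether a system-wide tensorflow exists, without importing it.
import importlib.util

spec = importlib.util.find_spec("tensorflow")
print("system tensorflow:", "installed" if spec else "not installed")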
ID: 56944 | Rating: 0 | rate: / Reply Quote | |
I wonder if I should add the project to my Nvidia Nano. It has Tensorflow installed by default in the distro. | |
ID: 56945 | Rating: 0 | rate: / Reply Quote | |
I wonder if I should add the project to my Nvidia Nano. It has Tensorflow installed by default in the distro. you can try, but I don't think it'll run because of the ARM CPU. there's no app for that here. ____________ | |
ID: 56946 | Rating: 0 | rate: / Reply Quote | |
Ah, yes . . . . forgot about that small matter . . . . | |
ID: 56947 | Rating: 0 | rate: / Reply Quote | |
I would like to know whether we are supposed to do the things requested in the output file. Things like updating the various packages that are called out. Furthermore, I checked on my Gentoo system what it would take to install Tensorflow. The only version available to me required python 3.8. I didn't even bother to check it out because my system is using python 3.9 stable. Things may become easier with the app if they are able to upgrade to python 3.8. I don't know how this will work with python 3.7. Is it just Gentoo taking the 3.7 option away? emerge -v1p tensorflow
ID: 56949 | Rating: 0 | rate: / Reply Quote | |
My Ubuntu 20.04.2 LTS distro has Python 3.8.5 installed so should satisfy the tensorflow requirements. | |
ID: 56950 | Rating: 0 | rate: / Reply Quote | |
I was trying to catch some ACEMD tasks to test the oversized upload file report, but got a block of Pythons instead. gcc: fatal error: cannot execute ‘cc1plus’: execvp: No such file or directory compilation terminated. Machine is Linux Mint 20.1, freshly updated today (including BOINC v7.16.17, which is an auto-build test for the Mac release last week - not relevant here, but useful to keep an eye on to make sure they haven't broken anything else). I have a couple of spare tasks suspended - I'll look through the actual runtime packaging to see what they're trying to achieve. | |
ID: 56958 | Rating: 0 | rate: / Reply Quote | |
Richard, I haven't had any GCC errors with any of the Python tasks on my hosts. | |
ID: 56960 | Rating: 0 | rate: / Reply Quote | |
Interestingly, the task that failed to run gcc was re-issued to a computer on ServicEnginIC's account - and ran successfully. That gives me a completely different stderr_txt file to compare with mine. I'll make a permanent copy of both for reference, and try to work out what went wrong. | |
ID: 56962 | Rating: 0 | rate: / Reply Quote | |
Interestingly, the task that failed to run gcc was re-issued to a computer on ServicEnginIC's account - and ran successfully. That gives me a completely different stderr_txt file to compare with mine. I'll make a permanent copy of both for reference, and try to work out what went wrong. I remember that I applied to all my hosts a remedy kindly suggested by you in your message #55967. I related it in message #55986. Thank you again. | |
ID: 56963 | Rating: 0 | rate: / Reply Quote | |
Thanks for the kind words. Yes, that's necessary, but not sufficient. My host 132158 got a block of four tasks when I re-enabled work fetch this morning. [1008937] INTERNAL ERROR: cannot create temporary directory! 11:23:17 (1008908): /usr/bin/flock exited; CPU time 0.132604 - that same old problem. I stopped the machine, did a full update and restart, and verified that the new .service file had the fix for that bug. Then I fired off task ID _625 - that's the one with the cpp errors. Unfortunately, we only get the last 64 KB of the file, and it's not enough in this case - we can't see what stage it's reached. But since the first task only ran 3 seconds, and the second lasted for 190 seconds, I assume we fell at the second hurdle. My second Linux machine has just picked up two of the ADRIA tasks I was hunting for - I'll sort those out next. | |
ID: 56964 | Rating: 0 | rate: / Reply Quote | |
I was trying to catch some ACEMD tasks to test the oversized upload file report, but got a block of Pythons instead. I got a similar error some time ago. A memory module was faulty, and I started to get segmentation fault errors. Eventually my compiling environment (gcc etc.) became messed up. I solved the situation by removing the bad module and completely reinstalling the compiling environment. What I might suggest is to verify whether gcc/g++ are actually working by compiling something of your choice. | |
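Along those lines, a quick self-contained check (a rough sketch, not from the app itself) that writes a trivial C++ file and tries to build it, which is essentially what the task's compile step needs to be able to do:

# Verify that the g++/cc1plus toolchain actually works on this host.
import os
import shutil
import subprocess
import tempfile

def check_gxx():
    if shutil.which("g++") is None:
        return "g++ not found - install your distro's g++ package"
    workdir = tempfile.mkdtemp()
    src = os.path.join(workdir, "hello.cpp")
    with open(src, "w") as f:
        f.write('#include <iostream>\nint main() { std::cout << "ok\\n"; }\n')
    result = subprocess.run(["g++", src, "-o", os.path.join(workdir, "hello")],
                            capture_output=True, text=True)
    return "g++ works" if result.returncode == 0 else result.stderr.strip()

print(check_gxx())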
ID: 56975 | Rating: 0 | rate: / Reply Quote | |
Finally got time to google my gcc error. Simples: turns out the app requires g++, and it's not installed by default on Ubuntu - and, one assumes, derivatives like mine. sudo apt-get install g++ No restart needed, of either BOINC or Linux, and task 32623619 completed successfully. Not much sign of any checkpointing: one update at 10%, then nothing until the end. | |
ID: 56980 | Rating: 0 | rate: / Reply Quote | |
Finally got time to google my gcc error. Simples: turns out the app requires g++, and it's not installed by default on Ubuntu - and, one assumes, derivatives like mine. Hummm... I have run a few on Ubuntu 20.04.2 and did not do anything special, unless maybe something else I was working on required it. http://www.gpugrid.net/results.php?hostid=452287 | |
ID: 56981 | Rating: 0 | rate: / Reply Quote | |
Finally got time to google my gcc error. Simples: turns out the app requires g++, and it's not installed by default on Ubuntu - and, one assumes, derivatives like mine. I think this just might be your distribution. I never installed this (specifically) on my Ubuntu 20.04 systems. if it's there, it was there by default or through some other package as a dependency. ____________ | |
ID: 56982 | Rating: 0 | rate: / Reply Quote | |
This was a fairly recent (February 2021) clean installation of Linux Mint 20.1 'Ulyssa' - I decided to throw away my initial fumblings with Mint 19.1, and start afresh. So, let this be a warning: not every distro is as complete as you might expect. | |
ID: 56983 | Rating: 0 | rate: / Reply Quote | |
errors on the Python tasks again. | |
ID: 56999 | Rating: 0 | rate: / Reply Quote | |
errors on the Python tasks again. I see them too. http://www.gpugrid.net/results.php?hostid=452287 UnsatisfiableError: The following specifications were found to be incompatible with each other: That will give them something to work on. | |
ID: 57001 | Rating: 0 | rate: / Reply Quote | |
I'm now preferentially using my GPU for tasks related to medical research. Could you mention whether the Python tasks are related to medical research and whether they use the GPU? | |
ID: 57006 | Rating: 0 | rate: / Reply Quote | |
I'm now preferentially using my GPU for tasks related to medical research. Could you mention whether the Python tasks are related to medical research and whether they use the GPU? Right now these Python tasks are using machine learning to do what we assume is some kind of medical research, but the admins haven't given many specifics on exactly what type of research is being done. GPUGRID as a whole does various types of medical research. See more info about it in the other thread here: https://www.gpugrid.net/forum_thread.php?id=5233 The tasks are labelled as GPU tasks, and they do reserve the GPU in BOINC (i.e. other tasks won't run on it), however in reality the GPU is not actually used; it sits idle and all the computation happens on the CPU thread that's assigned to the job. The admins have stated that these early tasks are still in testing and that they will use the GPU in the future, but right now they don't. The other thing to keep in mind: the Python application is Linux-only (at least right now). You won't be able to get these tasks on your Windows system. ____________ | |
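If you want to see that for yourself while one of these tasks is running, watching nvidia-smi works, or a few lines of Python with the pynvml bindings (pip install nvidia-ml-py3; device index 0 assumed) will show the utilization sitting at 0%:

# Poll GPU utilization for about a minute while a Python beta task is "running".
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)     # adjust the index on multi-GPU hosts
for _ in range(12):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    print(f"GPU util {util.gpu}%  memory util {util.memory}%")
    time.sleep(5)
pynvml.nvmlShutdown()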
ID: 57007 | Rating: 0 | rate: / Reply Quote | |
Just finished a new Python task that didn't error out. Hope this is the start of a trend. | |
ID: 57036 | Rating: 0 | rate: / Reply Quote | |