Author |
Message |
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Hello everyone, just wanted to give some updates about the machine learning - python jobs that Toni mentioned earlier in the "Experimental Python tasks (beta)" thread.
What are we trying to accomplish?
We are trying to train populations of intelligent agents in a distributed computational setting to solve reinforcement learning problems. The idea is inspired by the fact that human societies are knowledgeable as a whole, while individual members have limited information. Every new generation of individuals attempts to expand and refine the knowledge inherited from previous ones, and the most interesting discoveries become part of a corpus of common knowledge. The idea is that small groups of agents will train on GPUGrid machines and report their discoveries and findings. Information from multiple agents can then be pooled and conveyed to new generations of machine learning agents. To the best of our knowledge this is the first time something of this sort has been attempted on a GPUGrid-like platform, and it has the potential to scale to solve problems unattainable in smaller-scale settings.
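Conceptually, the generational scheme could be sketched as follows. This is purely an illustrative toy, not the project's actual code: each "agent" here is a stand-in for a work unit trained on a volunteer machine, and a "discovery" is reduced to a single score.

```python
import random

def train_agent(seed, corpus):
    # Stand-in for one GPUGrid work unit: an agent starts from the shared
    # corpus of common knowledge and returns its best "discovery" (a score).
    rng = random.Random(seed)
    inherited = max(corpus) if corpus else 0.0
    return inherited + rng.random()  # refine the inherited knowledge a little

def run_generations(num_generations=3, population=4):
    corpus = []  # knowledge pooled across generations
    for gen in range(num_generations):
        # Each agent trains independently (on volunteer machines in the real setup)
        discoveries = [train_agent(gen * population + i, corpus)
                       for i in range(population)]
        # The most interesting discoveries join the common corpus
        corpus.append(max(discoveries))
    return corpus

corpus = run_generations()
```

Each generation starts from the best of the pooled knowledge, so the corpus improves monotonically in this toy version.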
Why were most jobs failing a few weeks ago?
It took us some time and testing to make simple agents work, but we managed to solve the problems over the previous weeks. Now, almost all agents train successfully.
Why are GPUs being underutilized, and what are CPUs used for?
In the previous weeks we were running small-scale tests, with small neural network models that occupied little GPU memory. Also, some reinforcement learning environments, especially simple ones like those used in the tests, run on the CPU. Our idea is to scale to more complex models and environments to exploit the GPU capacity of the grid.
More information:
We mainly use PyTorch to train our neural networks. We only use Tensorboard because it is convenient for logging; we might remove that dependency in the future.
____________
|
|
|
|
Highly anticipated and overdue. Needless to say, kudos to you and your team for pushing the frontier on the computational abilities of the client software. Looking forward to contributing in the future, hopefully with more than I have at hand right now.
A couple of questions though:
1. As the main ML technique used for training the individual agents is neural networks, I wonder about the specifics of the whole setup. What does the learning data set look like? What activation functions do you use? Any optimisation or regularisation used?
2. Is it mainly about getting this kind of framework to work and then testing for its accuracy? How did you determine the model's base parameters to get you started? How can you be sure that the initial model setup is getting you anywhere/is optimal? Or do you ultimately want to tune the final model and compare the accuracy of various reinforcement learning approaches?
3. Is there a way to gauge the future complexity of those prospective WUs at this stage? Similar runtimes to the current Bandit tasks?
4. What do you want to use the trained networks for? What are you trying to predict? Or, rephrased: what main use cases/fields of research are currently imagined for the final model?
What do you envision to be "problems [so far] unattainable in smaller scale settings"?
5. What is the ultimate goal of this ML project? To have only one latest-generation trained agent group at the end, as the result of the continuous reinforcement learning iterations? Or to have several and test/benchmark them against each other?
Thx! Keep up the great work! |
|
|
|
will you be utilizing the tensor cores present in the nvidia RTX cards? the tensor cores are designed for this kind of workload.
____________
|
|
|
phi1258 Send message
Joined: 30 Jul 16 Posts: 4 Credit: 1,555,158,536 RAC: 0 Level
Scientific publications
|
This is a welcome advance. Looking forward to contributing.
|
|
|
|
Thank you very much for this advance.
I understand that for this kind of "singular" research only limited general guidelines can be given, or there is a risk of them not being singular any more...
Best wishes. |
|
|
_heinzSend message
Joined: 20 Sep 13 Posts: 16 Credit: 3,433,447 RAC: 0 Level
Scientific publications
|
Wish you success.
regards _heinz
____________
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
Ian&Steve C. wrote on June 17th:
will you be utilizing the tensor cores present in the nvidia RTX cards? the tensor cores are designed for this kind of workload.
I am curious what the answer will be.
|
|
|
|
also, can the team comment on not just GPU "under"utilization? these tasks have NO GPU utilization.
when will you start releasing tasks that do more than just CPU calculation? are you aware that only CPU calculation is occurring and nothing happens on the GPU at all? I have never observed these new tasks use the GPU, ever, even the tasks that take ~1hr to crunch. it all happens on the single CPU thread allocated for the WU: 0% GPU utilization and no gpugrid processes reported in nvidia-smi.
____________
|
|
|
|
I understand this is basic research in ML. However, I wonder which problems it would be used for here. Personally I'm here for the bio-science. If the topic of the new ML research differs significantly and it seems to be successful based on first trials, I'd suggest setting it up as a separate project.
MrS
____________
Scanning for our furry friends since Jan 2002 |
|
|
|
This is why I asked what "problems" are currently envisioned to be tackled by the resulting model. But in my understanding this is a ML project specifically set up to be trained on biomedical data sets. Thus, I'd argue that the science being done is still bio-related nonetheless. Would highly appreciate feedback on the loads of great questions here in this thread so far. |
|
|
|
https://www.youtube.com/watch?v=yhJWAdZl-Ck |
|
|
mmonninSend message
Joined: 2 Jul 16 Posts: 337 Credit: 7,617,724,223 RAC: 11,001,670 Level
Scientific publications
|
I noticed some python tasks in my task history. All failed for me and failed so far for everyone else. Has anyone completed any?
Example:
https://www.gpugrid.net/workunit.php?wuid=27100605 |
|
|
|
Host 132158 is getting some. The first failed with:
File "/tmp/pip-install-kvyy94ud/atari-py_bc0e384c30f043aca5cad42554937b02/setup.py", line 28, in run
sys.stderr.write("Unable to execute '{}'. HINT: are you sure `make` is installed?\n".format(' '.join(cmd)))
NameError: name 'cmd' is not defined
----------------------------------------
ERROR: Failed building wheel for atari-py
ERROR: Command errored out with exit status 1:
command: /var/lib/boinc-client/slots/0/gpugridpy/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-kvyy94ud/atari-py_bc0e384c30f043aca5cad42554937b02/setup.py'"'"'; __file__='"'"'/tmp/pip-install-kvyy94ud/atari-py_bc0e384c30f043aca5cad42554937b02/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-k6sefcno/install-record.txt --single-version-externally-managed --compile --install-headers /var/lib/boinc-client/slots/0/gpugridpy/include/python3.8/atari-py
cwd: /tmp/pip-install-kvyy94ud/atari-py_bc0e384c30f043aca5cad42554937b02/
Looks like a typo. |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Shame the tasks are misconfigured. I ran through a dozen of them on a host, all with errors. With the scarcity of work, every little bit is appreciated and can be used.
We just got put back in good graces with a whitelist at Gridcoin too. |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
@abouh, could you check your configuration again? The tasks are failing during the build process with cmake. cmake normally isn't installed on Linux, and when it is, it is not necessarily in the PATH.
It probably needs to be exported into the userland environment. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Hello everyone, sorry for the late reply.
we detected the "cmake" error and found a way around it that does not require installing anything. Some jobs already finished successfully last Friday without reporting this error.
The error was related to atari-py, as some users reported. More specifically, to installing this Python package from GitHub (https://github.com/openai/atari-py), which allows using some Atari 2600 games as a test bench for reinforcement learning (RL) agents.
Sorry for the inconvenience. Even though the AI agent part of the code has been tested and works, every time we need to test our agents in a new environment we have to modify the environment initialisation part of the code to include the new environment, in this case atari-py.
I just sent another batch of 5 test jobs; 3 have already finished, and the others seem to be working without problems but have not yet finished.
http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730763
http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730759
http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730761
http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730760
http://gpugrid.net/PS3GRID_ops/db_action.php?table=result&id=32730762
____________
|
|
|
|
Multiple different failure modes among the four hosts that have failed (so far) to run workunit 27102466. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
The error reported in the job with result ID 32730901 is due to a conda environment error detected and solved during previous testing rounds.
It is the one that talks about a dependency called "pinocchio" and detects conflicts with it.
It seems the conda misconfiguration persisted on some machines. To solve this error it should be enough to click "reset" to reset the app.
____________
|
|
|
|
OK, I've reset both my Linux hosts. Fortunately I'm on a fast line for the replacement download... |
|
|
|
Task e1a15-ABOU_rnd_ppod_3-0-1-RND2976_3 was the first to run after the reset, but unfortunately it failed too.
Edit - so did e1a14-ABOU_rnd_ppod_3-0-1-RND3383_2, on the same machine.
This host also has 16 GB system RAM: GPU is GTX 1660 Ti. |
|
|
|
I reset the project on my host. still failed.
WU: http://gpugrid.net/workunit.php?wuid=27102456
I see that ServicEnginIC and I both had the same error. we also both have only 16GB system memory on our hosts.
Aurum previously reported very high system memory use, but didn't elaborate on whether it was real or virtual.
However, I can elaborate further to confirm that it's real.
https://i.imgur.com/XwAj4s3.png
a lot of it seems to stem from the ~4GB used by the python run.py process, and then +184MB for each of the 32 multiproc spawns that appear to be running. not sure if these are intended to run, or if they are an artifact of setup that never got cleaned up?
I'm not certain, but it's possible that the task ultimately failed due to lack of resources, with both RAM and swap maxed out. maybe the next system that gets it, a 64GB TR system, will succeed with it?
abouh, is it intended to keep this much system memory in use during these tasks? or is this just something leftover that was supposed to be cleaned up? It might be helpful to know the exact system requirements so people with unsupported hardware do not try to run these tasks. if these tasks are going to use so much memory and all of the CPU cores, we should be prepared for that ahead of time.
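for reference, a quick back-of-the-envelope estimate from the per-process numbers I observed above (my screenshot values, not official requirements):

```python
# Estimate total task RSS from the observed per-process figures (in GB).
main_process_gb = 4.0   # the python run.py main process (observed)
spawn_gb = 0.184        # each multiprocessing worker (observed ~184 MB)
num_spawns = 32         # number of spawned environment processes

total_gb = main_process_gb + num_spawns * spawn_gb
print(round(total_gb, 1))  # ≈ 9.9 GB before any growth during training
```

so even at a steady state that's ~10GB, which leaves very little headroom on a 16GB host running anything else.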
____________
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
I couldn't get your imgur image to load, just a spinner. |
|
|
|
Yeah I get a message that Imgur is over capacity (first time I’ve ever seen that). Their site must be having maintenance or getting hammered. It was working earlier. I guess just try again a little later.
____________
|
|
|
mmonninSend message
Joined: 2 Jul 16 Posts: 337 Credit: 7,617,724,223 RAC: 11,001,670 Level
Scientific publications
|
I've had two tasks complete on a host that was previously erroring out:
https://www.gpugrid.net/workunit.php?wuid=27102460
https://www.gpugrid.net/workunit.php?wuid=27101116
Between 12:45:58 UTC and 19:44:33 UTC a task failed and then completed w/o any changes, resets, or anything from me.
Wildly different runtime/credit ratios; I would expect something in between.
Run time (s) | Credit | Credit/sec
3,389.26 | 264,786.85 | 78/s
49,311.35 | 34,722.22 | 0.70/s
26,635.40 (CUDA) | 420,000.00 | 15.77/s |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Hello everyone,
The reset was only to solve the error reported in e1a12-ABOU_rnd_ppod_3-0-1-RND1575_0 and other jobs, related to a dependency called "pinocchio". I have checked the jobs reported to have errors after resetting, and it seems this error is not present in those jobs.
Regarding the memory usage, it is real, as you report. The ~4GB are from the main script containing the AI agent and the training process. The 32 multiproc spawns are intended: each one contains an instance of the environment the agent interacts with to learn. Some RL environments run on the GPU, but unfortunately the one we are working with at the moment does not. I get a total of 15GB locally when running 1 job, which could probably explain some job failures. Running all these environments in parallel is also more CPU intensive, as mentioned. The process to train the AI interleaves phases of data collection from interactions with the environment instances (CPU intensive) with phases of learning (GPU intensive).
I will test locally whether the AI agent still learns when interacting with fewer instances of the environment at the same time; that could help reduce the memory requirements of future jobs a bit. However, for now the most immediate jobs will have similar requirements.
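The interleaved training loop can be sketched roughly like this. This is an illustrative simplification only: the environments, step counts, and "learning" step are dummies, and in the real jobs each environment instance lives in its own worker process rather than a plain Python list.

```python
import random

class DummyEnv:
    # Stand-in for one reinforcement learning environment instance.
    def step(self, rng):
        return rng.random()  # one transition's worth of data

def train(num_envs=32, num_phases=4, steps_per_env=8, seed=0):
    rng = random.Random(seed)
    envs = [DummyEnv() for _ in range(num_envs)]  # 32 parallel envs in the real jobs
    samples_seen = 0
    for phase in range(num_phases):
        # Phase 1: data collection across all environments (CPU intensive)
        batch = [env.step(rng) for env in envs for _ in range(steps_per_env)]
        samples_seen += len(batch)
        # Phase 2: learning update on the collected batch (GPU intensive in production)
        loss = sum(batch) / len(batch)  # placeholder for a gradient step
    return samples_seen

total = train()
```

This also shows why GPU utilisation appears as short bursts: the GPU is only busy during the learning phase, while the collection phase keeps the CPU workers busy.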
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Yes, I was progressively testing how many steps the agents could be trained for, and I forgot to increase the credits in proportion to the training steps. I will correct that in the next batch. Sorry, and thanks for bringing it to our attention.
____________
|
|
|
PDWSend message
Joined: 7 Mar 14 Posts: 15 Credit: 4,908,594,525 RAC: 31,921,568 Level
Scientific publications
|
On mine, free memory (as reported in top) dropped from approximately 25,500 (when running an ACEMD task) to 7,000.
That I can manage.
However, the task also spawns a process for each thread (x) the machine has and then runs these; from 1 to x processes can be running at any one time. The value x is based on the machine's threads, not on what BOINC is configured for. In addition, BOINC has no idea they exist, and they should be taken into account for scheduling purposes. The result is that the machine can at times load the CPU up to twice as much as expected. This I can't manage unless I only run one of these tasks and the machine is doing nothing else, which isn't going to happen. |
|
|
|
thanks for the clarification.
I agree with PDW that running work on all CPU threads when BOINC expects at most 1 CPU thread to be used will be problematic for most users who run CPU work from other projects.
in my case, I did notice that each spawn used only a little CPU, but I'm not sure if this is the case for everyone. you could in theory tell BOINC how much CPU these are using by setting a value over 1 in app_config for the python tasks. for example, it looks like only ~10% of a thread was being used, so for my 32-thread CPU that would equate to about 4 threads' worth (rounding up from 3.2). so maybe something like
<app>
    <name>PythonGPU</name>
    <gpu_versions>
        <cpu_usage>4</cpu_usage>
        <gpu_usage>1</gpu_usage>
    </gpu_versions>
</app>
you'd have to pick a cpu_usage value appropriate for your CPU use, and test to see if it works as desired.
____________
|
|
|
|
I agree with PDW that running work on all CPU threads when BOINC expects at most 1 CPU thread to be used will be problematic for most users who run CPU work from other projects.
The normal way of handling that is to use the [MT] (multi-threaded) plan class mechanism in BOINC - these trial apps are being issued using the same [cuda1121] plan class as the current ACEMD production work.
Having said that, it might be quite tricky to devise a combined [CUDA + MT] plan class. BOINC code usually expects a simple-minded either/or solution, not a combination. And I don't really like the standard MT implementation, which defaults to using every possible CPU core in the volunteer's computer. Not polite.
MT can be tamed by using an app_config.xml or app_info.xml file, but you may need to tweak both <cpu_usage> (for BOINC scheduling purposes) and something like a command line parameter to control the spawning behaviour of the app. |
|
|
|
given the current state of these beta tasks, I have done the following on my 7xGPU 48-thread system: allowed only 3 Python Beta tasks to run, since the systems only have 64GB RAM and each process uses ~20GB.
app_config.xml
<app_config>
    <app>
        <name>acemd3</name>
        <gpu_versions>
            <cpu_usage>1.0</cpu_usage>
            <gpu_usage>1.0</gpu_usage>
        </gpu_versions>
    </app>
    <app>
        <name>PythonGPU</name>
        <gpu_versions>
            <cpu_usage>5.0</cpu_usage>
            <gpu_usage>1.0</gpu_usage>
        </gpu_versions>
        <max_concurrent>3</max_concurrent>
    </app>
</app_config>
will see how it works out when more python beta tasks flow, and adjust as the project adjusts settings.
abouh, before you start releasing more beta tasks, could you give us a heads up on what we should expect and/or what you changed about them?
____________
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
I finished up a python gpu task last night on one host and saw it spawned a ton of processes that used up 17GB of system memory. I have 32GB minimum in all my hosts and it was not a problem. |
|
|
|
I finished up a python gpu task last night on one host and saw it spawned a ton of processes that used up 17GB of system memory. I have 32GB minimum in all my hosts and it was not a problem.
Good to know Keith.
Did you by chance get a look at GPU utilization? Or CPU thread utilization of the spawns?
____________
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
I finished up a python gpu task last night on one host and saw it spawned a ton of processes that used up 17GB of system memory. I have 32GB minimum in all my hosts and it was not a problem.
Good to know Keith.
Did you by chance get a look at GPU utilization? Or CPU thread utilization of the spawns?
Gpu utilization was at 3%. Each spawn used up about 170MB of memory and fluctuated around 13-17% cpu utilization.
|
|
|
|
good to know. so what I experienced was pretty similar.
I'm sure you also had some other CPU tasks running too. I wonder if CPU utilization of the spawns would be higher if no other CPU tasks were running.
____________
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Yes primarily Universe and a few TN-Grid tasks were running also. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
I will send some more tasks later today with similar requirements to the last ones, with 32 reinforcement learning environments running in parallel for the agent to interact with.
For one job, locally I get around 15GB of system memory use, and 13%-17% utilisation on each cpu, as mentioned. For the GPU, the usage fluctuates between low use (5%-10%) during the phases in which the agent collects data from the environments, and short high-utilisation peaks of a few seconds when the agent uses the data to learn (I get between 50% and 80%).
I will try to train the agents for a bit longer than in the last tasks. I have already corrected the credits of the tasks, in proportion to the number of interactions between the agent and the environments occurring in the tasks.
____________
|
|
|
|
I got 3 of them just now. all failed with tracebacks after several minutes of run time. seems like there are still some coding bugs in the application. all wingmen are failing similarly:
https://gpugrid.net/workunit.php?wuid=27102526
https://gpugrid.net/workunit.php?wuid=27102527
https://gpugrid.net/workunit.php?wuid=27102525
GPU (2080 Ti) was loaded at ~10-13% utilization, but at base clocks (1350MHz) and only ~65W power draw. GPU memory use was 2-4GB. system memory reached ~25GB while 2 tasks were running at the same time. CPU thread utilization was ~25-30% across all 48 threads (EPYC 7402P); it didn't cap at 32 threads, about twice as much CPU utilization as expected, but maybe that's due to the relatively low clock speed @ 3.35GHz. (I paused other CPU processing during this time.)
____________
|
|
|
|
the new one I just got seems to be doing better: less CPU use, and it looks like I'm seeing the mentioned 60-80% spikes on the GPU occasionally.
this one succeeded on the same host as the above three.
https://gpugrid.net/workunit.php?wuid=27102535
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
I normally test the jobs locally first, and then run a couple of small batches of tasks on GPUGrid in case some error occurs that did not appear locally. The first small batch failed, so I could fix the error in the second one. Now that the second batch has succeeded, I will send a bigger batch of tasks.
____________
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
I must be crunching one of the fixed second batch currently on this daily driver. Seems to be progressing nicely.
It is using about 17GB of system memory, and the gpu utilization spikes up to 97% every once in a while, with periods mostly spent around 12-17% and some brief spikes around 42%.
I got one of the first batch on another host; it failed fast with similar errors, along with all the wingmen. |
|
|
|
these new ones must be pretty long.
been running almost 2 hours now. and a lot higher VRAM use: over 6GB per task. will GPUs with less than 6GB have issues?
but it also seems that some of the system memory used can be shared: running 1 task shows ~17GB system mem use, but running 5 tasks shows about 53GB. that's as far as I'll push it on my 64GB machines.
____________
|
|
|
kksplaceSend message
Joined: 4 Mar 18 Posts: 53 Credit: 2,591,271,749 RAC: 6,720,230 Level
Scientific publications
|
I got the first of the Python WUs for me, and am a little concerned. After 3.25 hours it is only 10% complete. GPU usage seems to be about what you all are saying, and the same with CPU. However, I only have 8 cores/16 threads, with 6 other CPU work units running (TN-Grid and Rosetta 4.2). Should I be limiting the other work to let these run? (16 GB RAM). |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
I don't think BOINC knows how to interpret the estimated run_times of these Python tasks. I wouldn't worry about it.
I am over 6 1/2 hours now on this daily driver with 10% still showing. I bet they never show anything BUT 10% done until they finish. |
|
|
|
I had the same feeling, Keith
____________
|
|
|
|
also, those of us running these should probably prepare for VERY low credit reward.
This is something I have observed for a long time with beta tasks here: there seems to be some kind of anti-cheat mechanism (or bug) built into BOINC when using the default credit reward scheme (based on flops). if the calculated credit reward is over some value, the reward gets defaulted to some very low value. since these are so long running, and beta, I fully expect to see this happen. I've reported this behavior in the past.
would be a nice surprise if not, but I have a strong feeling it'll happen.
____________
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
I got one task early on that rewarded more than reasonable credit.
But the last one was way low, though I thought I read a post from @abouh saying that he had made a mistake in the credit award algorithm and had corrected for it.
https://www.gpugrid.net/forum_thread.php?id=5233&nowrap=true#58124 |
|
|
|
That task was short though. The threshold is around 2 million credits if I remember correctly.
I posted about it in the team forum almost exactly a year ago. I don't want to post some details publicly because it could encourage cheating. But for a long time, credit reward of the beta tasks has been inconsistent and not calculated fairly, IMO. Because the credit reward was so high, I noticed a trend: whenever the credit reward should have been high enough (extrapolating the runtime with expected reward), it triggered a very low value instead. This only happened on long running (and hence potentially high reward) tasks. Since these tasks are so long, I just think there's a possibility we'll see that again.
____________
|
|
|
|
confirmed.
Keith, you just reported this one:
http://www.gpugrid.net/result.php?resultid=32731284
that value of 34,722.22 is the exact same "penalty value" I noticed a year ago, for 11hrs worth of work (clock time) and 28hrs of "cpu time". interesting that the multithreaded nature of these tasks inflates the run time so much.
extrapolating from your successful run that did not hit a penalty, I'd guess that any task longer than about 2.5hrs is going to hit the penalty value for these tasks. they really should just use the same credit scheme as acemd3, or assign static credit scaled to the expected runtime, as long as all of the tasks are about the same size.
BOINC documentation confirms my suspicions on what's happening.
https://boinc.berkeley.edu/trac/wiki/CreditNew
Peak FLOP Count
This system uses the Peak-FLOPS-based approach, but addresses its problems in a new way.
When a job J is issued to a host, the scheduler computes peak_flops(J) based on the resources used by the job and their peak speeds.
When a client finishes a job and reports its elapsed time T, we define peak_flop_count(J), or PFC(J) as
PFC(J) = T * peak_flops(J)
The credit for a job J is typically proportional to PFC(J), but is limited and normalized in various ways.
Notes:
PFC(J) is not reliable; cheaters can falsify elapsed time or device attributes.
We use elapsed time instead of actual device time (e.g., CPU time). If a job uses a resource inefficiently (e.g., a CPU job that does lots of disk I/O) PFC() won't reflect this. That's OK. The key thing is that BOINC allocated the device to the job, whether or not the job used it efficiently.
peak_flops(J) may not be accurate; e.g., a GPU job may take more or less CPU than the scheduler thinks it will. Eventually we may switch to a scheme where the client dynamically determines the CPU usage. For now, though, we'll just use the scheduler's estimate.
One-time cheats
For example, claiming a PFC of 1e304.
This is handled by the sanity check mechanism, which grants a default amount of credit and treats the host with suspicion for a while.
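to make the failure mode concrete, here's a toy model of the scheme quoted above. the ~2 million threshold and the 34,722.22 fallback are the values observed in this thread, and the cobblestone scale (200 credits per GFLOPS-day) is the standard BOINC definition; the real CreditNew code applies more normalisations than this.

```python
def peak_flop_credit(elapsed_seconds, peak_flops,
                     sanity_limit=2_000_000, default_credit=34_722.22):
    # PFC(J) = T * peak_flops(J); credit is proportional to PFC.
    # Cobblestone scale: 200 credits per day for a 1 GFLOPS device.
    cobblestone_scale = 200 / 86400 / 1e9
    claimed = elapsed_seconds * peak_flops * cobblestone_scale
    # One-time-cheat sanity check: an implausibly large claim falls back
    # to a small default grant (the "penalty value" seen in this thread).
    if claimed > sanity_limit:
        return default_credit
    return claimed

short = peak_flop_credit(3_600, 14e12)       # 1 h on a ~14 TFLOPS GPU: passes
long_ = peak_flop_credit(40 * 3_600, 14e12)  # 40 h: trips the sanity check
```

so a long-running multithreaded task, whose elapsed time is inflated by all the spawned processes, claims a huge PFC and gets the tiny default grant instead.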
____________
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Yep, I saw that. Same credit as before, and now I remember this bit of code being brought up back in the old Seti days.
@abouh needs to be made aware of this and assign fixed credit, as they do with acemd3.
|
|
|
Aurum Send message
Joined: 12 Jul 17 Posts: 401 Credit: 16,755,010,632 RAC: 220,113 Level
Scientific publications
|
Awoke to find 4 PythonGPU WUs running on 3 computers. All had OPN & TN-Grid WUs running with CPU use flat-lined at 100%. Suspended all other CPU WUs to see what PG was using and got a band mostly contained in the range 20 to 40%. Then I tried a couple of scenarios.
1. Rig-44 has an i9-9980XE 18c36t 32 GB with 16 GB swap file, SSD, and 2 x 2080 Ti's. The GPU use is so low I switched GPU usage to 0.5 for both OPNG and PG and reread config files. OPNG WUs started running and have all been reported fine. PG WUs kept running. Then I started adding back in gene_pcim WUs. When I exceeded 4 gene_pcim WUs the CPU use bands changed shape in a similar way to Rig-24 with a tight band around 30% and a number of curves bouncing off 100%.
2. Rig-26 has an E5-2699 22c44t 32 GB with 16 GB swap (yet to be used), SSD, and a 2080 Ti. I've added back 24 gene_pcim WUs and the CPU use band has moved up to 40-80% with no peaks hitting 100%. Next I changed GPU usage to 0.5 for both OPNG and PG and reread config files. Both seem to be running fine.
3. Rig-24 has an i7-6980X 10c20t 32 GB with a 16 GB swap file, SSD, and a 2080 Ti. This one has been running for 17 hours so far with the last 2 hours having all other CPU work suspended. Its CPU usage graph looks different. There's a tight band oscillating about 20% with a single band oscillating from 60 to 90%. Since PG wants 32 CPUs and this CPU only has 20 there's a constant queue for hyperthreading to feed in. I'll let this one run by itself hoping it finishes soon.
Note: TN-Grid usually runs great in Resource Zero Mode where it rarely ever sends more than one extra WU. With PG running and app_config reducing the max running WUs TN-Grid just keeps sending more WUs. Up to 280 now. |
|
|
|
I did something similar with my two 7xGPU systems.
limited to 5 tasks concurrently.
and set the app_config files up such that each GPU would run either 3x Einstein, OR 1x Einstein + 1x GPUGRID, since the resources used by both are complementary.
set GPUGRID to 0.6 GPU usage (prevents two from running on the same GPU: 0.6 + 0.6 > 1.0)
set Einstein to 0.33 GPU usage (allows three to run on a single GPU, or one GPUGRID + one Einstein: 0.33 + 0.33 + 0.33 < 1.0, and 0.6 + 0.33 < 1.0)
but running 5 tasks on a system with 64GB system memory was too ambitious; ram use was initially OK, but grew to fill system ram and swap (default 2GB).
if these tasks become more common and plentiful, I might consider upgrading these 7xGPU systems to 128GB RAM so that they can handle running on all GPUs at the same time, but I'm not going to bother if the project reduces the system requirements or these pop up very infrequently.
the low credit reward per unit time, due to the BOINC credit fail-safe default value, should be fixed though. not many people will have much incentive to test the beta tasks at 10-20x less credit per unit time.
oh, and these don't checkpoint properly (they checkpoint once, very early on). if you pause a task that's been running for 20hrs, it restarts from that first checkpoint from 20hrs ago.
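for anyone wanting to copy the GPU-sharing part of this, the GPUGrid side could look roughly like the fragment below (a sketch only; app_config.xml is per-project, so the matching 0.33 Einstein fractions go in a separate app_config.xml in the Einstein project directory, using whatever app name your client reports there):

```xml
<app_config>
    <app>
        <name>PythonGPU</name>
        <gpu_versions>
            <!-- 0.6 + 0.6 > 1.0, so at most one GPUGrid task per GPU -->
            <cpu_usage>1.0</cpu_usage>
            <gpu_usage>0.6</gpu_usage>
        </gpu_versions>
    </app>
</app_config>
```

the cpu_usage value here is a placeholder; pick one that matches what the spawned processes actually use on your CPU, as discussed earlier in the thread.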
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Hello everyone,
The batch I sent on Friday was successfully completed, even if some jobs failed several times initially and got reassigned.
I went through all failed jobs. Here I summarise some errors I have seen:
1. Detected multiple CUDA out of memory errors. Locally the jobs use 6 GB of GPU memory. It seems difficult to lower the GPU memory requirements for now, so jobs running on GPUs with less memory are expected to fail.
2. Conda environment conflicts with package pinocchio. This one I talked about in a previous post. It requires resetting the app.
3. 'INTERNAL ERROR: cannot create temporary directory!' - I understand this one could be due to a full disk.
Also, based on the feedback I will work on fixing the following things before the next batch:
1. Checkpoints will be created more often during training. So jobs can be restarted and won’t go back to the beginning.
2. Credits assigned. The idea is to progressively increase the credits until the credit return becomes similar to that of the acemd jobs. However, devising a general formula to calculate them is more complex in this case. For now it is based on the total amount of data samples gathered from the environments and used to train the AI agent, but that does not take into account the size of the agent's neural networks. For now we will keep them fixed, but it might be necessary to adjust them to solve other problems.
Finally, I think I was a bit too ambitious regarding the total amount of training per job. I will break jobs down into two, so they don't take that long to complete.
____________
|
|
|
|
thanks!
I did notice that all of mine failed with exceeded time limit.
It might be a good idea to increase the estimated flops size of these tasks so BOINC knows that they are large and will run for a long time.
____________
|
|
|
|
1. Detected multiple CUDA out of memory errors. Locally the jobs use 6 GB of GPU memory. It seems difficult to lower the GPU memory requirements for now, so jobs running on GPUs with less memory are expected to fail.
I've tried to set preferences at all my hosts with less than 6 GB of GPU RAM to not receive the Python Runtime (GPU, beta) app:
Run only the selected applications:
ACEMD3: yes
Quantum Chemistry (CPU): yes
Quantum Chemistry (CPU, beta): yes
Python Runtime (CPU, beta): yes
Python Runtime (GPU, beta): no
If no work for selected applications is available, accept work from other applications?: no
But I've still received one more Python GPU task at one of them.
This makes me doubt whether GPUGRID preferences are currently working as intended...
Task e1a1-ABOU_rnd_ppod_8-0-1-RND5560_0
RuntimeError: CUDA out of memory. |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
This makes me doubt whether GPUGRID preferences are currently working as intended...
my question is a different one: as long as the GPUGRID team now concentrates on Python, no more ACEMD tasks will come?
|
|
|
PDWSend message
Joined: 7 Mar 14 Posts: 15 Credit: 4,908,594,525 RAC: 31,921,568 Level
Scientific publications
|
But I've still received one more Python GPU task at one of them.
This makes me doubt whether GPUGRID preferences are currently working as intended...
I had the same problem, you need to set the 'Run test applications' to No
It looks like having that set to Yes will override any specific application setting you set. |
|
|
|
Thanks, I'll try |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
This makes me doubt whether GPUGRID preferences are currently working as intended...
my question is a different one: as long as the GPUGRID team now concentrates on Python, no more ACEMD tasks will come?
Hard to say. Toni and Gianni both stated the work would be very limited and infrequent until they can fill the new PhD positions.
But there have been occasional "drive-by" drops of cryptic scout work I've noticed along with the occasional standard research acemd3 resend.
Sounds like @abouh is getting ready to drop a larger debugged batch of Python on GPU tasks. |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
Sounds like @abouh is getting ready to drop a larger debugged batch of Python on GPU tasks.
Would be great if they work on Windows, too :-)
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Today I will send a couple of batches with short tasks for some final debugging of the scripts and then later I will send a big batch of debugged tasks.
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
The idea is to make it work for Windows in the future as well, once it works smoothly on linux.
____________
|
|
|
|
Thanks, looks like they are small enough to fit on a 16GB system now. using about 12GB.
____________
|
|
|
|
Thanks, looks like they are small enough to fit on a 16GB system now. using about 12GB.
not sure what happened to it. take a look.
https://gpugrid.net/result.php?resultid=32731651
____________
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Looks like a needed package was not retrieved properly with a "deadline exceeded" error. |
|
|
|
Looks like a needed package was not retrieved properly with a "deadline exceeded" error.
It's interesting, looking at the stderr output: it appears that this app is communicating over the internet to send and receive data outside of BOINC, and to servers that do not belong to the project.
(i think the issue is that I was connected to my VPN checking something else and I left the connection active and it might have had an issue reaching the site it was trying to access)
not sure how kosher that is. I think BOINC devs don't intend/desire this kind of behavior, and some people might have security concerns about the app doing these things outside of BOINC. It might be a little smoother to do all communication only between the host and the project, and only via the BOINC framework. If data needs to be uploaded elsewhere, it might be better for the project to do that on the backend.
just my .02
____________
|
|
|
Aurum Send message
Joined: 12 Jul 17 Posts: 401 Credit: 16,755,010,632 RAC: 220,113 Level
Scientific publications
|
1. Detected multiple CUDA out of memory errors. Locally the jobs use 6 GB of GPU memory. It seems difficult to lower the GPU memory requirements for now, so jobs running on GPUs with less memory are expected to fail.
I'm getting CUDA out of memory failures and all my cards have 10 to 12 GB of GDDR: 1080 Ti, 2080 Ti, 3080 Ti and 3080. There must be something else going on.
I've also stopped trying to time-slice with PythonGPU. It should have a dedicated GPU and I'm leaving 32 CPU threads open for it.
I keep looking for Pinocchio but have yet to see him. Where does it come from? Maybe I never got it. |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
The idea is to make it work for Windows in the future as well, once it works smoothly on linux.
okay, sounds good; thanks for the information
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
I'm running one of the new batch and at first the task was only using 2.2 GB of GPU memory, but now it has clocked back up to 6.6 GB of GPU memory.
Much as the previous ones. I thought the memory requirements were going to be cut in half.
Consuming the same amount of system memory as before . . . maybe a couple of GB more in fact. Up to 20GB now. |
|
|
Aurum Send message
Joined: 12 Jul 17 Posts: 401 Credit: 16,755,010,632 RAC: 220,113 Level
Scientific publications
|
Just had one that's listed as "aborted by user." I didn't abort it.
https://www.gpugrid.net/result.php?resultid=32731704
It also says "Please update your install command." I've kept my computer updated. Is this something I need to do?
What's this? Something I need to do or not?
"FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`"
|
|
|
mmonninSend message
Joined: 2 Jul 16 Posts: 337 Credit: 7,617,724,223 RAC: 11,001,670 Level
Scientific publications
|
RuntimeError: CUDA out of memory. Tried to allocate 112.00 MiB (GPU 0; 11.77 GiB total capacity; 3.05 GiB already allocated; 50.00 MiB free; 3.21 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
That error on 4 tasks right around 55 minutes on 3080Ti
The same PC/GPU has completed Python tasks before, one earlier that ran for 1900 seconds, and is running one now for 9 hr. Util is around 2-3% and 6.5 GB memory in nvidia-smi, 6.1 GB in BOINC.
3070Ti has been running for 7:45 with 8% Util and same memory usage. |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
The ray errors are normal and can be ignored.
I completed one of the new tasks successfully. The one I commented on before.
14 hours of compute time.
I had another one that completed successfully but the stderr.txt was truncated and does not show the normal summary and boinc finish statements. Feels similar to the truncation that Einstein stderr.txt outputs have. |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
1. Detected multiple CUDA out of memory errors. Locally the jobs use 6 GB of GPU memory. It seems difficult to lower the GPU memory requirements for now, so jobs running on GPUs with less memory are expected to fail.
I'm getting CUDA out of memory failures and all my cards have 10 to 12 GB of GDDR: 1080 Ti, 2080 Ti, 3080 Ti and 3080. There must be something else going on.
I've also stopped trying to time-slice with PythonGPU. It should have a dedicated GPU and I'm leaving 32 CPU threads open for it.
I keep looking for Pinocchio but have yet to see him. Where does it come from? Maybe I never got it.
I'm not doing anything at all in mitigation for the Python on GPU tasks other than to only run one at a time. I've been successful in almost all cases other than the very first trial ones in each evolution. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
What was halved was the amount of Agent training per task, and therefore the total amount of time required to complete it.
The GPU memory and system memory will remain the same in the next batches.
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
During the task, the performance of the Agent is intermittently sent to https://wandb.ai/ to track how the agent is doing in the environment as training progresses. It immensely helps to understand the behaviour of the agent and facilitates research, as it allows visualising the information in a structured way.
wandb has a python package extensively used in machine learning research, which we import in our scripts for this purpose.
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Pinocchio probably only caused problems in a subset of hosts, as it was due to one of the first test batches having a wrong conda environment requirements file. It was a small batch.
____________
|
|
|
|
My machines are probably just above the minimum spec for the current batches - 16 GB RAM, and 6 GB video RAM on a GTX 1660.
They've both completed and validated their first task, in around 10.5 / 11 hours.
But there's something odd about the result display in the task listing on this website - both the Run time and CPU time columns show the exact same value, and it's too large to be feasible: task 32731629, for example, shows 926 minutes of run time, but only 626 minutes between issue and return.
Tasks currently running locally show CPU time so far about 50% above elapsed time, which is to be expected from the description of how these tasks are designed to run. I suspect that something is triggering an anti-cheat mechanism: a task specified to use a single CPU core couldn't possibly use the CPU for longer than the run time, could it? But if so, it seems odd to 'correct' the elapsed time rather than the CPU time.
I'll take a look at the sched_request file after the next one reports, to see if the 'correction' is being applied locally by the BOINC client, or on the server. |
|
|
mmonninSend message
Joined: 2 Jul 16 Posts: 337 Credit: 7,617,724,223 RAC: 11,001,670 Level
Scientific publications
|
What was halved was the amount of Agent training per task, and therefore the total amount of time required to complete it.
The GPU memory and system memory will remain the same in the next batches.
Halved? I've got one at nearly 21.5 hours on a 3080Ti and still going
|
|
|
|
This shows the timing discrepancy, a few minutes before task 32731655 completed.
The two valid tasks on host 508381 ran in sequence on the same GPU: there's no way they could have both finished within 24 hours if the displayed elapsed time was accurate. |
|
|
|
I still think the 5,000,000 GFLOPs count is far too low, since these run for 12-24 hrs depending on host (GPU speed does not seem to be a factor, since GPU utilization is so low; most likely CPU/memory bound), and there seems to be a big discrepancy in run time per task. I had a task run for 9 hrs on my 3080Ti, while another user claims 21+ hrs on his 3080Ti, and I've had several tasks get killed around 12 hrs for exceeding the time limit, while others ran for longer. Lots of inconsistencies here.
The low flops count is causing a lot of tasks to prematurely get killed by BOINC for exceeding the time limit when they would have completed eventually. The fact that they do not proceed past 10% completion until the end probably doesn't help.
____________
|
|
|
|
Because this project still uses DCF, the 'exceeded time limit' problem should go away as soon as you can get a single task to complete. Both my machines with finished tasks are now showing realistic estimates, but with DCFs of 5+ and 10+ - I agree, the FLOPs estimate should be increased by that sort of multiplier to keep estimates balanced against other researchers' work for the project.
The screen shot also shows how the 'remaining time' estimate gets screwed up when the running value reaches something like 10 hours at 10%. Roll on intermediate progress reports and checkpoints. |
|
|
|
my system that completed a few tasks had a DCF of 36+
checkpointing also still isn't working. I had some tasks running for ~3hrs. restarted boinc and they restarted at 5mins.
____________
|
|
|
|
checkpointing also still isn't working.
See my screenshot.
"CPU time since checkpoint: 16:24:44" |
|
|
|
I've checked a sched_request when reporting.
<result>
<name>e1a26-ABOU_rnd_ppod_11-0-1-RND6936_0</name>
<final_cpu_time>55983.300000</final_cpu_time>
<final_elapsed_time>36202.136027</final_elapsed_time>
That's task 32731632. So it's the server applying the 'sanity(?) check' "elapsed time not less than CPU time". That's right for a single core GPU task, but not right for a task with multithreaded CPU elements. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
As mentioned by Ian&Steve C., GPU speed influences only partially task completion time.
During the task, the agent first interacts with the environments for a while, then uses the GPU to process the collected data and learn from it, then interacts again with the environments, and so on.
In the last batch, I reduced the total amount of agent-environment interactions gathered and processed before ending the task with respect to the previous batch, which should have reduced the completion time.
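The alternation described above can be sketched schematically. This is a pure-Python stand-in, not the project's actual code: the real jobs use PyTorch and real environments, and all names and sizes here are made up. It illustrates why completion time depends only partially on GPU speed:

```python
import random

def run_task(total_samples=1000, rollout_len=100):
    """Hypothetical stand-in for the collect/learn cycle of one task."""
    collected, updates = 0, 0
    while collected < total_samples:
        # Phase 1 (CPU-bound): the agent steps through the environment
        rollout = [random.random() for _ in range(rollout_len)]
        collected += len(rollout)
        # Phase 2 (GPU-bound in the real app): learn from the batch;
        # here a dummy reduction stands in for a PyTorch update step
        _loss = sum(rollout) / len(rollout)
        updates += 1
    return collected, updates
```

This alternation is also why GPU utilization oscillates: the GPU sits mostly idle during phase 1 and spikes during phase 2.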
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
I will look into the reported issues before sending the next batch, to see if I can find a solution for both the problem of jobs being killed due to “exceeded time limit” and the progress and checkpointing problems.
From what Ian&Steve C. mentioned, I understand that increasing the "Estimated Computation Size", however BOINC calculates that, could solve the problem of jobs being killed?
Thank you very much for your feedback. Happy holidays to everyone!
____________
|
|
|
|
From what Ian&Steve C. mentioned, I understand that increasing the "Estimated Computation Size", however BOINC calculates that, could solve the problem of jobs being killed?
The jobs reach us with a workunit description:
<workunit>
<name>e1a24-ABOU_rnd_ppod_11-0-1-RND1891</name>
<app_name>PythonGPU</app_name>
<version_num>401</version_num>
<rsc_fpops_est>5000000000000000.000000</rsc_fpops_est>
<rsc_fpops_bound>250000000000000000.000000</rsc_fpops_bound>
<rsc_memory_bound>4000000000.000000</rsc_memory_bound>
<rsc_disk_bound>10000000000.000000</rsc_disk_bound>
<file_ref>
<file_name>e1a24-ABOU_rnd_ppod_11-0-run</file_name>
<open_name>run.py</open_name>
<copy_file/>
</file_ref>
<file_ref>
<file_name>e1a24-ABOU_rnd_ppod_11-0-data</file_name>
<open_name>input.zip</open_name>
<copy_file/>
</file_ref>
<file_ref>
<file_name>e1a24-ABOU_rnd_ppod_11-0-requirements</file_name>
<open_name>requirements.txt</open_name>
<copy_file/>
</file_ref>
<file_ref>
<file_name>e1a24-ABOU_rnd_ppod_11-0-input_enc</file_name>
<open_name>input</open_name>
<copy_file/>
</file_ref>
</workunit>
It's the fourth line, '<rsc_fpops_est>', which causes the problem. The job size is given as the estimated number of floating point operations to be calculated, in total. BOINC uses this, along with the estimated speed of the device it's running on, to estimate how long the task will take. For a GPU app, it's usually the speed of the GPU that counts, but in this case - although it's described as a GPU app - the dominant factor might be the speed of the CPU. BOINC doesn't take any direct notice of that.
The jobs are killed when they reach the duration calculated from the next line, '<rsc_fpops_bound>'. A quick and dirty fix while testing might be to increase that value even above the current 50x the original estimate, but that removes a valuable safeguard during normal running. |
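To put rough numbers on this, here is a sketch of the arithmetic, using the values from the workunit above. The 30 TFLOPS device speed is an assumed figure, not anything from the project, and the real client also folds in its own device benchmark and a duration correction factor:

```python
# Values from the workunit description above
rsc_fpops_est = 5_000_000_000_000_000        # 5e15 flops, total job size estimate
rsc_fpops_bound = 250_000_000_000_000_000    # 50x the estimate, the kill threshold

# Assumed device speed (illustrative; BOINC uses its own estimate per device)
peak_flops = 30e12                           # ~30 TFLOPS

est_runtime_s = rsc_fpops_est / peak_flops   # BOINC's initial runtime estimate
kill_after_s = rsc_fpops_bound / peak_flops  # task aborted past this duration
```

The actual wall-clock limit depends on the speed figure BOINC assigns to the device and on the duration correction factor, which is why different hosts trip the 'exceeded time limit' at different run times.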
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
I see, thank you very much for the info. I asked Toni to help me adjust the "rsc_fpops_est" parameter. Hopefully the next jobs won't be aborted by the server.
Also, I checked the progress and the checkpointing problems. They were caused by format errors.
The python scripts were logging the progress into a "progress.txt" file but apparently BOINC wants just a file "progress" without extension.
Similarly, checkpoints were being generated, but were not identified correctly since they were not called "restart.chk".
I will work on fixing these issues before the next batch of tasks.
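A minimal sketch of the fix described above, writing the file names the wrapper expects ('progress' with no extension, and 'restart.chk'). The checkpoint contents and function name here are purely illustrative, not the project's actual format:

```python
import json
import os

def save_state(fraction, state, slot_dir="."):
    """Write BOINC-visible progress and a resumable checkpoint (sketch)."""
    # 'progress': a plain fraction in [0, 1], file name with no extension.
    # Write to a temp file first so a killed task never leaves a torn file.
    tmp = os.path.join(slot_dir, "progress.tmp")
    with open(tmp, "w") as f:
        f.write(f"{fraction:.4f}")
    os.replace(tmp, os.path.join(slot_dir, "progress"))  # atomic rename

    # 'restart.chk': whatever the script needs to resume (format illustrative)
    tmp = os.path.join(slot_dir, "restart.chk.tmp")
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, os.path.join(slot_dir, "restart.chk"))
```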
____________
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Thanks @abouh for working with us in debugging your application and work units.
Nice to have a attentive and easy to work with researcher.
Looking forward to the next batch. |
|
|
|
Thank you for your kind support.
During the task, the agent first interacts with the environments for a while, then uses the GPU to process the collected data and learn from it, then interacts again with the environments, and so on.
This behavior can be seen at some tests described at my Managing non-high-end hosts thread. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
I just sent another batch of tasks.
I tested locally and the progress and the restart.chk files are correctly generated and updated.
rsc_fpops_est job parameter should be higher too now.
Please let us know if you think the success rate of tasks can be improved in any other way. Thanks a lot for your help.
____________
|
|
|
|
I just sent another batch of tasks.
Thank you very much for this kind of Christmas present!
Merry Christmas to everyone crunchers worldwide 🎄✨ |
|
|
|
1,000,000,000 GFLOPs - initial estimate 1690d 21:37:58. That should be enough!
I'll watch this one through, but after that I'll be away for a few days - happy holidays, and we'll pick up again on the other side.
Edit: Progress %age jumps to 10% after the initial unpacking phase, then increments every 0.9%. That'll do. |
|
|
|
I tested locally and the progress and the restart.chk files are correctly generated and updated.
rsc_fpops_est job parameter should be higher too now.
In a preliminary sight of one new Python GPU task received today:
- Progress estimation is now working properly, updating in 0.9% increments.
- Estimated computation size has risen to 1,000,000,000 GFLOPs, as also confirmed by Richard Haselgrove
- Checkpointing seems to be working also, and a checkpoint is stored about every two minutes.
- Learning cycle period has dropped from the 21 seconds observed in the previous task to 11 seconds (as seen with sudo nvidia-smi dmon)
- GPU dedicated RAM usage seems to have been reduced, but I don't know if it's enough for running on 4 GB RAM GPUs (?)
- Current progress for task e1a20-ABOU_rnd_ppod_13-0-1-RND1192_0 is 28.9% after 2 hours and 13 minutes running. This leads to a total true execution time of about 7 hours and 41 minutes at my Host #569442
Well done! |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Same observed behavior. Gpu memory halved, progress indicator normal and GFLOPS in line with actual usage.
Well done. |
|
|
|
- GPU dedicated RAM usage seems to have been reduced, but I don't know if it's enough for running on 4 GB RAM GPUs (?)
I'm answering myself: I enabled Python GPU task requests on my GTX 1650 SUPER 4 GB system, and I happened to catch this previously failed task e1a21-ABOU_rnd_ppod_13-0-1-RND2308_1
This task has passed the initial processing steps, and has reached the learning cycle phase.
At this point, memory usage is just at the limit of the 4 GB GPU available RAM.
Waiting to see whether this task will be succeeding or not.
System RAM usage keeps being very high.
99% of the 16 GB available RAM at this system is currently in use. |
|
|
|
- Current progress for task e1a20-ABOU_rnd_ppod_13-0-1-RND1192_0 is 28.9% after 2 hours and 13 minutes running. This leads to a total true execution time of about 7 hours and 41 minutes at my Host #569442
That's roughly the figure I got in the early stages of today's tasks. But task 32731884 has just finished with
<result>
<name>e1a17-ABOU_rnd_ppod_13-0-1-RND0389_3</name>
<final_cpu_time>59637.190000</final_cpu_time>
<final_elapsed_time>39080.805144</final_elapsed_time>
That's very similar (and on the same machine) as the one I reported in message 58193. So I don't think the task duration has changed much: maybe the progress %age isn't quite linear (but not enough to worry about). |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Hello,
reviewing which jobs failed in the last batches I have seen several times this error:
21:28:07 (152316): wrapper (7.7.26016): starting
21:28:07 (152316): wrapper (7.7.26016): starting
21:28:07 (152316): wrapper: running /usr/bin/flock (/var/lib/boinc-client/projects/www.gpugrid.net/miniconda.lock -c "/bin/bash ./miniconda-installer.sh -b -u -p /var/lib/boinc-client/projects/www.gpugrid.net/miniconda &&
/var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/conda install -m -y -p gpugridpy --file requirements.txt ")
[152341] INTERNAL ERROR: cannot create temporary directory!
[152345] INTERNAL ERROR: cannot create temporary directory!
21:28:08 (152316): /usr/bin/flock exited; CPU time 0.147100
21:28:08 (152316): app exit status: 0x1
21:28:08 (152316): called boinc_finish(195
I have found an issue from Richard Haselgrove talking about this error: https://github.com/BOINC/boinc/issues/4125
It seems like the users getting this error could simply solve it by setting PrivateTmp=true. Is that correct? What is the appropriate way to modify that?
____________
|
|
|
|
It seems like the users getting this error could simply solve it by setting PrivateTmp=true. Is that correct? What is the appropriate way to modify that?
Right.
I gave a step-by-step solution based on Richard Haselgrove finding at my Message #55986
It worked fine for all my hosts. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Thank you!
____________
|
|
|
|
Some new (to me) errors in https://www.gpugrid.net/result.php?resultid=32732017
"During handling of the above exception, another exception occurred:"
"ValueError: probabilities are not non-negative" |
|
|
|
it seems checkpointing still isn't working correctly.
despite BOINC "claiming" that it checkpointed X seconds ago, stopping BOINC and restarting shows that it's not resuming from the checkpoint.
The task I currently have in progress was ~20% completed. I stopped BOINC and restarted, and it retained the time (elapsed and CPU time) but progress reset to 10%.
____________
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
I saw the same issue on my last task which was checkpointed past 20% yet reset to 10% upon restart. |
|
|
|
- GPU dedicated RAM usage seems to have been reduced, but I don't know if it's enough for running on 4 GB RAM GPUs (?)
Two of my hosts with 4 GB dedicated RAM GPUs have completed their latest Python GPU tasks successfully so far.
If GPU RAM requirements are planned to stay this way, it opens the app up to a much greater number of hosts.
Also I happened to catch two simultaneous Python tasks at my triple GTX 1650 GPU host.
I then urgently suspended requesting Gpugrid tasks in BOINC Manager... Why?
This host's system RAM size is 32 GB.
When the second Python task started, free system RAM decreased to 1% (!).
I roughly estimate that the environment for each Python task takes about 16 GB of system RAM.
I guess that an eventual third concurrent task might have crashed itself, or even crashed all three Python tasks due to lack of system RAM.
I was watching the Psensor readings when the first of the two Python tasks finished, and the free system memory then drastically increased from 1% to 38%.
I also took an nvidia-smi screenshot, where it can be seen that the Python tasks were running on GPU 0 and GPU 1 respectively, while GPU 2 was processing a PrimeGrid CUDA GPU task.
|
|
|
|
now that I've upgraded my single 3080Ti host from a 5950X w/16GB ram to a 7402P/128GB ram, I want to see if I can even run 2x GPUGRID tasks on the same GPU. I see about 5GB VRAM use on the tasks I've processed so far. so with so much extra system ram and 12GB VRAM, it might work lol.
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Regarding the checkpointing problem, the approach I follow is to check the progress file (if exists) at the beginning of the python script and then continue the job from there.
I have tested locally stopping the task and executing the python script again, and it continues from the same point where it stopped. So the script seems correct.
However, I think that right after setting up the conda environment, the progress is set automatically to 10% before running my script, so I am guessing this is what is causing the problem. I have modified my code not to rely only on the progress file, since it might be overwritten after every conda setup to be at 10%.
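The resulting resume logic might look like this sketch: trust restart.chk rather than the progress file, since the wrapper can reset the latter to 10% after the conda setup. The checkpoint format and function name are illustrative, not the project's actual code:

```python
import json
import os

def load_resume_point(slot_dir="."):
    """Resume from the checkpoint file, ignoring the 'progress' file (sketch)."""
    chk = os.path.join(slot_dir, "restart.chk")
    if os.path.exists(chk):
        with open(chk) as f:
            return json.load(f)   # saved training state, e.g. {"samples": 250}
    return {"samples": 0}         # no checkpoint found: fresh start
```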
____________
|
|
|
mmonninSend message
Joined: 2 Jul 16 Posts: 337 Credit: 7,617,724,223 RAC: 11,001,670 Level
Scientific publications
|
now that I've upgraded my single 3080Ti host from a 5950X w/16GB ram to a 7402P/128GB ram, I want to see if I can even run 2x GPUGRID tasks on the same GPU. I see about 5GB VRAM use on the tasks I've processed so far. so with so much extra system ram and 12GB VRAM, it might work lol.
The last two tasks on my system with a 3080Ti ran concurrently and completed successfully.
https://www.gpugrid.net/results.php?hostid=477247 |
|
|
|
Errors in e6a12-ABOU_rnd_ppod_15-0-1-RND6167_2 (created today):
"wandb: Waiting for W&B process to finish, PID 334655... (failed 1). Press ctrl-c to abort syncing."
"ValueError: demo dir contains more than ´total_buffer_demo_capacity´" |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
One user mentioned that he could not solve the error
INTERNAL ERROR: cannot create temporary directory!
This is the configuration he is using:
### Editing /etc/systemd/system/boinc-client.service.d/override.conf
### Anything between here and the comment below will become the new contents of the file
PrivateTmp=true
### Lines below this comment will be discarded
### /lib/systemd/system/boinc-client.service
# [Unit]
# Description=Berkeley Open Infrastructure Network Computing Client
# Documentation=man:boinc(1)
# After=network-online.target
#
# [Service]
# Type=simple
# ProtectHome=true
# ProtectSystem=strict
# ProtectControlGroups=true
# ReadWritePaths=-/var/lib/boinc -/etc/boinc-client
# Nice=10
# User=boinc
# WorkingDirectory=/var/lib/boinc
# ExecStart=/usr/bin/boinc
# ExecStop=/usr/bin/boinccmd --quit
# ExecReload=/usr/bin/boinccmd --read_cc_config
# ExecStopPost=/bin/rm -f lockfile
# IOSchedulingClass=idle
# # The following options prevent setuid root as they imply
NoNewPrivileges=true
# # Since Atlas requires setuid root, they break Atlas
# # In order to improve security, if you're not using Atlas,
# # Add these options to the [Service] section of an override file using
# # sudo systemctl edit boinc-client.service
# #NoNewPrivileges=true
# #ProtectKernelModules=true
# #ProtectKernelTunables=true
# #RestrictRealtime=true
# #RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
# #RestrictNamespaces=true
# #PrivateUsers=true
# #CapabilityBoundingSet=
# #MemoryDenyWriteExecute=true
# #PrivateTmp=true #Block X11 idle detection
#
# [Install]
# WantedBy=multi-user.target
I was just wondering if there is any possible reason why it should not work
____________
|
|
|
|
I am using a systemd file generated from a PPA maintained by Gianfranco Costamagna. It's automatically generated from Debian sources, and kept up-to-date with new releases automatically. It's currently supplying a BOINC suite labelled v7.16.17
The full, unmodified, contents of the file are
[Unit]
Description=Berkeley Open Infrastructure Network Computing Client
Documentation=man:boinc(1)
After=network-online.target
[Service]
Type=simple
ProtectHome=true
PrivateTmp=true
ProtectSystem=strict
ProtectControlGroups=true
ReadWritePaths=-/var/lib/boinc -/etc/boinc-client
Nice=10
User=boinc
WorkingDirectory=/var/lib/boinc
ExecStart=/usr/bin/boinc
ExecStop=/usr/bin/boinccmd --quit
ExecReload=/usr/bin/boinccmd --read_cc_config
ExecStopPost=/bin/rm -f lockfile
IOSchedulingClass=idle
# The following options prevent setuid root as they imply NoNewPrivileges=true
# Since Atlas requires setuid root, they break Atlas
# In order to improve security, if you're not using Atlas,
# Add these options to the [Service] section of an override file using
# sudo systemctl edit boinc-client.service
#NoNewPrivileges=true
#ProtectKernelModules=true
#ProtectKernelTunables=true
#RestrictRealtime=true
#RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
#RestrictNamespaces=true
#PrivateUsers=true
#CapabilityBoundingSet=
#MemoryDenyWriteExecute=true
[Install]
WantedBy=multi-user.target
That has the 'PrivateTmp=true' line in the [Service] section of the file, rather than isolated at the top as in your example. I don't know Linux well enough to know how critical the positioning is.
We had long discussions in the BOINC development community a couple of years ago, when it was discovered that the 'PrivateTmp=true' setting blocked access to BOINC's X-server based idle detection. The default setting was reversed for a while, until it was discovered that the reverse 'PrivateTmp=false' setting caused the problem creating temporary directories that we observe here. I think that the default setting was reverted to true, but the discussion moved into the darker reaches of the Linux package maintenance managers, and the BOINC development cycle became somewhat disjointed. I'm no longer fully up-to-date with the state of play. |
|
|
|
A simpler answer might be
### Lines below this comment will be discarded
so the file as posted won't do anything at all - in particular, it won't run BOINC! |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Thank you! I reviewed the code and detected the source of the error. I am currently working to solve it.
I will do local tests and then send a small batch of short tasks to GPUGrid to test the fixed version of the scripts before sending the next big batch.
____________
|
|
|
|
Everybody seems to be getting the same error in today's tasks:
"AttributeError: 'PPODBuffer' object has no attribute 'num_loaded_agent_demos'" |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
I believe I got one of the fixed test tasks this morning, based on the short crunch time and valid report.
No sign of the previous error.
https://www.gpugrid.net/result.php?resultid=32732671 |
|
|
|
Yes, your workunit was "created 7 Jan 2022 | 17:50:07 UTC" - that's a couple of hours after the ones I saw. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
I just sent a batch that seems to fail with
File "/var/lib/boinc-client/slots/30/python_dependencies/ppod_buffer_v2.py", line 325, in before_gradients
if self.iter % self.save_demos_every == 0:
TypeError: unsupported operand type(s) for %: 'int' and 'NoneType'
For some reason it did not crash locally. "Fortunately" it will crash after only a few minutes, and it is easy to solve. I am very sorry for the inconvenience...
I will also send a corrected batch with tasks of normal duration. I have tried to reduce the GPU memory requirements a bit in the new tasks.
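For reference, the crash is easy to reproduce in isolation: a periodic-save interval left as None makes the modulo raise, so it needs an explicit guard (variable names here are simplified from the traceback, not the production code):

```python
# Sketch of the failing line: `%` on int and None raises TypeError,
# so a missing config value needs a guard before the modulo.
save_demos_every = None  # e.g. parameter missing from the task config
iteration = 325

msg = None
try:
    _ = iteration % save_demos_every == 0
except TypeError as e:
    msg = str(e)
print(msg)  # unsupported operand type(s) for %: 'int' and 'NoneType'

# Guarded version: only save periodically when an interval was configured
should_save = save_demos_every is not None and iteration % save_demos_every == 0
print(should_save)
```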
____________
|
|
|
|
Got one of those - failed as you describe.
Also has the error message "AttributeError: 'GWorker' object has no attribute 'batches'".
Edit - had a couple more of the broken ones, but one created at 10:40:34 UTC seems to be running OK. We'll know later! |
|
|
FritzBSend message
Joined: 7 Apr 15 Posts: 12 Credit: 2,769,441,100 RAC: 3,409,357 Level
Scientific publications
|
I got 20 bad WU's today on this host: https://www.gpugrid.net/results.php?hostid=520456
Stderr output
<core_client_version>7.16.6</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
13:25:53 (6392): wrapper (7.7.26016): starting
13:25:53 (6392): wrapper (7.7.26016): starting
13:25:53 (6392): wrapper: running /usr/bin/flock (/var/lib/boinc-client/projects/www.gpugrid.net/miniconda.lock -c "/bin/bash ./miniconda-installer.sh -b -u -p /var/lib/boinc-client/projects/www.gpugrid.net/miniconda &&
/var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/conda install -m -y -p gpugridpy --file requirements.txt ")
0%| | 0/45 [00:00<?, ?it/s]
concurrent.futures.process._RemoteTraceback:
'''
Traceback (most recent call last):
File "concurrent/futures/process.py", line 368, in _queue_management_worker
File "multiprocessing/connection.py", line 251, in recv
TypeError: __init__() missing 1 required positional argument: 'msg'
'''
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "entry_point.py", line 69, in <module>
File "concurrent/futures/process.py", line 484, in _chain_from_iterable_of_lists
File "concurrent/futures/_base.py", line 611, in result_iterator
File "concurrent/futures/_base.py", line 439, in result
File "concurrent/futures/_base.py", line 388, in __get_result
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
[6689] Failed to execute script entry_point
13:25:58 (6392): /usr/bin/flock exited; CPU time 3.906269
13:25:58 (6392): app exit status: 0x1
13:25:58 (6392): called boinc_finish(195)
</stderr_txt>
]]>
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
I errored out 12 tasks created from 10:09:55 to 10:40:06.
Those all have the batch error.
But have 3 tasks created from 10:41:01 to 11:01:56 still running normally |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
And two of those were the batch error resends that now have failed.
Only 1 still processing that I assume is of the fixed variety. 8 hours elapsed currently.
https://www.gpugrid.net/result.php?resultid=32732855 |
|
|
|
You need to look at the creation time of the master WU, not of the individual tasks (which will vary, even within a WU, let alone a batch of WUs). |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
I have seen this error a few times.
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
Do you think it could be due to a lack of resources? I think Linux starts killing processes if you are over capacity.
____________
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Might be the OOM-Killer kicking in. You would need to grep -i kill /var/log/messages*
to check if processes were killed by the OOM-Killer.
If that is the case you would have to configure /etc/sysctl.conf to let the system be less sensitive to brief out of memory conditions. |
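If grepping the syslog isn't convenient, the same check is a few lines of Python. The log location varies by distro (/var/log/messages, /var/log/kern.log, or `journalctl -k`), so the sample lines below are illustrative only:

```python
# Sketch: filter kernel log lines for OOM-killer activity.
def oom_events(lines):
    keywords = ("oom-killer", "out of memory")
    return [l for l in lines if any(k in l.lower() for k in keywords)]

# Illustrative sample lines, not real log output
sample = [
    "kernel: python invoked oom-killer: gfp_mask=0x100cca, order=0",
    "kernel: Out of memory: Killed process 6689 (python)",
    "kernel: usb 1-1: new high-speed USB device",
]
hits = oom_events(sample)
print(len(hits))  # 2
```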
|
|
|
I Googled the error message, and came up with this stackoverflow thread.
The problem seems to be specific to Python, and arises when running concurrent modules. There's a quote from the Python manual:
"The main module must be importable by worker subprocesses. This means that ProcessPoolExecutor will not work in the interactive interpreter. Calling Executor or Future methods from a callable submitted to a ProcessPoolExecutor will result in deadlock."
Other search results may provide further clues. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Thanks! Out of the possible explanations listed in the thread, I suspect it could be the OS killing the processes due to a lack of resources. It could be not enough RAM, or maybe Python raises this error if the ratio of processes to cores is too high? (I have seen some machines with 4 CPUs, and the task spawns 32 reinforcement learning environments.)
All tasks run the same code, and on the majority of GPUGrid machines this error does not occur. Also, I have reviewed the failed jobs, and this error always occurs on the same hosts, so it is something specific to those machines. I will check whether I can find a common pattern among the hosts that get this error.
____________
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
What version of Python are the hosts that have the errors running?
Mine for example is:
python3 --version
Python 3.8.10
What kernel and OS?
Linux 5.11.0-46-generic x86_64
Ubuntu 20.04.3 LTS
I've had the errors on hosts with 32GB and 128GB. I would assume the hosts with 128GB to be in the clear with no memory pressures. |
|
|
|
What version of Python are the hosts that have the errors running?
Mine for example is:
python3 --version
Python 3.8.10
Same Python version as mine currently.
In case of doubt about conflicting Python versions, I published the solution that I applied to my hosts at Message #57833
It worked for my Ubuntu 20.04.3 LTS Linux distribution, but user mmonnin replied that this didn't work for him.
mmonnin kindly published an alternative way at his Message #57840 |
|
|
mmonninSend message
Joined: 2 Jul 16 Posts: 337 Credit: 7,617,724,223 RAC: 11,001,670 Level
Scientific publications
|
I saw the prior post and was about to mention the same thing. Not sure which one works, as the PC has been able to run tasks.
The recent tasks are taking a really long time:
2d13h 62.2% 1070 and 1080 GPU system
2d15h 60.4% 1070 and 1080 GPU system
2x concurrently on 3080Ti
2d12h 61.3%
2d14h 60.4% |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
All jobs should use the same python version (3.8.10), I define it in the requirements.txt file of the conda environment.
Here are the specs from 3 hosts that failed with the BrokenProcessPool error:
OS:
Linux Debian Debian GNU/Linux 11 (bullseye) [5.10.0-10-amd64|libc 2.31 (Debian GLIBC 2.31-13+deb11u2)]
Linux Ubuntu Ubuntu 20.04.3 LTS [5.4.0-94-generic|libc 2.31 (Ubuntu GLIBC 2.31-0ubuntu9.3)]
Linux Linuxmint Linux Mint 20.2 [5.4.0-91-generic|libc 2.31 (Ubuntu GLIBC 2.31-0ubuntu9.2)]
Memory:
32081.92 MB
32092.04 MB
9954.41 MB
____________
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
I have a failed task today involving pickle.
magic_number = pickle_module.load(f, **pickle_load_args)
EOFError: Ran out of input
When I was investigating the brokenprocesspool error I saw posts that involved the word pickle and the fixes for that error.
https://www.gpugrid.net/result.php?resultid=32733573 |
|
|
|
The tasks run on my Tesla K20 for a while, but then fail when they need to use PyTorch, which requires a higher CUDA compute capability. Oh well. Guess I'll stick to the ACEMD tasks. The error output doesn't list the requirements properly, but from a little Googling, PyTorch was updated to require compute capability 3.7 within the past couple of years. The only Kepler card that has 3.7 is the Tesla K80.
From this task:
[W NNPACK.cpp:79] Could not initialize NNPACK! Reason: Unsupported hardware.
/var/lib/boinc-client/slots/2/gpugridpy/lib/python3.8/site-packages/torch/cuda/__init__.py:120: UserWarning:
Found GPU%d %s which is of cuda capability %d.%d.
PyTorch no longer supports this GPU because it is too old.
The minimum cuda capability supported by this library is %d.%d.
While I'm here, is there any way to force the project to update my hardware configuration? It thinks my host has two Quadro K620s instead of one of those and the Tesla. |
|
|
|
While I'm here, is there any way to force the project to update my hardware configuration? It thinks my host has two Quadro K620s instead of one of those and the Tesla.
this is a problem (feature?) of BOINC, not the project. the project only knows what hardware you have based on what BOINC communicates to the project.
with cards from the same vendor (nvidia/AMD/Intel) BOINC only lists the "best" card and then appends a number that's associated with how many total devices you have from that vendor. it will only list different models if they are from different vendors.
within the nvidia vendor group, BOINC figures out the "best" device by checking the compute capability first, then memory capacity, then some third metric that i cant remember right now. BOINC deems the K620 to be "best" because it has a higher compute capability (5.0) than the Tesla K20 (3.5) even though the K20 is arguably the better card with more/faster memory and more cores.
all in all, this has nothing to do with the project, and everything to do with BOINC's GPU ranking code.
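the ranking rule described above can be sketched like this (illustrative only, not BOINC's actual code):

```python
# Compare compute capability first, then memory; ties keep enumeration
# order because max() returns the first maximal element.
def best_gpu(gpus):
    return max(gpus, key=lambda g: (g["cc"], g["mem_mb"]))

gpus = [
    {"name": "Quadro K620", "cc": (5, 0), "mem_mb": 2048},
    {"name": "Tesla K20",   "cc": (3, 5), "mem_mb": 5120},
]
print(best_gpu(gpus)["name"])  # Quadro K620: higher CC wins despite less memory
```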
____________
|
|
|
mmonninSend message
Joined: 2 Jul 16 Posts: 337 Credit: 7,617,724,223 RAC: 11,001,670 Level
Scientific publications
|
While I'm here, is there any way to force the project to update my hardware configuration? It thinks my host has two Quadro K620s instead of one of those and the Tesla.
this is a problem (feature?) of BOINC, not the project. the project only knows what hardware you have based on what BOINC communicates to the project.
with cards from the same vendor (nvidia/AMD/Intel) BOINC only lists the "best" card and then appends a number that's associated with how many total devices you have from that vendor. it will only list different models if they are from different vendors.
within the nvidia vendor group, BOINC figures out the "best" device by checking the compute capability first, then memory capacity, then some third metric that i cant remember right now. BOINC deems the K620 to be "best" because it has a higher compute capability (5.0) than the Tesla K20 (3.5) even though the K20 is arguably the better card with more/faster memory and more cores.
all in all, this has nothing to do with the project, and everything to do with BOINC's GPU ranking code.
Its often said as the "Best" card but its just the 1st
https://www.gpugrid.net/show_host_detail.php?hostid=475308
This host has a 1070 and 1080 but just shows 2x 1070s as the 1070 is in the 1st slot. Any way to check for a "best" would come up with the 1080. Or the 1070Ti that used to be there with the 1070. |
|
|
|
Its often said as the "Best" card but its just the 1st
https://www.gpugrid.net/show_host_detail.php?hostid=475308
This host has a 1070 and 1080 but just shows 2x 1070s as the 1070 is in the 1st slot. Any way to check for a "best" would come up with the 1080. Or the 1070Ti that used to be there with the 1070.
In your case, the metrics that BOINC is looking at are identical between the two cards (actually all three of the 1070, 1070Ti, and 1080 have identical specs as far as BOINC ranking is concerned). All have the same amount of VRAM and have the same compute capability. So the tie goes to device number I guess. If you were to swap the 1080 for even a weaker card with a better CC (like a GTX 1650) then that would get picked up instead, even when not in the first slot.
____________
|
|
|
|
Ah, I get it. I thought it was just stuck, because it did have two K620s before. I didn't realize BOINC was just incapable of acknowledging different cards from the same vendor. Does this affect project statistics? The Milkyway@home folks are gonna have real inflated opinions of the K620 next time they check the numbers haha |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Interesting, I had seen this error once before locally, and I assumed it was due to a corrupted input file.
I have reviewed the task, and it was solved by another host, but only after multiple failed attempts with this pickle error.
Thank you for bringing it up! I will review the code to see if I can find any bug related to that.
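For what it's worth, an empty or truncated checkpoint file reproduces that exact message, and a guarded loader (a hypothetical helper, not the production code) can treat it as "no checkpoint":

```python
import os
import pickle
import tempfile

# pickle raises EOFError("Ran out of input") on an empty stream
try:
    pickle.loads(b"")
except EOFError as e:
    print(e)  # Ran out of input

# Hypothetical guarded loader: missing/zero-byte files mean "no checkpoint"
def load_checkpoint(path):
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        return None
    with open(path, "rb") as f:
        return pickle.load(f)

with tempfile.TemporaryDirectory() as d:
    good = os.path.join(d, "demo.pkl")
    with open(good, "wb") as f:
        pickle.dump({"iter": 42}, f)
    empty = os.path.join(d, "empty.pkl")
    open(empty, "wb").close()
    loaded = load_checkpoint(good)
    missing = load_checkpoint(empty)
    print(loaded, missing)  # {'iter': 42} None
```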
____________
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
This is the document I had found about fixing the BrokenProcessPool error.
https://stackoverflow.com/questions/57031253/how-to-fix-brokenprocesspool-error-for-concurrent-futures-processpoolexecutor
I was reading it and stumbled upon the word "pickle" and the verb "picklable", and thought it funny, as I had never heard that word associated with computing before.
When the latest failed task mentioned pickle in the output, it tied it right back to all the previous BrokenProcessPool errors. |
|
|
klepelSend message
Joined: 23 Dec 09 Posts: 189 Credit: 4,718,504,360 RAC: 2,214,787 Level
Scientific publications
|
@abouh: Thank you for PM me twice!
The Experimental Python tasks (beta) now succeed miraculously on my two Linux computers (which previously produced only errors), after several restarts of the GPUGRID.net project and the latest distro update this week.
|
|
|
|
Also, I happened to catch two simultaneous Python tasks on my triple GTX 1650 GPU host.
I then urgently suspended requesting GPUGrid tasks in BOINC Manager... Why?
This host's system RAM size is 32 GB.
When the second Python task started, free system RAM decreased to 1% (!).
After upgrading system RAM from 32 GB to 64 GB on the above-mentioned host, it has successfully processed three concurrent ABOU Python GPU tasks:
e2a43-ABOU_rnd_ppod_baseline_rnn-0-1-RND6933_3 - Link: https://www.gpugrid.net/result.php?resultid=32733458
e2a21-ABOU_rnd_ppod_baseline_rnn-0-1-RND3351_3 - Link: https://www.gpugrid.net/result.php?resultid=32733477
e2a27-ABOU_rnd_ppod_baseline_rnn-0-1-RND5112_1 - Link: https://www.gpugrid.net/result.php?resultid=32733441
More details at Message #58287 |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Hello everyone,
I have seen a new error in some jobs:
Traceback (most recent call last):
File "run.py", line 444, in <module>
main()
File "run.py", line 62, in main
wandb.login(key=str(args.wandb_key))
File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/sdk/wandb_login.py", line 65, in login
configured = _login(**kwargs)
File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/sdk/wandb_login.py", line 268, in _login
wlogin.configure_api_key(key)
File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/sdk/wandb_login.py", line 154, in configure_api_key
apikey.write_key(self._settings, key)
File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/sdk/lib/apikey.py", line 223, in write_key
api.clear_setting("anonymous", globally=True, persist=True)
File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/apis/internal.py", line 75, in clear_setting
return self.api.clear_setting(*args, **kwargs)
File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/apis/internal.py", line 19, in api
self._api = InternalApi(*self._api_args, **self._api_kwargs)
File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/sdk/internal/internal_api.py", line 78, in __init__
self._settings = Settings(
File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/old/settings.py", line 23, in __init__
self._global_settings.read([Settings._global_path()])
File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/old/settings.py", line 110, in _global_path
util.mkdir_exists_ok(config_dir)
File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/site-packages/wandb/util.py", line 793, in mkdir_exists_ok
os.makedirs(path)
File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/os.py", line 213, in makedirs
makedirs(head, exist_ok=exist_ok)
File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/os.py", line 213, in makedirs
makedirs(head, exist_ok=exist_ok)
File "/home/boinc-client/slots/1/gpugridpy/lib/python3.8/os.py", line 223, in makedirs
mkdir(name, mode)
OSError: [Errno 30] Read-only file system: '/var/lib/boinc-client'
18:56:50 (54609): ./gpugridpy/bin/python exited; CPU time 42.541031
18:56:50 (54609): app exit status: 0x1
18:56:50 (54609): called boinc_finish(195)
</stderr_txt>
It seems like the task is not allowed to create new directories inside its working directory. I am just wondering whether it could be some kind of configuration problem, just like the "INTERNAL ERROR: cannot create temporary directory!" one, for which a solution was already shared.
____________
|
|
|
|
My question would be: what is the working directory?
The individual line errors concern
/home/boinc-client/slots/1/...
but the final failure concerns
/var/lib/boinc-client
That sounds like a mixed-up installation of BOINC: 'home' sounds like a location for a user-mode installation of BOINC, but '/var/lib/' would be normal for a service mode installation. It's reasonable for the two different locations to have different write permissions.
What app is doing the writing in each case, and what account are they running under?
Could the final write location be hard-coded, but the others dependent on locations supplied by the local BOINC installation? |
|
|
|
Hi,
I have the same issue regarding the BOINC directory (my BOINC dir is set up as ~/boinc).
So I cleaned up the ~/.conda directory and re-attached the GPUGRID project in the BOINC client.
Now flock detects the right running BOINC directory, but I have this errored task:
https://www.gpugrid.net/result.php?resultid=32734225
./gpugridpy/bin/python (I think this is in the boinc/slots/<N>/ folder)
The WU is running and 0.43% completed, but /home/<user>/boinc/slots/11/gpugridpy is still empty. No data are written. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Right so the working directory is
/home/boinc-client/slots/1/...
to which the script has full access. The script tries to create a directory to save the logs, but I guess it should not do it in
/var/lib/boinc-client
So I think the problem is just that the package I am using to log results by default saves them outside the working directory. Should be easy to fix.
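One way to do that (environment variable names taken from the wandb documentation; worth verifying against the wandb version actually shipped in the conda env) is to point wandb's writable locations inside the slot directory before it is used:

```python
# Sketch: redirect wandb's global config/log locations into the task's
# working directory, so nothing is written outside the BOINC slot.
import os

workdir = os.getcwd()  # in a BOINC task this would be the slot directory
os.environ["WANDB_CONFIG_DIR"] = os.path.join(workdir, ".config", "wandb")
os.environ["WANDB_DIR"] = os.path.join(workdir, ".wandb")
```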
____________
|
|
|
|
BOINC has the concept of a "data directory". Absolutely everything that has to be written should be written somewhere in that directory or its sub-directories. Everything else must be assumed to be sandboxed and inaccessible. |
|
|
mmonninSend message
Joined: 2 Jul 16 Posts: 337 Credit: 7,617,724,223 RAC: 11,001,670 Level
Scientific publications
|
Its often said as the "Best" card but its just the 1st
https://www.gpugrid.net/show_host_detail.php?hostid=475308
This host has a 1070 and 1080 but just shows 2x 1070s as the 1070 is in the 1st slot. Any way to check for a "best" would come up with the 1080. Or the 1070Ti that used to be there with the 1070.
In your case, the metrics that BOINC is looking at are identical between the two cards (actually all three of the 1070, 1070Ti, and 1080 have identical specs as far as BOINC ranking is concerned). All have the same amount of VRAM and have the same compute capability. So the tie goes to device number I guess. If you were to swap the 1080 for even a weaker card with a better CC (like a GTX 1650) then that would get picked up instead, even when not in the first slot.
The PC now has a 1080 and a 1080Ti, with the Ti having more VRAM. BOINC shows 2x 1080. The 1080 is GPU 0 in nvidia-smi, as were the other GPUs BOINC displayed. The Ti is in the physical 1st slot.
This PC happened to pick up two Python tasks. They aren't taking 4 days this time. 5:45 hr:min at 38.8% and 31 min at 11.8%. |
|
|
|
Its often said as the "Best" card but its just the 1st
https://www.gpugrid.net/show_host_detail.php?hostid=475308
This host has a 1070 and 1080 but just shows 2x 1070s as the 1070 is in the 1st slot. Any way to check for a "best" would come up with the 1080. Or the 1070Ti that used to be there with the 1070.
In your case, the metrics that BOINC is looking at are identical between the two cards (actually all three of the 1070, 1070Ti, and 1080 have identical specs as far as BOINC ranking is concerned). All have the same amount of VRAM and have the same compute capability. So the tie goes to device number I guess. If you were to swap the 1080 for even a weaker card with a better CC (like a GTX 1650) then that would get picked up instead, even when not in the first slot.
The PC now as 1080 and 1080Ti with the Ti having more VRAM. BOINC shows 2x 1080. The 1080 is GPU 0 in nvidia-smi and so have the other BOINC displayed GPUs. The Ti is in the physical 1st slot.
This PC happened to pick up two Python tasks. They aren't taking 4 days this time. 5:45 hr:min at 38.8% and 31 min at 11.8%.
what motherboard? and what version of BOINC? your hosts are hidden so I cannot inspect them myself. PCIe enumeration and ordering can be inconsistent on consumer boards. My server boards seem to enumerate starting from the slot furthest from the CPU socket, while most consumer boards are the opposite, with device0 at the slot closest to the CPU socket.
or do you perhaps run a locked coproc_info.xml file, this would prevent any GPU changes from being picked up by BOINC if it can't write to the coproc file.
edit:
also I forgot that most versions of BOINC incorrectly detect nvidia GPU memory. they will all max out at 4GB due to a bug in BOINC. So to BOINC your 1080Ti has the same amount of memory as your 1080. and since the 1080Ti is still a pascal card like the 1080, it has the same compute capability, so you're running into the same specs between them all still
to get it to sort properly, you need to fix BOINC code, or use a GPU with higher or lower compute capability. put a Turing card in the system not in the first slot and BOINC will pick it up as GPU0
____________
|
|
|
|
The tests continue. Just reported e2a13-ABOU_rnd_ppod_baseline_cnn_nophi_2-0-1-RND9761_1, with final stats
<result>
<name>e2a13-ABOU_rnd_ppod_baseline_cnn_nophi_2-0-1-RND9761_1</name>
<final_cpu_time>107668.100000</final_cpu_time>
<final_elapsed_time>46186.399529</final_elapsed_time>
That's an average CPU core count of 2.33 over the entire run - that's high for what is planned to be a GPU application. We can manage with that - I'm sure we all want to help develop and test the application for the coming research run - but I think it would be helpful to put more realistic usage values into the BOINC scheduler. |
|
|
GDFVolunteer moderator Project administrator Project developer Project tester Volunteer developer Volunteer tester Project scientist Send message
Joined: 14 Mar 07 Posts: 1957 Credit: 629,356 RAC: 0 Level
Scientific publications
|
It's not a GPU application. It uses both CPU and GPU. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Do you mean changing some of the BOINC parameters like it was done in the case of <rsc_fpops_est>?
Is that to better define the resources required by the tasks?
____________
|
|
|
|
It would need to be done in the plan class definition. Toni said that you define your plan classes in C++ code, so there are some examples in Specifying plan classes in C++.
Unfortunately, the BOINC developers didn't consider your use-case of mixing CPU elements and GPU elements in the same task, so none of the examples really match - your app is a mixture of MT and CUDA classes. What we need (or at least, would like to see) at this end are realistic values for <avg_ncpus> and <coproc><count>. |
|
|
FritzBSend message
Joined: 7 Apr 15 Posts: 12 Credit: 2,769,441,100 RAC: 3,409,357 Level
Scientific publications
|
it seems to work better now, but I've reached the time limit after 1800 sec
https://www.gpugrid.net/result.php?resultid=32734648
19:39:23 (6124): task /usr/bin/flock reached time limit 1800
application ./gpugridpy/bin/python missing
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
I'd like to hear what others are using for ncpus for their Python tasks in their app_config files.
I'm using:
<app>
<name>PythonGPU</name>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>5.0</cpu_usage>
</gpu_versions>
</app>
for all my hosts and they seem to like that. Haven't had any issues. |
|
|
|
I'm still running them at 1 CPU plus 1 GPU. They run fine, but when they are busy on the CPU-only sections, they steal time from the CPU tasks that are running at the same time - most obviously from CPDN.
Because these tasks are defined as GPU tasks, and GPU tasks are given a higher run priority than CPU tasks by BOINC ('below normal' against 'idle'), the real CPU project will always come off worst. |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
You could employ ProcessLasso on the apps and up their priority I suppose.
When I ran Windows, I really utilized that utility to make the apps run the way I wanted them to, and not how BOINC sets them up on its own agenda. |
|
|
|
I'd like to hear what others are using for ncpus for their Python tasks in their app_config files.
I think that the Python GPU app is very efficient at adapting to any number of CPU cores, taking advantage of the available CPU resources.
This seems to be somewhat independent of the ncpus parameter in the GPUGRID app_config.xml.
Setup at my twin GPU system is as follows:
<app>
<name>PythonGPU</name>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>0.49</cpu_usage>
</gpu_versions>
</app>
And setup for my triple GPU system is as follows:
<app>
<name>PythonGPU</name>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>0.33</cpu_usage>
</gpu_versions>
</app>
The purpose of this is being able to respectively run two or three concurrent Python GPU tasks without reaching a full "1" CPU core (2 x 0.49 = 0.98; 3 x 0.33 = 0.99). Then I manually control CPU usage by setting "Use at most XX % of the CPUs" in BOINC Manager for each system, according to its number of CPU cores.
This allows me to run concurrently "N" Python GPU tasks and a fixed number of other CPU tasks as desired.
But as said, Gpugrid Python GPU app seems to take CPU resources as needed for successfully processing its tasks... at the cost of slowing down the other CPU applications. |
|
|
|
Yes, I use Process Lasso on all my Windows machines, but I haven't explored its use under Linux.
Remember that ncpus and similar has no effect whatsoever on the actual running of a BOINC project app - there is no 'control' element to its operation. The only effect it has is on BOINC's scheduling - how many tasks are allowed to run concurrently. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
This message
19:39:23 (6124): task /usr/bin/flock reached time limit 1800
Indicates that, after 30 minutes, the installation of miniconda and the task environment setup had not finished.
Consequently, python is not found later on to execute the task, since it is one of the requirements of the miniconda environment.
application ./gpugridpy/bin/python missing
Therefore, it is not an error in itself; it just means that the miniconda setup went too slowly for some reason (in theory, 30 minutes should be enough time). Maybe the machine is slower than usual, or the connection is slow and dependencies are not being downloaded in time.
We could extend this timeout, but normally, if 30 minutes is not enough for the miniconda setup, another underlying problem could exist.
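The 1800-second limit comes from wrapping the environment setup in a watchdog (/usr/bin/flock in the log line enforces it). A rough Python sketch of the same pattern, using trivial placeholder commands instead of the real miniconda installer:

```python
import subprocess

SETUP_TIMEOUT = 1800  # seconds, matching the limit shown in the log

def run_setup(cmd, timeout=SETUP_TIMEOUT):
    """Run an installation step, treating a timeout as a soft failure
    (slow disk or network) rather than a hard error."""
    try:
        subprocess.run(cmd, check=True, timeout=timeout)
        return True
    except subprocess.TimeoutExpired:
        return False

# demonstration with stand-in commands instead of the miniconda installer
print(run_setup(["true"]))                     # True: finishes within the limit
print(run_setup(["sleep", "2"], timeout=0.1))  # False: simulated slow setup
```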
____________
|
|
|
|
it seems to be a reasonably fast system. my guess is another type of permissions issue which is blocking the python install and it hits the timeout, or the CPUs are being too heavily used and not giving enough resources to the extraction process.
____________
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
There is no Linux equivalent of Process Lasso.
But there is a Linux equivalent of Windows Process-Explorer
https://github.com/wolfc01/procexp
Screenshots of the application at the old SourceForge repo.
https://sourceforge.net/projects/procexp/
Can dynamically change the nice value of the application.
There is also the command line schedtool utility that can be easily implemented in a bash file. I used to run that all the time in my gpuoverclock.sh script for Seti cpu and gpu apps. |
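For reference, the same "dynamically change the nice value" idea is also available from Python's standard library (Unix only; unprivileged users can only *increase* a process's niceness):

```python
import os

def deprioritize(pid: int = 0, nice: int = 19) -> int:
    """Raise the nice value of a process (pid 0 = this process),
    similar in effect to `renice -n 19 -p PID`."""
    os.setpriority(os.PRIO_PROCESS, pid, nice)
    return os.getpriority(os.PRIO_PROCESS, pid)

print(deprioritize())  # 19: current process now yields CPU to other work
```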
|
|
|
Well, that got me a long way.
There are dependencies listed for Mint 18.3 - I'm running Mint 20.2
The apt-get for the older version of Mint returns
E: Unable to locate package python-qwt5-qt4
E: Unable to locate package python-configobj
Unsurprisingly, the next step returns
Traceback (most recent call last):
File "./procexp.py", line 27, in <module>
from PyQt5 import QtCore, QtGui, QtWidgets, uic
ModuleNotFoundError: No module named 'PyQt5'
htop, however, shows about 30 multitasking processes spawned from main, each using around 2% of a CPU core (varying by the second) at nice 19. At the time of inspection, that is. I'll go away and think about that. |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
I've one task now that had the same timeout issue getting python. The host was running fine on these tasks before and I don't know what has changed.
I've aborted a couple tasks now that are not making any progress after 20 hours or so and are stuck at 13% completion. Similar series tasks are showing much more progress after only a few minutes. Most complete in 5-6 hours.
I reset the project thinking something got corrupted in the downloaded libraries but that has not fixed anything.
Need to figure out how to debug the tasks on this host. |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
You might look into schedtool as an alternative. |
|
|
Aurum Send message
Joined: 12 Jul 17 Posts: 401 Credit: 16,755,010,632 RAC: 220,113 Level
Scientific publications
|
I'd like to hear what others are using for ncpus for their Python tasks in their app_config files.
I'm using:
<app>
<name>PythonGPU</name>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>5.0</cpu_usage>
</gpu_versions>
</app>
for all my hosts and they seem to like that. Haven't had any issues. Very interesting. Does this actually limit PythonGPU to using at most 5 cpu threads?
Does it work better than:
<app_config>
<!-- i9-7980XE 18c36t 32 GB L3 Cache 24.75 MB -->
<app>
<name>PythonGPU</name>
<plan_class>cuda1121</plan_class>
<gpu_versions>
<cpu_usage>1.0</cpu_usage>
<gpu_usage>1.0</gpu_usage>
</gpu_versions>
<avg_ncpus>5</avg_ncpus>
<cmdline>--nthreads 5</cmdline>
<fraction_done_exact/>
</app>
</app_config>
Edit 1: To answer my own question, I changed cpu_usage to 5 and am running a single PythonGPU WU with nothing else going on. The System Monitor shows 5 CPUs running in the 60 to 80% range, with all other CPUs running in the 10 to 40% range.
Is there any way to stop it from taking over ones entire computer?
Edit 2: I turned on WCG and the group of 5 went up to 100% and all the rest went to OPN in the 80 to 95% range. |
|
|
|
No. Setting that value won’t change how much CPU is actually used. It just tells BOINC how much of the CPU is being used so that it can properly account for resources.
This app will use 32 threads and there’s nothing you can do in BOINC configuration to change that. This has always been the case though.
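A minimal illustration of why (hypothetical code, not the project's actual app): nothing from app_config.xml is passed to the science application, so a Python program that sizes its worker pool from os.cpu_count() will use every core regardless of what <cpu_usage> says:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def work(x):
    return x * x

# <cpu_usage> in app_config.xml never reaches the science app; a pool
# sized like this always matches the hardware, not BOINC's accounting.
n_workers = os.cpu_count() or 1
with ThreadPoolExecutor(max_workers=n_workers) as pool:
    results = list(pool.map(work, range(8)))
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```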
____________
|
|
|
|
This morning, in a routine system update, I noticed that BOINC Client / Manager was updated from Version 7.16.17 to Version 7.18.1.
It would be interesting to know if PrivateTmp=true is set as a default in this new version, thus in some way helping Python GPU tasks to succeed... |
|
|
|
Which distro/repository are you using? I have Mint with Gianfranco Costamagna's PPA: that's usually the fastest to update, and I see v7.18.1 is being offered there as well - although I haven't installed it yet.
I'll check it out in the morning. v7.18.1 should be pretty good (it's been available for Android since August last year), but I don't yet know the answer to your specific question - there hasn't been any chatter about testing or new releases in the usual places. |
|
|
Jim1348Send message
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0 Level
Scientific publications
|
Which distro/repository are you using? I have Mint with Gianfranco Costamagna's PPA: that's usually the fastest to update, and I see v7.18.1 is being offered there as well - although I haven't installed it yet.
I'll check it out in the morning. v7.18.1 should be pretty good
It bombed out on the Rosetta pythons; they did not run at all (a VBox problem undoubtedly). And it failed all the validations on QuChemPedIA, which does not use VirtualBox on the Linux version. But it works OK on CPDN, WCG/ARP and Einstein/FGRBP (GPU). All were on Ubuntu 20.04.3.
So be prepared to bail out if you have to. |
|
|
|
Which distro/repository are you using?
I'm using the regular repository for Ubuntu 20.04.3 LTS
I took screenshot of offered updates before updating. |
|
|
|
My PPA gives slightly more information on the available update:
I know that it's auto-generated from the Debian package maintenance sources, which is probably the ultimate source of the Ubuntu LTS package as well. I've had a quick look round, but there's no sign so far that this release was originated by BOINC developers: in particular, no mention was made of it during the BOINC projects conference call on January 14th 2022. I'll keep digging. |
|
|
|
OK, I've taken a deep breath and enough coffee - applied all updates.
WARNING - the BOINC update appears to break things.
The new systemd file, in full, is
[Unit]
Description=Berkeley Open Infrastructure Network Computing Client
Documentation=man:boinc(1)
After=network-online.target
[Service]
Type=simple
ProtectHome=true
ProtectSystem=strict
ProtectControlGroups=true
ReadWritePaths=-/var/lib/boinc -/etc/boinc-client
Nice=10
User=boinc
WorkingDirectory=/var/lib/boinc
ExecStart=/usr/bin/boinc
ExecStop=/usr/bin/boinccmd --quit
ExecReload=/usr/bin/boinccmd --read_cc_config
ExecStopPost=/bin/rm -f lockfile
IOSchedulingClass=idle
# The following options prevent setuid root as they imply NoNewPrivileges=true
# Since Atlas requires setuid root, they break Atlas
# In order to improve security, if you're not using Atlas,
# Add these options to the [Service] section of an override file using
# sudo systemctl edit boinc-client.service
#NoNewPrivileges=true
#ProtectKernelModules=true
#ProtectKernelTunables=true
#RestrictRealtime=true
#RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
#RestrictNamespaces=true
#PrivateUsers=true
#CapabilityBoundingSet=
#MemoryDenyWriteExecute=true
#PrivateTmp=true #Block X11 idle detection
[Install]
WantedBy=multi-user.target
Note the line I've picked out. That starts with a # sign, for comment, so it has no effect: PrivateTmp is undefined in this file.
New work became available just as I was preparing to update, so I downloaded a task and immediately suspended it. After the updates, and enough reboots to get my NVidia drivers functional again (it took three this time), I restarted BOINC and allowed the task to run.
Task 32736884
Our old enemy "INTERNAL ERROR: cannot create temporary directory!" is back. Time for a systemd over-ride file, and to go fishing for another task.
Edit - updated the file, as described in message 58312, and got task 32736938. That seems to be running OK, having passed the 10% danger point. Result will be in sometime after midnight. |
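For anyone re-applying the fix after a package update: an over-ride created with `sudo systemctl edit boinc-client.service` lives in its own file and survives updates. Assuming the fix from message 58312 is the PrivateTmp setting that the stock unit file leaves commented out, the over-ride only needs:

```ini
# /etc/systemd/system/boinc-client.service.d/override.conf
# (created by `systemctl edit`; systemd merges this over the packaged unit)
[Service]
PrivateTmp=true
```

Follow with `sudo systemctl daemon-reload` and a restart of boinc-client for it to take effect.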
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
I see your task completed normally with the PrivateTmp=true uncommented in the service file.
But is the repeating warning:
wandb: WARNING Path /var/lib/boinc-client/slots/11/.config/wandb/wandb/ wasn't writable, using system temp directory
a normal entry for those using the standard BOINC location installation? |
|
|
|
No, that's the first time I've seen that particular warning. The general structure is right for this machine, but it doesn't usually reach as high as 11 - GPUGrid normally gets slot 7. Whatever - there were some tasks left waiting after the updates and restarts.
I think this task must have run under a revised version of the app - the next stage in testing. The output is slightly different in other ways, and the task ran for a significantly shorter time than other recent tasks. My other machine, which hasn't been updated yet, got the same warnings in a task running at the same time. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Oh, I was not aware of this warning.
"/var/lib/boinc-client/slots/11/.config/wandb/wandb/" is the directory where the training logs are stored. Yes, it changed in the last batch because of a problem detected earlier, in which the logs were stored in a directory outside boinc-client.
I could actually change it to any other location. I just thought that any location inside "/var/lib/boinc-client/slots/11/" was fine.
Maybe it is just a warning because .config is a hidden directory. I will change it again anyway, so that the logs are stored in "/var/lib/boinc-client/slots/11/" directly. The next batches will still contain the warning, but it will disappear in the next experiment.
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Yes, this experiment uses a slightly modified version of the algorithm, which should be faster. It runs the same number of interactions with the reinforcement learning environment, so the credit amount is the same.
____________
|
|
|
|
I'll take a look at the contents of the slot directory, next time I see a task running. You're right - the entire '/var/lib/boinc-client/slots/n/...' structure should be writable, to any depth, by any program running under the boinc user account.
How is the '.config/wandb/wandb/' component of the path created? The doubled '/wandb' looks unusual. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
The directory paths are defined as environment variables in the python script.
# Set wandb paths
os.environ["WANDB_CONFIG_DIR"] = os.getcwd()
os.environ["WANDB_DIR"] = os.path.join(os.getcwd(), ".config/wandb")
Then the directories are created by the wandb python package (which handles logging of relevant training data). I suspect the permissions are defined at creation time, so it is not a BOINC problem. I will change the paths in future jobs to:
# Set wandb paths
os.environ["WANDB_CONFIG_DIR"] = os.getcwd()
os.environ["WANDB_DIR"] = os.getcwd()
Note that "os.getcwd()" is the working directory, so "/var/lib/boinc-client/slots/11/" in this case
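The warning's behaviour can be mimicked with a small check (an approximation of what the wandb package appears to do, not its actual code): if the requested directory cannot be created or is not writable, fall back to the system temp directory:

```python
import os
import tempfile

def resolve_log_dir(preferred: str) -> str:
    """Use `preferred` if it exists (or can be created) and is writable,
    otherwise fall back to the system temp directory, as in the
    "wasn't writable, using system temp directory" warning."""
    try:
        os.makedirs(preferred, exist_ok=True)
    except OSError:
        return tempfile.gettempdir()
    if os.access(preferred, os.W_OK):
        return preferred
    return tempfile.gettempdir()

# with WANDB_DIR pointed at the working (slot) directory, as in the fix:
slot_dir = os.getcwd()
print(resolve_log_dir(slot_dir) == slot_dir)  # True when the slot dir is writable
```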
____________
|
|
|
|
Oh, I was not aware of this warning.
"/var/lib/boinc-client/slots/11/.config/wandb/wandb/" is the directory where the training logs are stored. Yes, it changed in the last batch because of a problem detected earlier, in which the logs were stored in a directory outside boinc-client.
I could actually change it to any other location. I just thought that any location inside "/var/lib/boinc-client/slots/11/" was fine.
Maybe it is just a warning because .config is a hidden directory. I will change it again anyway, so that the logs are stored in "/var/lib/boinc-client/slots/11/" directly. The next batches will still contain the warning, but it will disappear in the next experiment.
what happens if that directory doesn't exist? several of us run BOINC in a different location. since it's in /var/lib/ the process won't have permissions to create the directory, unless maybe if BOINC is run as root.
____________
|
|
|
|
'/var/lib/boinc-client/' is the default BOINC data directory for Ubuntu BOINC service (systemd) installations. It most certainly exists, and is writable, on my machine, which is where Keith first noticed the error message in the report of a successful run. During that run, much will have been written to .../slots/11
Since abouh is using code to retrieve the working (i.e. BOINC slot) directory, the correct value should be returned for non-default data locations - otherwise BOINC wouldn't be able to run at all. |
|
|
|
I'm aware it's the default location on YOUR computer, and others running the standard ubuntu repository installer. but the message from abouh sounded like this directory was hard coded since he put the entire path. and for folks running BOINC in another location, this directory will not be the same. if it uses a relative file path, then it's fine, but I was seeking clarification.
/var/lib/boinc-client/ does not exist on my system. /var/lib is write protected, creating a directory there requires elevated privileges, which I'm sure happens during install from the repository.
____________
|
|
|
|
Hard path coding was removed before this most recent test batch.
edit - see message 58292: "Should be easy to fix". |
|
|
Jim1348Send message
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0 Level
Scientific publications
|
/var/lib/boinc-client/ does not exist on my system. /var/lib is write protected, creating a directory there requires elevated privileges, which I'm sure happens during install from the repository.
Yes. I do the following dance whenever setting up BOINC from Ubuntu Software or LocutusOfBorg:
• Join the root group: sudo adduser (Username) root
• Join the BOINC group: sudo adduser (Username) boinc
• Allow access by all: sudo chmod -R 777 /etc/boinc-client
• Allow access by all: sudo chmod -R 777 /var/lib/boinc-client
I also do these to allow monitoring by BoincTasks over the LAN on my Win10 machine:
• Copy “cc_config.xml” to /etc/boinc-client folder
• Copy “gui_rpc_auth.cfg” to /etc/boinc-client folder
• Reboot
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
The directory should be created wherever you run BOINC, so that is not a problem.
It will be inside the /boinc-client directory, but it does not matter whether that directory is in /var/lib/ or somewhere else.
____________
|
|
|
|
I do the following dance whenever setting up BOINC from Ubuntu Software or LocutusOfBorg:
Join the root group: sudo adduser (Username) root
• Join the BOINC group: sudo adduser (Username) boinc
• Allow access by all: sudo chmod -R 777 /etc/boinc-client
• Allow access by all: sudo chmod -R 777 /var/lib/boinc-client By doing so, you nullify your system's security provided by different access rights levels.
This practice should be avoided at all costs. |
|
|
|
I see that the BOINC PPA distribution has been updated yet again - still v7.18.1, but a new build timed at 17:10 yesterday, 4 February.
Saw it when I was coaxing a new ACEMD3 task into life, so I won't know what it contains until tomorrow (unless I sacrifice my second machine, after lunch).
Could somebody please check whether the Ubuntu LTS repo has been updated as well? Maybe somebody read my complaint to boinc_alpha after all. Still no replies.
Edit - found the change log, but I'm none the wiser. |
|
|
Jim1348Send message
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0 Level
Scientific publications
|
I do the following dance whenever setting up BOINC from Ubuntu Software or LocutusOfBorg:
Join the root group: sudo adduser (Username) root
• Join the BOINC group: sudo adduser (Username) boinc
• Allow access by all: sudo chmod -R 777 /etc/boinc-client
• Allow access by all: sudo chmod -R 777 /var/lib/boinc-client By doing so, you nullify your system's security provided by different access rights levels.
This practice should be avoided at all costs.
I am on an isolated network behind a firewall/router. No problem at all. |
|
|
|
I am on an isolated network behind a firewall/router. No problem at all. That qualifies as famous last words.
|
|
|
Jim1348Send message
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0 Level
Scientific publications
|
I see that the BOINC PPA distribution has been updated yet again - still v7.18.1, but a new build timed at 17:10 yesterday, 4 February.
All I know is that the new build does not work at all on Cosmology with VirtualBox 6.1.32. A work unit just suspends immediately on startup.
|
|
|
Jim1348Send message
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0 Level
Scientific publications
|
I am on an isolated network behind a firewall/router. No problem at all. That qualifies as famous last words.
It has lasted for many years.
EDIT: They are all dedicated crunching machines. I have only BOINC and Folding on them. If they are a problem, I should pull out now. |
|
|
|
I see that the BOINC PPA distribution has been updated yet again - still v7.18.1, but a new build timed at 17:10 yesterday, 4 February.
Could somebody please check whether the Ubuntu LTS repo has been updated as well? Maybe somebody read my complaint to boinc_alpha after all. Still no replies. My Ubuntu 20.04.3 machines updated themselves this morning to v7.18.1 for the 3rd time.
(available version: 7.18.1+dfsg+202202041710~ubuntu20.04.1)
|
|
|
|
I am on an isolated network behind a firewall/router. No problem at all. That qualifies as famous last words.
It has lasted for many years.
EDIT: They are all dedicated crunching machines. I have only BOINC and Folding on them. If they are a problem, I should pull out now. In your scenario, it's not a problem.
It's dangerous to suggest that lazy solution to everyone, as their computers could be in a very different scenario.
https://pimylifeup.com/chmod-777/ |
|
|
Jim1348Send message
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0 Level
Scientific publications
|
I am on an isolated network behind a firewall/router. No problem at all. That qualifies as famous last words.
It has lasted for many years.
EDIT: They are all dedicated crunching machines. I have only BOINC and Folding on them. If they are a problem, I should pull out now. In your scenario, it's not a problem.
It's dangerous to suggest that lazy solution to everyone, as their computers could be in a very different scenario.
https://pimylifeup.com/chmod-777/
You might as well carry your warnings over to the Windows crowd, which doesn't have much security at all.
|
|
|
|
You might as well carry your warnings over to the Windows crowd, which doesn't have much security at all. Excuse me?
|
|
|
Jim1348Send message
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0 Level
Scientific publications
|
You might as well carry your warnings over to the Windows crowd, which doesn't have much security at all. Excuse me?
What comparable isolation do you get in Windows from one program to another?
Or what security are you talking about? Port security from external sources? |
|
|
|
You might as well carry your warnings over to the Windows crowd, which doesn't have much security at all. Excuse me?
What comparable isolation do you get in Windows from one program to another? Security descriptors were introduced in the NTFS 1.2 file system, released in 1996 with Windows NT 4.0. The access control lists in NTFS are more complex in some aspects than those in Linux. All modern Windows versions use NTFS by default.
User Account Control was introduced in 2007 with Windows Vista (apps don't run as administrator, even if the user has administrative privileges, until the user elevates them through an annoying popup).
Or what security are you talking about? Port security from external sources? The Windows firewall was introduced with Windows XP SP2 in 2004.
This is my last post in this thread about (undermining) filesystem security. |
|
|
|
I see that the BOINC PPA distribution has been updated yet again - still v7.18.1, but a new build timed at 17:10 yesterday, 4 February.
Could somebody please check whether the Ubuntu LTS repo has been updated as well? Maybe somebody read my complaint to boinc_alpha after all. Still no replies.
My Ubuntu 20.04.3 machines updated themselves this morning to v7.18.1 for the 3rd time.
(available version: 7.18.1+dfsg+202202041710~ubuntu20.04.1)
Updated my second machine. It appears that this re-release is NOT related to the systemd problem: the PrivateTmp=true line is still commented out.
Re-apply the fix (#1) from message 58312 after applying this update, if you wish to continue running the Python test apps. |
|
|
Jim1348Send message
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0 Level
Scientific publications
|
I think you are correct, except in the term "undermining", which is not appropriate for isolated crunching machines. There is a billion-dollar AV industry for Windows. Apparently someone has figured out how to undermine it there. But I agree that no more posts are necessary.
EDIT: I probably should have said that it was only for isolated crunching machines at the outset. If I were running a server, I would do it differently. |
|
|
|
While chmod 777-ing in general is a bad practice, there’s little harm in blowing up the BOINC directory like that. The worst that can happen is that you modify or delete a necessary file by accident and break BOINC. Just reinstall and learn the lesson. Not the end of the world in this instance.
____________
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
I see that the BOINC PPA distribution has been updated yet again - still v7.18.1, but a new build timed at 17:10 yesterday, 4 February.
Saw it when I was coaxing a new ACEMD3 task into life, so I won't know what it contains until tomorrow (unless I sacrifice my second machine, after lunch).
Could somebody please check whether the Ubuntu LTS repo has been updated as well? Maybe somebody read my complaint to boinc_alpha after all. Still no replies.
Edit - found the change log, but I'm none the wiser.
Ubuntu 20.04.3 LTS is still on the older 7.16.6 version.
apt list boinc-client
Listing... Done
boinc-client/focal 7.16.6+dfsg-1 amd64
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
I see that the BOINC PPA distribution has been updated yet again - still v7.18.1, but a new build timed at 17:10 yesterday, 4 February.
Could somebody please check whether the Ubuntu LTS repo has been updated as well? Maybe somebody read my complaint to boinc_alpha after all. Still no replies. My Ubuntu 20.04.3 machines updated themselves this morning to v7.18.1 for the 3rd time.
(available version: 7.18.1+dfsg+202202041710~ubuntu20.04.1)
Curious how your Ubuntu release got this newer version. I did a sudo apt update and apt list boinc-client and apt show boinc-client and still come up with older 7.16.6 version. |
|
|
|
I think they use a different PPA, not the standard Ubuntu version.
____________
|
|
|
|
My Ubuntu 20.04.3 machines updated themselves this morning to v7.18.1 for the 3rd time.
(available version: 7.18.1+dfsg+202202041710~ubuntu20.04.1)
Curious how your Ubuntu release got this newer version. I did a sudo apt update and apt list boinc-client and apt show boinc-client and still come up with older 7.16.6 version. It's from http://ppa.launchpad.net/costamagnagianfranco/boinc/ubuntu
Sorry for the confusion.
|
|
|
|
I think they use a different PPA, not the standard Ubuntu version.
You're right. I've checked, and this is my complete repository listing.
There are new pending updates for the BOINC package, but I've recently caught a new ACEMD3 ADRIA task, and I'm not updating until it is finished and reported.
My experience warns that these tasks are highly prone to fail if something is changed while they are processing. |
|
|
|
Which distro/repository are you using?
I'm using the regular repository for Ubuntu 20.04.3 LTS
I took screenshot of offered updates before updating.
Ah. Your reply here gave me a different impression. Slight egg on face, but both our Linux update manager screenshots fail to give source information in their consolidated update lists. Maybe we should put in a feature request? |
|
|
|
ACEMD3 task finished on my original machine, so I updated BOINC from PPA 2022-01-30 to 2022-02-04.
I can confirm that if you used systemctl/edit to create a separate over-ride file, it remains in place - no need to re-edit every time. If you used a text editor to edit the raw systemd file in place, of course, it'll get over-written and will need editing again.
(final proof-of-the-pudding of that last statement awaits the release of the next test batch) |
|
|
|
Got a new task (task 32738148). Running normally, confirms override to systemd is preserved.
Getting entries in stderr as before:
wandb: WARNING Path /var/lib/boinc-client/slots/7/.config/wandb/wandb/ wasn't writable, using system temp directory
(we're back in slot 7 as usual)
There are six folders created in slot 7:
agent_demos
gpugridpy
int_demos
monitor_logs
python_dependencies
ROMS
There are no hidden folders, and certainly no .config
wandb data is in:
/tmp/systemd-private-f670b90d460b4095a25c37b7348c6b93-boinc-client.service-7Jvpgh/tmp
There are 138 folders in there, including one called simply wandb
wandb contains:
debug-internal.log
debug.log
latest-run
run-20220206_163543-1wmmcgi5
The first two are files, the last two are folders. There is no subfolder called wandb - so no recursion, such as the warning message suggests. Hope that helps. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Thanks! The content of the slot directory is correct.
The wandb directory will also be placed in the slot directory soon, in the next experiment. During the current experiment, which consists of multiple batches of tasks, the wandb directory will still be in /tmp, as a result of the warning.
That is not a problem per se, but I agree that it will be cleaner to place it in the slot directory, so that all BOINC files are there.
____________
|
|
|
|
wandb: Run data is saved locally in /var/lib/boinc-client/slots/7/wandb/run-20220209_082943-1pdoxrzo |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Great, thanks a lot for the confirmation. So now it seems the directory is the appropriate one.
____________
|
|
|
|
Pretty happy to see that my little Quadro K620s could actually handle one of the ABOU work units. Successfully ran one in under 31 hours. It didn't hit the memory too hard, which helps. The K620 has a DDR3 memory bus so the bandwidth is pretty limited.
http://www.gpugrid.net/result.php?resultid=32741283
Though, it did fail one of the Anaconda work units that went out. The error message doesn't mean much to me.
http://www.gpugrid.net/result.php?resultid=32741757
Traceback (most recent call last):
File "run.py", line 40, in <module>
assert os.path.exists('output.coor')
AssertionError
11:22:33 (1966061): ./gpugridpy/bin/python exited; CPU time 0.295254
11:22:33 (1966061): app exit status: 0x1
11:22:33 (1966061): called boinc_finish(195)
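That traceback is less scary than it looks: run.py simply asserts that the simulation wrote its output file, and when the run dies earlier, the file never appears, so the task ends with a bare AssertionError. A hedged sketch of the same check with a more helpful message (not the project's actual run.py):

```python
import os
import sys

def check_output(path: str = "output.coor") -> None:
    """Equivalent of `assert os.path.exists('output.coor')`, but the
    failure message points the reader at the real cause upstream."""
    if not os.path.exists(path):
        sys.exit(f"ERROR: {path} was not produced; the simulation likely "
                 "failed earlier (see preceding log lines for the cause)")

# succeeds silently for a path that exists on any Unix system:
check_output(os.devnull)
```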
|
|
|
|
All tasks end in errors on this machine: https://www.gpugrid.net/results.php?hostid=591484
Note that the machine does not have a GPU usable by BOINC.
Thanks for your help. |
|
|
|
I got two of those yesterday as well. They are described as "Anaconda Python 3 Environment v4.01 (mt)" - declared to run as multi-threaded CPU tasks. I do have working GPUs (on host 508381), but I don't think these tasks actually need a GPU.
The task names refer to a different experimenter (RAIMIS) from the ones we've been discussing recently in this thread. |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
We were running those kind of tasks a year ago. Looks like the researcher has made an appearance again. |
|
|
Jim1348Send message
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0 Level
Scientific publications
|
I just downloaded one, but it errored out before I could even catch it starting. It ran for 3 seconds, required four cores of a Ryzen 3950X on Ubuntu 20.04.3, and had an estimated time of 2 days. I think they have some work to do.
http://www.gpugrid.net/result.php?resultid=32742752
PS - It probably does not help that that machine is running BOINC 7.18.1. I have had problems with it before. I will try 7.16.6 later. |
|
|
Jim1348Send message
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0 Level
Scientific publications
|
PPS - It ran for two minutes on an equivalent Ryzen 3950X running BOINC 7.16.6, and then errored out. |
|
|
DragoSend message
Joined: 3 May 20 Posts: 18 Credit: 831,594,060 RAC: 3,662,077 Level
Scientific publications
|
I just ran 4 of the Python CPU task WUs on my Ryzen 7 5800H, Ubuntu 20.04.3 LTS, 16 GB RAM. Each was run on 4 CPU threads at the same time. The first 0.6% took over 10 minutes, then they jumped to 10%, continued a while longer until 17 minutes were over, and then errored out, all at more or less the same moment in the task. Here is one example: 32743954 |
|
|
|
A RAIMIS MT task - which accounts for the 4 threads.
And yet -
Run
CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
NVIDIA GeForce RTX 3060 Laptop GPU (4095MB)
Traceback (most recent call last):
File "/var/lib/boinc-client/slots/5/run.py", line 50, in <module>
assert os.path.exists('output.coor')
AssertionError |
|
|
Jim1348Send message
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0 Level
Scientific publications
|
I am running two of the Anacondas now. They each reserve four threads, but are apparently only using one of them, since BoincTasks shows 25% CPU usage.
They have been running for two hours, and should complete in 14 hours total, though the estimates are way off and show 12 days. Therefore, they are running high priority even though they should complete with no problem. |
|
|
DragoSend message
Joined: 3 May 20 Posts: 18 Credit: 831,594,060 RAC: 3,662,077 Level
Scientific publications
|
Hey Richard. To what extent is my GPU's memory involved in a CPU task? |
|
|
|
Hey Richard. To what extent is my GPU's memory involved in a CPU task?
It shouldn't be - that's why I drew attention to it. I think both AbouH and RAIMIS are experimenting with different applications, which exploit both GPUs and multiple CPUs.
It isn't at all obvious how best to manage a combination like that under BOINC - the BOINC developers only got as far as thinking about either/or, not both together.
So far, Abou seems to have got further down the road, but I'm not sure how much further development is required. We watch and wait, and help where we can. |
|
|
Jim1348Send message
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0 Level
Scientific publications
|
My first two Anacondas ended OK after 31 hours. But they were _2 and _3 (earlier copies had already failed on other hosts).
I am not sure what those error messages mean. Some tasks ended after a couple of minutes, while others went longer.
http://www.gpugrid.net/results.php?hostid=593715
I am running a _4 now. After 18 minutes it is OK, but the CPU usage is still trending down to a single core after starting out high. |
|
|
Jim1348Send message
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0 Level
Scientific publications
|
I am running a _4 now. After 18 minutes it is OK, but the CPU usage is still trending down to a single core after starting out high.
It stopped making progress after running for a day and reaching 26% complete, so I aborted it. I will wait until they fix things before jumping in again. But my results were different than the others, so maybe it will do them some good. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Hello everyone! I am sorry for the late reply.
Now that most of my jobs seem to complete successfully, we decided to remove the "beta" flag from the app. I would like to thank you all for your help over the past months in reaching this point. Obviously I will try to solve any further problems detected. In the future we will try to extend the app to Windows, but we are not there yet.
Regarding the app requirements, from now on they will be similar to those of my last batches. In reinforcement learning there is generally no way around mixed CPU/GPU usage: most reinforcement learning environments run on the CPU, while the machine learning algorithms that teach agents to solve those environments use the GPU.
RAIMIS was experimenting with a different application. The idea is that another beta app will be created for that purpose.
____________
|
|
|
|
Is this a record?
Initial runtime estimate for:
e1a4-ABOU_pythonGPU_beta2_test-0-1-RND8371_5
Python apps for GPU hosts beta v1.00 (cuda1131) for Windows
Task 32766826
Time to lie back and enjoy the popcorn for ... 11½ years ??!!
Edit - 36 minutes to download 2.52 GB, less than a minute to crash. Ah well, back to the drawing board.
08/03/2022 17:57:22 | GPUGRID | Started download of windows_x86_64__cuda1131.tar.bz2.e9a2e4346c92bfb71fae05c8321e9325
08/03/2022 18:35:03 | GPUGRID | Finished download of windows_x86_64__cuda1131.tar.bz2.e9a2e4346c92bfb71fae05c8321e9325
08/03/2022 18:35:26 | GPUGRID | Starting task e1a4-ABOU_pythonGPU_beta2_test-0-1-RND8371_5
08/03/2022 18:36:21 | GPUGRID | [sched_op] Reason: Unrecoverable error for task e1a4-ABOU_pythonGPU_beta2_test-0-1-RND8371_5
08/03/2022 18:36:21 | GPUGRID | Computation for task e1a4-ABOU_pythonGPU_beta2_test-0-1-RND8371_5 finished
Edit 2 - "application C:\Windows\System32\tar.exe missing". I can deal with that.
Download from https://sourceforge.net/projects/gnuwin32/files/tar/
NO - that wasn't what it said it was. Looking again. |
|
|
|
No, this isn't working. Apparently, tar.exe is included in Windows 10 - but I'm still running Windows 7/64, and a copy from a W10 machine won't run ("Not a valid Win32 application"). Giving up for tonight - I've got too much waiting to juggle. I'll try again with a clearer head tomorrow. |
|
|
mmonninSend message
Joined: 2 Jul 16 Posts: 337 Credit: 7,617,724,223 RAC: 11,001,670 Level
Scientific publications
|
Yeah, the estimates must be astronomical, as I am at over 2 months "time left" at 3/4 completion on 2 tasks.
11:37 hr:min 79.3% 61d2h
10:04 hr:min 73.9% 77d2h
At 74.8% the 2nd task dropped to 74d10h. Around 215d initial ETA? |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
No need to go back to the drawing board, in principle. Here is what is happening:
1. The PythonGPU app should be stable now and only available for Linux (like until now). Jobs are being sent there and should work normally.
2. A new app, called PythonGPUbeta, has been deployed for both Linux and Windows. The idea is now to test the python jobs for Windows; that should be the source of bugs to solve from here on. Ultimately the idea is to have a common PythonGPU app for both OSes.
3. While PythonGPUbeta accepts Linux and Windows, I expect most errors to come from the Windows part.
Please let me know if any of the above is not correct.
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
In this new version of the app, we send the whole conda environment in a compressed file ONLY ONCE, and unpack it on the machine. The conda environment is what weighs around 2.5 GB (depending on whether the machine has cuda10 or cuda11). As long as the environment remains the same, there will be no need to re-download it for every job. This is how the acemd app works.
We are testing which compression format is best for our purpose. We tested first with a tar.bz2 file. On Linux there was no problem decompressing it.
For Windows, I tested locally on a Windows 10 laptop. I could decompress it successfully with tar.exe.
I am not sure what is happening with the estimates, but they are obviously wrong. The test jobs should download the conda environment only in the first job, decompress it, and finally run a short python program using CPU and GPU. Are the Linux estimates also so exaggerated?
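The download-once logic described above could be sketched roughly like this (illustrative only: the archive name is from the post, but the unpack directory, marker file and hashing scheme are assumptions, not the project's actual code):

```python
import hashlib
import tarfile
from pathlib import Path

ENV_ARCHIVE = Path("windows_x86_64__cuda1131.tar.bz2")  # name from the post
ENV_DIR = Path("conda_env")        # hypothetical unpack location
STAMP = ENV_DIR / ".archive_md5"   # hypothetical marker file

def archive_md5(path: Path) -> str:
    """Hash the archive so a changed environment triggers a fresh unpack."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def ensure_env() -> bool:
    """Unpack the conda environment only if it is missing or has changed.

    Returns True if an unpack was actually performed."""
    digest = archive_md5(ENV_ARCHIVE)
    if STAMP.exists() and STAMP.read_text() == digest:
        return False               # same environment: skip re-extraction
    ENV_DIR.mkdir(exist_ok=True)
    with tarfile.open(ENV_ARCHIVE, "r:*") as tar:  # "r:*" auto-detects bz2/gz
        tar.extractall(ENV_DIR)
    STAMP.write_text(digest)
    return True
```

With a scheme like this, only the first job of a batch pays the 2.5 GB download and unpack cost; later jobs see a matching marker and start immediately.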
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
One problem we are facing, as Richard mentioned, is that before W10 there is no tar.exe.
Also, I have seen some jobs with the following error:
tar.exe: Error opening archive: Can't initialize filter; unable to run program "bzip2 -d"
In theory tar.exe is able to handle bzip2 files. We suspect it could be a problem with the PATH env variable (which we will test). Also, tar.gz could be a more compatible format for Windows.
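If a Python interpreter is available at that stage, one format-agnostic option is to sniff the compression from the file's magic bytes rather than trusting the extension or an external bzip2 binary (a sketch, not the app's actual code; Python's own tarfile opened with mode "r:*" auto-detects both formats anyway):

```python
def sniff_compression(path):
    """Identify an archive's compression from its first bytes, rather than
    trusting the file extension (gzip: 1f 8b, bzip2: 'BZh')."""
    with open(path, "rb") as f:
        head = f.read(3)
    if head[:2] == b"\x1f\x8b":
        return "gzip"
    if head == b"BZh":
        return "bzip2"
    return "unknown"
```

This sidesteps the "unable to run program bzip2 -d" failure entirely, since no external filter program is involved.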
____________
|
|
|
|
Don't worry, it's only my own personal drawing board that I'm going back to!
Microsoft has form in this area. I remember buying a commercial copy of WinZip for use with Windows 3 - it arrived by post, on a single floppy disk. Later, they bought the company and incorporated it into Windows. Microsoft tend to do this very late in the day - hence my problems yesterday. I'll have a proper look round later today, and see if I can find a version which handles the bzip2 problem too. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Thank you very much! I will send a small batch of test jobs as soon as I can, to check whether the bzip2 error on Windows 10 is caused by an erroneous PATH variable. The next step will be trying tar.gz, as mentioned.
____________
|
|
|
mmonninSend message
Joined: 2 Jul 16 Posts: 337 Credit: 7,617,724,223 RAC: 11,001,670 Level
Scientific publications
|
How about some checkpointing? I had a python task that was nearly complete; an ACEMD4 task downloaded next, with something like an 8-billion-day ETA, and it interrupted the python task. 14 hours of work, and it went back to 10%. I only have a 0.05-day work queue on that client, so the python app was at least 95% complete. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Was it a PythonGPU task for Linux, mmonnin? I have checked your recent jobs; they seemed to be successful.
PythonGPU task checkpointing was working before. It was discussed previously in the forum; I tested it locally back then and it worked fine. Has checkpointing failed for anyone else? Please let me know if so.
I have sent a small batch of tasks for PythonGPUbeta, to test whether some errors on Windows are now solved. I will keep iterating in small batches on the beta app.
____________
|
|
|
|
I have a python task for Linux running, recently started.
It's reported that it's checkpointing properly:
CPU time 00:33:10
CPU time since checkpoint 00:01:33
Elapsed time 00:33:27
but that isn't the acid test: the question is whether it can read back the checkpoint data files when restarted.
I'll pause it after a checkpoint, let the machine finish the last 20 minutes of the task it booted aside, and see what happens on restart. Sometimes BOINC takes a little while to update progress after a pause - you have to watch it, not just take the first figure you see.
Results will be reported in task 32773760 overnight, but I'll post here before that.
Edit - looks good so far: restart.chk, progress, run.log all present with a good timestamp. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Perfect, thanks! It can happen that progress takes a little while to update after a pause.
The PythonGPU tasks' progress is defined by a target number of interactions between the AI agent and the environment in which it is trained, generally 25M interactions per job. I generate checkpoints regularly and create a progress file that tracks how many of these interactions have already been executed.
After resuming, the script looks for these progress and checkpoint files and continues counting from there.
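The progress/checkpoint scheme described above might look roughly like this (the "progress" file name and the 25M target come from the thread; the code itself is an illustrative sketch, not the actual task script):

```python
from pathlib import Path

TARGET_INTERACTIONS = 25_000_000   # "generally 25M interactions per job"
PROGRESS_FILE = Path("progress")   # file seen in the slot dir; contents assumed

def save_progress(done: int) -> None:
    # Record how many agent-environment interactions are finished.
    PROGRESS_FILE.write_text(str(done))

def load_progress() -> int:
    # On restart, resume counting from the last checkpoint, else from zero.
    if PROGRESS_FILE.exists():
        return int(PROGRESS_FILE.read_text())
    return 0

def fraction_done() -> float:
    # BOINC-style progress: completed interactions over the fixed target.
    return min(load_progress() / TARGET_INTERACTIONS, 1.0)
```

The key property is that progress is derived from work actually completed, so a restart resumes at the last checkpoint instead of winding back.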
However, Richard, note that the result you linked is not PythonGPU but ACEMD 4. I am not sure how those do their checkpointing.
____________
|
|
|
|
However, Richard note that the result you linked is not PythonGPU but ACEMD 4. I am not sure how these do the checkpointing.
Well, it was the only one I had in a suitable state for testing.
And it's a good thing we checked. It appears that ACEMD4 in its current state (v1.03) does NOT handle checkpointing correctly. I suspended it manually at just after 10% complete: on restart, it wound back to 1% and started counting again from there. It's reached 2.980% as I type - four increments of 0.495.
The run.log file (which we don't normally get a chance to see) has the ominous line
# WARNING: removed an old file: output.xtc
after a second set of startup details. Perhaps you could pass a message to the appropriate team? |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
I will. Thanks a lot for the feedback.
____________
|
|
|
mmonninSend message
Joined: 2 Jul 16 Posts: 337 Credit: 7,617,724,223 RAC: 11,001,670 Level
Scientific publications
|
Perfect, thanks! It can happen that progress takes a little while to update after a pause.
The PythonGPU tasks' progress is defined by a target number of interactions between the AI agent and the environment in which it is trained, generally 25M interactions per job. I generate checkpoints regularly and create a progress file that tracks how many of these interactions have already been executed.
After resuming, the script looks for these progress and checkpoint files and continues counting from there.
However, Richard, note that the result you linked is not PythonGPU but ACEMD 4. I am not sure how those do their checkpointing.
Yes, it was Linux.
The % complete I saw was 100%, then a bit later 10%, per BOINCTasks.
Looking at the history on that PC, it finished in 14:14 run time, just 11 minutes after the ACEMD4 task, so it looks like it resumed properly. Thanks for checking. |
|
|
|
OK, back on topic. Another of my Windows 7 machines has been allocated a genuine ABOU_pythonGPU_beta2 task (task 32779476), and I was able to suspend it before it even tried to run. I've been able to copy all the downloaded files into a sandbox to play with.
The first task is:
<task>
<application>C:\Windows\System32\tar.exe</application>
<command_line>-xvf windows_x86_64__cuda1131.tar.gz</command_line>
<setenv>PATH=C:\Windows\system32;C:\Windows</setenv>
</task>
You don't need both a path statement and a hard-coded executable location, and the latter may fail on a machine with non-standard drive assignments.
It will certainly fail on this machine, because I still haven't been able to locate a viable tar.exe for Windows 7 (the Windows 10 executable won't run under Windows 7 - at least, I haven't found a way to make it run yet).
I (and many other volunteers here) do have a freeware application called 7-Zip, and I've seen a suggestion that this may be able to handle the required decompression. I'll test that offline first, and if it works, I'll try to modify the job.xml file to use that instead. That's not a complete solution, of course, but it might give a pointer to the way forward. |
|
|
|
OK, that works in principle. The 2.48 GB gz download decompresses to a single 4.91 GB tar file, and that in turn unpacks to 13,449 files in 632 folders. 7-Zip can handle both operations.
ToDo: go find the command line I saw yesterday for doing that in a script.
Check the disk usage limits to ensure all that can happen in the slot directory. |
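For the disk-usage check mentioned in the ToDo above, the tar index already records every member's unpacked size, so a pre-flight test is straightforward (a sketch using Python's standard library, not anything the app actually ships):

```python
import shutil
import tarfile

def unpacked_size(archive: str) -> int:
    """Sum the stored size of every member from the tar index."""
    with tarfile.open(archive, "r:*") as tar:
        return sum(m.size for m in tar.getmembers())

def room_to_unpack(archive: str, dest: str, margin: int = 1 << 30) -> bool:
    """True if dest's filesystem has space for the unpacked tree plus a
    safety margin (default 1 GiB) for temporary files."""
    return unpacked_size(archive) + margin <= shutil.disk_usage(dest).free
```

Against the figures above (4.91 GB tar plus a 2.48 GB download in a 10 GB slot allowance), a check like this would flag the problem before extraction starts rather than mid-unpack.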
|
|
|
And it's worth a try. I'm going to split that task into two:
<task>
<application>"C:\Program Files\7-Zip\7z"</application>
<command_line>x windows_x86_64__cuda1131.tar.gz</command_line>
<setenv>PATH=C:\Windows\system32;C:\Windows</setenv>
</task>
<task>
<application>"C:\Program Files\7-Zip\7z"</application>
<command_line>x windows_x86_64__cuda1131.tar</command_line>
<setenv>PATH=C:\Windows\system32;C:\Windows</setenv>
</task>
I could have piped them, but - baby steps!
I'm going to need to increase the disk allowance: 10 (decimal) GB isn't enough. |
|
|
mmonninSend message
Joined: 2 Jul 16 Posts: 337 Credit: 7,617,724,223 RAC: 11,001,670 Level
Scientific publications
|
I had a W10 PC without tar.exe. I noticed the error in a task and copied the exe to system32 folder.
This morning I noticed a task running for 6.5 hours with no progress, no CPU usage.
https://www.gpugrid.net/result.php?resultid=32778132 |
|
|
|
Damn. Where did that go wrong?
application C:\Windows\System32\tar.exe missing
Anyone else who wants to try this experiment can try https://www.7-zip.org/ - looks as if the license would even allow the project to distribute it.
Edit - I edited the job.xml file while the previous task was finishing, and then stopped BOINC to increase the disk limit. On restart, BOINC must have noticed that the file had changed, and it downloaded a fresh copy. Near miss. |
|
|
|
application "C:\Program Files\7-Zip\7z" missing
Make that "C:\Program Files\7-Zip\7z.exe"
Or maybe not.
application "C:\Program Files\7-Zip\7z.exe" missing
Isn't the damn wrapper clever enough to remove the quotes I put in there to protect the space in "Program Files"? |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Using tar.exe in W10 and W11 seems to work now.
However, it is true that:
a) Some machines do not have tar.exe. My initial idea was that older versions of Windows could download tar.exe, but it seems that does not work.
b) The C:\Windows\System32\tar.exe path is hardcoded. I understand that ideally we should add to PATH all possible paths where this executable could be found, right?
____________
|
|
|
|
On this particular Windows 7 machine, I have:
PATH=
C:\Windows\system32;
C:\Windows;
C:\Windows\System32\Wbem;
C:\Windows\System32\WindowsPowerShell\v1.0\;;
C:\Program Files\Process Lasso\;
- I've split that into separate lines for clarity, but it's one single environment variable that has been added to by various installers over the years.
For a native Windows system component, I wouldn't have thought a path was necessary at all - Windows should handle all that. That's what path variables are for. But maybe the wrapper app is so dumb that it just throws the exact string it parses from job.xml at a file_open function? I'll have a look at the code.
I've got two remaining thoughts left: try Program [space] Files without any quotes; or stick a copy of 7z.exe in Windows/system32 (although mine's a 64-bit version...), and call it explicitly from there. I don't think it'll have anywhere to hide from that... |
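The "search rather than hard-code" idea could be sketched like this - shutil.which walks a search path the same way the shell does (the tool names and the 7-Zip directory are just illustrative candidates, not project code):

```python
import os
import shutil

def find_tool(candidates, extra_dirs=()):
    """Return the first available extraction tool found on PATH, plus any
    extra likely install directories, instead of hard-coding one absolute
    path like C:\\Windows\\System32\\tar.exe."""
    search = os.pathsep.join([os.environ.get("PATH", ""), *extra_dirs])
    for name in candidates:
        hit = shutil.which(name, path=search)
        if hit:
            return hit
    return None

# e.g. find_tool(["tar", "7z"], extra_dirs=[r"C:\Program Files\7-Zip"])
```

A lookup like this degrades gracefully: hosts with a native tar.exe use it, hosts with only 7-Zip fall back to that, and hosts with neither can report a clear "no extractor found" error.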
|
|
|
Yay! That's what I wanted to see:
17:49:09 (21360): wrapper: running C:\Program Files\7-Zip\7z.exe (x windows_x86_64__cuda1131.tar.gz)
7-Zip [64] 15.14 : Copyright (c) 1999-2015 Igor Pavlov : 2015-12-31
Scanning the drive for archives:
1 file, 2666937516 bytes (2544 MiB)
Extracting archive: windows_x86_64__cuda1131.tar.gz
And I've got v1.04 in my sandbox... |
|
|
|
But not much more than that. After half an hour, it's got as far as:
Everything is Ok
Files: 13722
Size: 5270733721
Compressed: 5281648640
18:02:00 (21360): C:\Program Files\7-Zip\7z.exe exited; CPU time 6.567642
18:02:00 (21360): wrapper: running python.exe (run.py)
WARNING: The script shortuuid.exe is installed in 'D:\BOINCdata\slots\5\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
WARNING: The script normalizer.exe is installed in 'D:\BOINCdata\slots\5\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
WARNING: The scripts wandb.exe and wb.exe are installed in 'D:\BOINCdata\slots\5\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
ERROR: After October 2020 you may experience errors when installing or updating packages. This is because pip will change the way that it resolves dependency conflicts.
We recommend you use --use-feature=2020-resolver to test your packages with the new resolver before it becomes the default.
pytest 0.0.0 requires atomicwrites>=1.0, which is not installed.
pytest 0.0.0 requires attrs>=17.4.0, which is not installed.
pytest 0.0.0 requires iniconfig, which is not installed.
pytest 0.0.0 requires packaging, which is not installed.
pytest 0.0.0 requires py>=1.8.2, which is not installed.
pytest 0.0.0 requires toml, which is not installed.
aiohttp 3.7.4.post0 requires attrs>=17.3.0, which is not installed.
WARNING: The scripts pyrsa-decrypt.exe, pyrsa-encrypt.exe, pyrsa-keygen.exe, pyrsa-priv2pub.exe, pyrsa-sign.exe and pyrsa-verify.exe are installed in 'D:\BOINCdata\slots\5\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
WARNING: The script jsonschema.exe is installed in 'D:\BOINCdata\slots\5\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
WARNING: The script gpustat.exe is installed in 'D:\BOINCdata\slots\5\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
WARNING: The scripts ray-operator.exe, ray.exe, rllib.exe, serve.exe and tune.exe are installed in 'D:\BOINCdata\slots\5\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
ERROR: After October 2020 you may experience errors when installing or updating packages. This is because pip will change the way that it resolves dependency conflicts.
We recommend you use --use-feature=2020-resolver to test your packages with the new resolver before it becomes the default.
pytest 0.0.0 requires atomicwrites>=1.0, which is not installed.
pytest 0.0.0 requires iniconfig, which is not installed.
pytest 0.0.0 requires py>=1.8.2, which is not installed.
pytest 0.0.0 requires toml, which is not installed.
WARNING: The script f2py.exe is installed in 'D:\BOINCdata\slots\5\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
wandb: W&B API key is configured (use `wandb login --relogin` to force relogin)
wandb: Appending key for api.wandb.ai to your netrc file: D:\BOINCdata\slots\5/.netrc
wandb: Currently logged in as: rl-team-upf (use `wandb login --relogin` to force relogin)
wandb: Tracking run with wandb version 0.12.11
wandb: Run data is saved locally in D:\BOINCdata\slots\5\wandb\run-20220310_181709-mxbeog6d
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run MontezumaAgent_e1a12
wandb: View project at https://wandb.ai/rl-team-upf/MontezumaRevenge_rnd_ppo_cnn_nophi_baseline_beta
wandb: View run at https://wandb.ai/rl-team-upf/MontezumaRevenge_rnd_ppo_cnn_nophi_baseline_beta/runs/mxbeog6d
and doesn't seem to be getting any further. I'll see if it's moved on after dinner, and might abort it if it hasn't.
Task is 32782603 |
|
|
|
Then, lots of iterations of:
OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "D:\BOINCdata\slots\5\lib\site-packages\torch\lib\cudnn_cnn_train64_8.dll" or one of its dependencies.
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "D:\BOINCdata\slots\5\lib\multiprocessing\spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "D:\BOINCdata\slots\5\lib\multiprocessing\spawn.py", line 114, in _main
prepare(preparation_data)
File "D:\BOINCdata\slots\5\lib\multiprocessing\spawn.py", line 225, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "D:\BOINCdata\slots\5\lib\multiprocessing\spawn.py", line 277, in _fixup_main_from_path
run_name="__mp_main__")
File "D:\BOINCdata\slots\5\lib\runpy.py", line 263, in run_path
pkg_name=pkg_name, script_name=fname)
File "D:\BOINCdata\slots\5\lib\runpy.py", line 96, in _run_module_code
mod_name, mod_spec, pkg_name, script_name)
File "D:\BOINCdata\slots\5\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "D:\BOINCdata\slots\5\run.py", line 23, in <module>
import torch
File "D:\BOINCdata\slots\5\lib\site-packages\torch\__init__.py", line 126, in <module>
raise err
I've increased it ten-fold, but that requires a reboot - and the task didn't survive. Trying one last time, then it's 'No new Tasks' for tonight. |
|
|
|
BTW, yes - the wrapper really is that dumb.
https://github.com/BOINC/boinc/blob/master/samples/wrapper/wrapper.cpp#L727
It just plods along, from beginning to end, copying it byte by byte. The only thing it considers is which way the slashes are pointing. |
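For anyone who doesn't want to read the C++, the behaviour at that line amounts to roughly this (a loose Python paraphrase of the wrapper's copy loop, not a literal translation):

```python
def wrapper_copy(raw: str, windows: bool = True) -> str:
    """A rough paraphrase of what wrapper.cpp does with the <application>
    string: walk it character by character, normalising only slash
    direction. Quote characters pass through untouched, which is why a
    quoted path like "C:\\Program Files\\7-Zip\\7z.exe" is looked up
    quotes and all, and reported missing."""
    out = []
    for ch in raw:
        out.append(("\\" if windows else "/") if ch in "/\\" else ch)
    return "".join(out)
```

So there is no quote stripping, no PATH search, and no environment expansion - the string from job.xml is handed to the OS almost verbatim.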
|
|
|
I managed to complete 2 of these WUs successfully. They still need a lot of work: GPU usage is low, and they make the BOINC manager slow, sluggish and unresponsive.
https://www.gpugrid.net/result.php?resultid=32784274
https://www.gpugrid.net/result.php?resultid=32783598
They were a pain to finish!
And what for? Only 3,000 points for 882 days' worth of work per WU!
|
|
|
mmonninSend message
Joined: 2 Jul 16 Posts: 337 Credit: 7,617,724,223 RAC: 11,001,670 Level
Scientific publications
|
I had a W10 PC without tar.exe. I noticed the error in a task and copied the exe to system32 folder.
This morning I noticed a task running for 6.5 hours with no progress, no CPU usage.
https://www.gpugrid.net/result.php?resultid=32778132
I'm disabling python beta on this W10 PC - another 11+ hours gone:
https://www.gpugrid.net/result.php?resultid=32780319 |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Yes, I have seen this error on a few other machines that could unpack the file with tar.exe, so it is an issue in the python script. I will be looking into it today. It does not happen on Linux with the same code.
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Yes, regarding the workload: I have been testing the tasks with low GPU/CPU usage, since I was interested in checking whether the conda environment was successfully unpacked and the python script was able to complete a few iterations. The workload will be increased as soon as this part works, as will the points.
As for the completely wrong duration estimation, I will look into what can be done; I am not sure how BOINC estimates it. Could someone please confirm whether it is also wrong on Linux, or only on Windows?
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Could the astronomical time estimations simply be due to a wrong configuration of the rsc_fpops_est parameter?
____________
|
|
|
|
Yes, I have seen this error on a few other machines that could unpack the file with tar.exe, so it is an issue in the python script. I will be looking into it today. It does not happen on Linux with the same code.
I was a bit suspicious about the 'paging file too small' error - I didn't even think Windows applications could get information about the current setting. I'd suggest correlating the machines showing this error with their reported physical memory. Mine is 'only' 8 GB - small by modern standards.
It looks like there may be some useful clues in
https://discuss.pytorch.org/t/winerror-1455-the-paging-file-is-too-small-for-this-operation-to-complete/131233 |
|
|
|
Could the astronomical time estimations be simply due to a wrong configuration of the rsc_fpops_est parameter?
That's certainly a part of it, but it's a very long, complicated, and historical story. It will affect any and all platforms, not just Windows, and other data as well as rsc_fpops_est. And it's also related to historical decisions by both BOINC and GPUGrid.
I'll try and write up some bedtime reading for you, but don't waste time on it in the meantime - there won't be an easy 'magic bullet' to fix it. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Yes, I was looking at the same link. It seems related to limited memory. I might try running the suggested script before the job starts, which seems to mitigate the problem.
____________
|
|
|
|
Runtime estimation – and where it goes wrong
The estimates we see on our home computers are made up of three elements. They are:
The SIZE of a task – rsc_fpops_est
The SPEED of the device that’s calculating the result
One or more correction tweaks, designed to smooth off the rough edges.
The original system
In the early days, all BOINC projects ran on CPUs, and almost all the CPUs in use were single-core. The speed of that CPU was measured by a derivation of the Whetstone benchmark: this was originally designed to measure hardware speeds only, and deliberately excluded software optimisation. For scientific research, careful optimisation is a valid technique (provided it isn’t done at the expense of accuracy).
There was a general (but unspoken) assumption that projects would be running a single type of research task, using a single application. So the rough edges were smoothed by something called DCF (duration correction factor). That kept track of that single application, running on that single CPU, and gently adjusted it until the estimates were pretty good. It worked. The adjustments were calculated by, and stored on, the local computer.
The revised system
Starting in 2008, BOINC was adapted to support applications that ran on GPUs – GPUGrid and SETI@home first, others followed. There never was any attempt to benchmark GPUs, so the theoretical baseline speed of a GPU application was taken to be a figure derived from the hardware architecture, notably the number of shaders and the clock speed. This was known as “peak FLOPS”, or – to some of us – “marketing FLOPS”. No way has any programmer ever been able to write a scientific program which uses every clock cycle of every shader, with no overhead for synchronisation or data transfer. Whatever.
At the same time, projects kept their CPU apps running, and many developed multiple research streams using different apps. A single-valued DCF couldn’t smooth off all the different rough edges at the same time.
There’s nothing in principle to stop the BOINC client keeping track of multiple application+device combinations, and such a system was in fact developed by a volunteer. But it was rejected by David Anderson in Berkeley, who devised his own system of Runtime Estimation, keeping track of the necessary tweaks on the project server. This was intended to replace client-based DCFs entirely, although the old system was retained for historical compatibility.
The implications for GPUGrid
As I think we all know, GPUGrid uses rsc_fpops_est, but I don’t think it’s realised quite how fundamental it is to the whole inverted pyramid. If tasks run much faster than their declared fpops, the only conclusion that BOINC can draw is the application speed has suddenly become much faster, and it tries to adapt accordingly.
GPUGrid has kept both of the adjustment methods active. If you look at any of our computer details, you will see that it contains a link to show application details: the smoothed average of all our successful tasks with each application. The critical one here is APR, or ‘average processing rate’. That’s the device+application speed, in GFlops. But on the computer details page, you’ll also see the DCF listed. Nominally, this should be 1, replaced by APR – but here, usually it isn’t.
The implications? APR works adequately for long term, steady, production work. But it fails during periods of rapid change and testing.
1) APR is disregarded entirely when a new application version is activated on the server. It starts again from scratch, and the initial estimates are – questionable. In fact, I don’t have a clue what speed is assumed for the first few tasks allocated.
2) It kicks in in two stages. First, when 100 tasks have been completed for the whole ensemble, and again when each individual computer reaches 11 completed tasks. Note that ‘completed’ here means a normal end-of-run plus a validated result. Some app versions never achieve that!
Different GPUs run at very different speeds, and the first 100 tasks returned normally come back from the fastest cards. That skews the average speed. In the worst case, the first hundred back can set a standard which lesser cards can’t attain – so they are stopped by ‘run time exceeded’, can never achieve the necessary 11 validations to set their own, lower, bar, and are excluded for good. The same can happen if deliberately short test tasks are put through early on, without an adjusted rsc_fpops_est: again, an unfeasibly fast target is set, and no-one can complete full-length tasks.
Sorry – I’ve been called out this afternoon, so I’ve dashed that off much quicker than I intended. I’ll leave it there for now, and we can all discuss the way forward later.
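The effect of a skewed speed figure on the estimate can be seen by plugging two hypothetical numbers into BOINC's core formula from the start of this write-up (the fpops value and the GFlops figures below are invented purely for illustration):

```python
def runtime_estimate(rsc_fpops_est: float, assumed_gflops: float) -> float:
    """BOINC's core estimate in seconds: task size over assumed device speed."""
    return rsc_fpops_est / (assumed_gflops * 1e9)

# The same 1e18-flop task size scored against the inflated 'peak FLOPS'
# of a fast card vs the measured APR of a much slower one:
fast = runtime_estimate(1e18, 5000)   # 200000 s, about 2.3 days
slow = runtime_estimate(1e18, 50)     # 2e7 s, about 231 days
```

Two orders of magnitude in the assumed speed means two orders of magnitude in the estimate, which is exactly the gap between a plausible runtime and the multi-month ETAs reported in this thread.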
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Thank you very much for the explanation, Richard - very helpful indeed.
I have been using short test tasks to catch bugs in the early stages of the job. That might have caused problems, although I guess we can adjust rsc_fpops_est and reset the statistics later. The idea is to have long-term, steady production work after the tests.
However, I don't fully understand how that could cause estimates of hundreds of days. In any case, the most reliable information for the host is then the progress percentage, which should be correct.
I remember the 'run time exceeded' error happening previously in the app, and we had to adjust the rsc_fpops_est parameter. Maybe a temporary solution for the time estimation would be to set rsc_fpops_est for the PythonGPUbeta app to the same value we have for the PythonGPU app? The idea is that PythonGPUbeta eventually becomes the sole Python app, running the same Linux jobs PythonGPU runs now, plus Windows jobs.
____________
|
|
|
|
Maybe a temporary solution for the time estimation would be to set rsc_fpops_est for the PythonGPUbeta app to the same value we have in the PythonGPU app? This approach is wrong.
The rsc_fpops_est should be set accordingly for the actual batch of workunits, not for the app.
As test batches are much shorter than production batches, they should have a much lower rsc_fpops_est value, even though the same app processes them. |
|
|
|
Maybe a temporary solution for the time estimation would be to set rsc_fpops_est for the PythonGPUbeta app to the same value we have in the PythonGPU app? This approach is wrong.
The rsc_fpops_est should be set accordingly for the actual batch of workunits, not for the app.
As test batches are much shorter than production batches, they should have a much lower rsc_fpops_est value, even though the same app processes them.
Correct.
Next time I see a really gross (multi-year) runtime estimate, I'll dig out the exact figures, show you the working-out, and try to analyse where they've come from.
In the meantime, we're working through a glut of ACEMD3 tasks, and here's how they arrive:
12/03/2022 08:23:29 | GPUGRID | [sched_op] NVIDIA GPU work request: 11906.64 seconds; 0.00 devices
12/03/2022 08:23:30 | GPUGRID | Scheduler request completed: got 2 new tasks
12/03/2022 08:23:30 | GPUGRID | [sched_op] estimated total NVIDIA GPU task duration: 306007 seconds
So, I'm asking for a few hours of work, and getting several days. Or so BOINC says.
This is Windows host 45218, which is currently showing "Task duration correction factor 13.714405". (It was higher a few minutes ago, when that work was fetched - over 13.84)
I forgot to mention yesterday that in the first phase of BOINC's life, both your server and our clients took account of DCF, so the 'request' and 'estimated' figures would have been much closer. But when the APR code was added in 2010, the DCF code was removed from the servers. So your server knows what my DCF is, but it doesn't use that information.
So the server probably assessed that each task would last about 11,055 seconds. That's why it added the second task to the allocation: it thought the first one didn't quite fill my request for 11,906 seconds.
In reality, this is a short-running batch - although not marked as such - and the last one finished in 4,289 seconds. That's why DCF is falling after every task, though slowly. |
|
|
mmonninSend message
Joined: 2 Jul 16 Posts: 337 Credit: 7,617,724,223 RAC: 11,001,670 Level
Scientific publications
|
Yes, I have seen this error in some other machines that could unpack the file with tar.exe. In just a few of them. So it is an issue in the python script. Today I will be looking into it. It does not happen in linux with the same code.
Having tar.exe wasn't enough. I later saw a popup in W10 saying archiveint.dll was missing.
I had two python tasks in linux error out in ~30min with
15:33:14 (26820): task /usr/bin/flock reached time limit 1800
application ./gpugridpy/bin/python missing
That PC has python 2.7.17 and 3.6.8 installed. |
|
|
|
Next time I see a really gross (multi-year) runtime estimate, I'll dig out the exact figures, show you the working-out, and try to analyse where they've come from.
Caught one!
Task e1a5-ABOU_pythonGPU_beta2_test16-0-1-RND7314_1
Host is 43404. Windows 7. It has two GPUs, and GPUGrid is set to run on the other one, not as shown. The important bits are
CUDA: NVIDIA GPU 0: NVIDIA GeForce GTX 1660 Ti (driver version 472.12, CUDA version 11.4, compute capability 7.5, 4096MB, 3032MB available, 5622 GFLOPS peak)
DCF is 8.882342, and the task shows up as:
Why? This is what I got from the server, in the sched_reply file:
<app_version>
<app_name>PythonGPUbeta</app_name>
<version_num>104</version_num>
...
<flops>47361236228.648697</flops>
...
<workunit>
<rsc_fpops_est>1000000000000000000.000000</rsc_fpops_est>
...
1,000,000,000,000,000,000 fpops, at 47 GFLOPS, would take 21,114,313 seconds, or 244 days. Multiply in the DCF, and you get the 2170 days shown.
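Working that out in code, with the values quoted from the sched_reply (the truncation to whole days mirrors the displayed estimate):

```python
# Reproduce the runtime estimate from the sched_reply values quoted above.
rsc_fpops_est = 1_000_000_000_000_000_000   # <rsc_fpops_est> from the workunit
flops = 47361236228.648697                  # <flops> from the app_version
dcf = 8.882342                              # this host's Duration Correction Factor

raw_seconds = rsc_fpops_est / flops         # server-side estimate, before DCF
raw_days = raw_seconds / 86400              # ~244 days
client_days = int(raw_days * dcf)           # client multiplies in DCF -> 2170 days
```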
According to the application details page, this host has completed one 'Python apps for GPU hosts beta 1.04 windows_x86_64 (cuda1131)' task (new apps always go right down to the bottom of that page). It recorded an APR of 1279539, which is bonkers the other way - these are GFlops, remember. It must have been task 32782603, which completed in 781 seconds.
So, lessons to be learned:
1) A shortened test task, described as running for the full-run number of fpops, will register an astronomical speed. If anyone completes 11 tasks like that, that speed will get locked into the system for that host, and will cause the 'runtime limit exceeded' error.
2) BOINC is extremely bad - stupidly bad - at generating a first guess for the speed of a 'new application, new host' combination. It's actually taken precisely one-tenth of the speed of the acemd3 application on this machine, which might be taken as a "safe working assumption" for the time being. I'll try to check that in the server code.
Oooh - I've let it run, and BOINC has remembered how I set up 7-Zip decompression last week. That's nice. |
|
|
|
But it hasn't remembered the increased disk limit. Never mind - nor did I. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Right now, the way the PythonGPU app works is by dividing the job into 2 subtasks:
1- first, installing conda and creating the conda environment.
2- second, running the python script.
The error
15:33:14 (26820): task /usr/bin/flock reached time limit 1800
application ./gpugridpy/bin/python missing
means that after 1800 seconds, the conda environment was not yet created for some reason. This could be because the conda dependencies could not be downloaded in time or because the machine was running the installation process more slowly than expected. We set this time limit of 30 mins because in theory it is plenty of time to create the environment.
However, in the new version (the current PythonGPUBeta), we send the whole conda environment compressed and simply unpack it on the machine. Therefore this error, which indeed still happens every now and then, should disappear.
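The 30-minute guard described above amounts to something like this sketch (not the actual wrapper code, which uses /usr/bin/flock; the path, limit, and polling interval are illustrative):

```python
# Hedged sketch of the environment-creation timeout: wait up to `limit`
# seconds for the conda env's python binary to appear, as a model of the
# check that otherwise ends with "application ./gpugridpy/bin/python missing".
import os
import time

def wait_for_env(python_path="./gpugridpy/bin/python", limit=1800, poll=5,
                 clock=time.monotonic, sleep=time.sleep):
    """Return True if the env's python appears before `limit` seconds elapse."""
    deadline = clock() + limit
    while clock() < deadline:
        if os.path.exists(python_path):
            return True
        sleep(poll)
    return False
```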
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
ok, so my plan was to run at least a few more batches of test jobs. Then start the real tasks.
I understand now that if some machines have by then run several test tasks, that will create an estimation problem. Does resetting the credit statistics help? Would it be better to create a new app for real jobs once the testing is finished, so that statistics are consistent and, in the long term, BOINC estimates the durations better?
____________
|
|
|
|
My gut feeling is that it would be better to deploy the finished app (after all testing seems to be complete) as a new app_version. We would have to go through the training process for APR one last time, but then it should settle down.
I've seen the reference to resetting the credit statistics before, but only some years ago in scanning the documentation. I've never actually seen the console screen you use to control a BOINC server, let alone operated one for real, so I don't know whether you can control the reset to a single app_version, or whether you have to nuke the entire project - best not to find out the hard way.
You're right, of course - the whole Runtime Estimation (APR) structure is intimately bound up with the CreditNew tools, also introduced in 2010. So the credit reset is likely to include an APR reset - but I'd hold that back for now.
I see you've started sending out v1.05 betas. One has arrived on one of my Linux machines, and again, the estimated speed is exactly one-tenth of the acemd3 speed - with extreme precision, to the last decimal place:
<flops>707593666701.291382</flops>
<flops>70759366670.129135</flops>
That must be deliberate. |
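A quick check of the one-tenth observation against the two quoted values:

```python
# The beta app's <flops> is the acemd3 <flops> divided by exactly ten -
# the decimal digits are simply shifted one place.
acemd3_flops = 707593666701.291382   # acemd3 app_version <flops>
beta_flops = 70759366670.129135      # v1.05 beta app_version <flops>
ratio = acemd3_flops / beta_flops    # should be 10 to within float rounding
```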
|
|
|
Would it be better to create a new app for real jobs once the testing is finished? Based on the last few days' discussion here, I've understood the purpose of the former short and long queue from GPUGrid's perspective:
By separating the tasks into two queues based on their length, the project's staff didn't have to bother setting the rsc_fpops_est value for each and every batch (note that the same app was assigned to both queues). The two queues used different (but constant across batches) rsc_fpops_est values, so BOINC's runtime estimation could not drift so far off in either queue that it would trigger the "won't finish on time" or the "run time exceeded" situation.
Perhaps this practice should be put into operation again, even at a finer level of granularity (S, M, L tasks, or even XS and XL tasks). |
|
|
|
I am getting "Disk usage limit exceeded" error.
https://www.gpugrid.net/result.php?resultid=32808038
I do have 400 Gigs reserved for BOINC.
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
I believe the "Disk usage limit exceeded" error is not related to the machine's resources; it is defined by an adjustable parameter of the app. The conda environment plus all the other files might be over this limit. I will review the current value; we might have to increase it. Thanks for pointing it out!
____________
|
|
|
|
After a day out running a long acemd3 task, there's good news and bad news.
The good news: runtime estimates have reached sanity. The magic numbers are now
<flops>336636264786015.625000</flops>
<rsc_fpops_est>1000000000000000000.000000</rsc_fpops_est>
That ends up with an estimated runtime of about 9 hours - but at the cost of a speed estimate of 336,636 GFlops. That's way beyond a marketing department's dream.
Either somebody has done open-heart surgery on the project's database (unlikely and unwise), or BOINC now has enough completed tasks for v1.05 to start taking notice of the reported values.
The bad news: I'm getting errors again.
ModuleNotFoundError: No module named 'gym' |
|
|
|
v1.06 is released and working (very short test tasks only).
Watch out for:
Another 2.46 GB download
Estimates are back up to multiple years |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
The latest version should fix this error.
ModuleNotFoundError: No module named 'gym'
____________
|
|
|
|
I have task 32836015 running - showing 50% after 30 minutes. That looks like it's giving the maths a good work-out.
Edit - actually, it's not doing much at all.
You should be on NVidia device 1 - but cool, low power, 0% usage. No checkpoint, nothing written to stderr.txt in an hour and a half. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
For now I am just trying to see the jobs finish... I am not even trying to make them run for a long time. Jobs should not even need checkpoints; they should last less than 15 mins.
So weird, some other jobs in Windows machines from the same batch managed to finish. For example those with result ids 32835825, 32836020 or 32835934.
I don't understand why it works in some Windows machines and fails in others. Sometimes without complaining about anything. And works fine locally in my Windows laptop.
Does Windows have trouble with multiprocessing? I need to add many more checkpoints in the scripts, I guess. Pretty much after every line of code...
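One known Windows pitfall worth ruling out here (a plausible lead, given the multiprocessing.spawn tracebacks seen later in the thread, not a confirmed diagnosis): Windows has no fork(), so Python's multiprocessing uses the "spawn" start method, in which every worker re-imports the main module. Code that starts processes outside the main guard can hang or spawn recursively on Windows while running fine on Linux. A minimal sketch, with illustrative names:

```python
# Hedged sketch: keep process creation behind the main guard so spawn-based
# workers (the Windows default) can re-import this module safely.
import multiprocessing as mp

def rollout_worker(x):
    # Hypothetical stand-in for per-environment work done in a child process.
    return x * x

def collect(n=4):
    # Forcing "spawn" reproduces Windows semantics on Linux too, which helps
    # catch "works on Linux, stuck on Windows" bugs early.
    ctx = mp.get_context("spawn")
    with ctx.Pool(2) as pool:
        return pool.map(rollout_worker, range(n))

if __name__ == "__main__":
    # Safe: spawned children re-import the module without re-entering collect().
    print(collect())
```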
____________
|
|
|
|
Err, this particular task is running on Linux - specifically, Mint v20.3
It ran the first short task OK at lunchtime - see Python apps for GPU hosts beta on host 508381. I think I'd better abort it while we think. |
|
|
kksplaceSend message
Joined: 4 Mar 18 Posts: 53 Credit: 2,591,271,749 RAC: 6,720,230 Level
Scientific publications
|
This task https://www.gpugrid.net/result.php?resultid=32841161 has been running for nearly 26 hours now. It is the first Python beta task I have received that appears to be working. Green-With-Envy shows intermittent low activity on my 1080 GPU and BoincTasks shows 100% CPU usage. It checkpointed only once several minutes after it started and has shown 50% complete ever since.
Should I let this task continue or abort it?
(Linux Mint, 1080 driver is 510.47.03) |
|
|
|
Sounds just like mine, including the 100% CPU usage - that'll be the wrapper app, rather than the main Python app.
One thing I didn't try, but only thought about afterwards, is to suspend the task for a moment and then allow it to run again. That has re-vitalised some apps at other projects, but is not guaranteed to improve things: it might even cause it to fail. But if it goes back to 0% or 50%, and doesn't move further, it's probably not going anywhere. I'd abort it at that point. |
|
|
kksplaceSend message
Joined: 4 Mar 18 Posts: 53 Credit: 2,591,271,749 RAC: 6,720,230 Level
Scientific publications
|
Well, after a suspend and allowing it to run, it went back to its checkpoint and has shown no progress since. I will abort it. Keep on learning.... |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
ok so it gets stuck at 50%. I will be reviewing it today. Thanks for the feedback.
It also seems to fail in most Windows cases without reporting any error.
____________
|
|
|
|
Got a new one - the other Linux machine, but very similar. Looks like you've put some debug text into stderr.txt:
12:28:16 (482274): wrapper (7.7.26016): starting
12:28:17 (482274): wrapper (7.7.26016): starting
12:28:17 (482274): wrapper: running /bin/tar (xf pythongpu_x86_64-pc-linux-gnu__cuda1131.tar.bz2)
12:31:39 (482274): /bin/tar exited; CPU time 192.149659
12:31:39 (482274): wrapper: running bin/python (run.py)
Starting!!
Finished imports!!
Sanity check, make sure that logging matches execution
Check if this is a restarted job
Define Train Vector of Envs
Define RL training algorithm
Look for available model checkpoint in log_dir - node failure case
Define RL Policy
Define rollouts storage
Define scheme
but nothing new has been added in the last five minutes. Showing 50% progress, no GPU activity. I'll give it another ten minutes or so, then try stop-start and abort if nothing new.
Edit - no, no progress. Same result on two further tasks. All the quoted lines are written within about 5 seconds, then nothing. I'll let the machine do something else while I go shopping...
Tasks for host 132158 |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Ok so I have seen 3 main errors in the last batches:
1. The one reported by Bedrich Hajek ("Disk usage limit exceeded"). We have now increased the amount of disk space allotted by BOINC to each task and I believe, based on the last batch I sent, that this error is gone now.
2. The "older" Windows machines do not have the tar.exe application and therefore cannot unpack the conda environment. I know Richard did some research into that, but had to download 7-Zip. Ideally I would like the app to be self-contained. Maybe we can send the 7-Zip program with the app; I will have to research whether that is possible.
3. The job getting stuck at 50%. I did add some debug messages in the last batches and I believe I know more or less where in the code the script gets stuck. I am still looking into it. I will also check recent results to see if there is any pattern to when this error happens. Note that there is no checkpoint because it is a short task that gets stuck, so since the training is not progressing, new checkpoints are not getting saved.
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
We have updated to a new app version for windows that solves the following error:
application C:\Windows\System32\tar.exe missing
Now we send the 7z.exe (576 KB) file with the app, which allows the other files to be unpacked without relying on the host machine having tar.exe (which is only in Windows 11 and the latest builds of Windows 10).
I just sent a small batch of short tasks this morning to test and so far it seems to work.
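The fallback could be sketched like this (the archive name and exact command lines are assumptions for illustration; note that 7-Zip needs two passes for a .tar.bz2, first decompressing, then untarring):

```python
# Hedged sketch of a tar.exe / bundled-7z.exe fallback for unpacking the
# compressed conda environment. Returns the command(s) a wrapper could run.
import shutil

def unpack_plan(archive="pythongpu_windows_x86_64.tar.bz2", which=shutil.which):
    """Prefer the host's tar; otherwise fall back to a bundled 7z.exe."""
    if which("tar"):
        return [["tar", "xf", archive]]
    tar_name = archive[:-len(".bz2")]
    return [["7z.exe", "x", archive],   # .tar.bz2 -> .tar
            ["7z.exe", "x", tar_name]]  # .tar -> files
```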
____________
|
|
|
|
Task 32868822 (Linux Mint GPU beta)
Still seems to be stalling at 50%, after "Define scheme". bin/python run.py is using 100% CPU, plus over 30 threads from multiprocessing.spawn with too little CPU usage to monitor (shows as 0.0%). No GPU app listed by nvidia-smi. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Do you know by chance if this same machine works fine with PythonGPU tasks even if it fails in the PythonGPUBeta ones?
____________
|
|
|
|
Yes, it does. Most recent was:
e1a5-ABOU_rnd_ppod_avoid_cnn13-0-1-RND6436_3
Three failed before me, but mine was OK.
Edit: In relation to that successful task, BOINC only returns the last 64 KB of stderr.txt - so that result starts in the middle of the file (that's the bit that's most likely to contain debug information after a crash). I'll try to capture the initial part of the file next time I run one of those tasks, for reference. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
I have also changed the approach a bit.
I have just sent a batch of short tasks much more similar to those in PythonGPU. If these work fine, I will slowly introduce changes to see what the problem was.
____________
|
|
|
|
I've grabbed one. Will run within the hour. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
I sent 2 batches,
ABOU_rnd_ppod_avoid_cnn_testing
and
ABOU_rnd_ppod_avoid_cnn_testing2
Unfortunately the first batch will crash. I detected one bug already, which I have fixed in the second one. Seems like you got at least one from the second batch (e1a18-ABOU_rnd_ppod_avoid_cnn_testing2). Running it will give us the info we need.
On the bright side, the fix with 7z.exe seems to work in all machines so far.
____________
|
|
|
|
Yes, I got the testing2. It's been running for about 23 minutes now, but I'm seeing the same as yesterday - nothing written to stderr.txt since:
09:29:18 (51456): wrapper (7.7.26016): starting
09:29:18 (51456): wrapper (7.7.26016): starting
09:29:18 (51456): wrapper: running /bin/tar (xf pythongpu_x86_64-pc-linux-gnu__cuda1131.tar.bz2)
09:32:39 (51456): /bin/tar exited; CPU time 192.380796
09:32:39 (51456): wrapper: running bin/python (run.py)
Starting!!
Finished imports!!
Define rollouts storage
Define scheme
and machine usage shows
(full-screen version of that at https://i.imgur.com/Ly9Aabd.png)
I've preserved the control information for that task, and I'll try to re-run it interactively in terminal later today - you can sometimes catch additional error messages that way. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Ok thanks a lot. Maybe then it is not the python script but some of the dependencies.
____________
|
|
|
|
OK, I've aborted that task to get my GPU back. I'll see what I can pick out of the preserved entrails, and let you know. |
|
|
|
Sorry, ebcak. I copied all the files, but when I came to work on them, several turned out to be BOINC softlinks back to the project directory, where the original file had been deleted. So the fine detail had gone.
Memo to self - don't try to operate dangerous machinery too early in the morning. |
|
|
mmonninSend message
Joined: 2 Jul 16 Posts: 337 Credit: 7,617,724,223 RAC: 11,001,670 Level
Scientific publications
|
The past several tasks have gotten stuck at 50% for me as well. Today one has made it past to 57.7% now in 8hours. 1-2% GPU util on 3070Ti. 2.5 CPU threads per BOINCTasks. 3063mb memory per nvidia-smi and 4.4GB per BOINCTasks. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
I updated the app. Tested it locally and works fine on Linux.
I sent a batch of test jobs (ABOU_rnd_ppod_avoid_cnn_testing3), which I have seen executed successfully in at least 1 Linux machine so far.
One way to check if the job is actually progressing is to look for a directory called "monitor_logs/train" in the BOINC directory where the job is being executed. If logs are being written to the files inside this folder, it means the task is progressing.
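A small helper along those lines could look like this (the path and freshness threshold are illustrative, not part of the app):

```python
# Hedged sketch: report whether any file under monitor_logs/train was written
# recently, as a proxy for "the task is still progressing".
import os
import time

def logs_active(log_dir="monitor_logs/train", within=600, now=time.time):
    """True if any file in log_dir was modified in the last `within` seconds."""
    try:
        names = os.listdir(log_dir)
    except FileNotFoundError:
        return False                       # task hasn't created the logs yet
    cutoff = now() - within
    return any(os.path.getmtime(os.path.join(log_dir, n)) >= cutoff
               for n in names)
```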
____________
|
|
|
|
Got a couple on one of my Windows 7 machines. The first - task 32875836 - completed successfully, the second is running now. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
nice to hear it! lets see what happens on linux.. so weird if it only works in some machines and gets stuck in others...
____________
|
|
|
|
nice to hear it! lets see what happens on linux.. so weird if it only works in some machines and gets stuck in others...
Worse is to follow, I'm afraid. Task 32875988 started immediately after the first one (same machine, but a different slot directory), but seems to have got stuck.
I now seem to have two separate slot directories:
Slot 0, where the original task ran. It has 31 items (3 folders, 28 files) at the top level, but the folder properties says the total (presumably expanding the site-packages) is 49 folders, 257 files, 3.62 GB
Slot 5, allocated to the new task. It has 93 items at the top level (12 folders, including monitor_logs, and the rest files). This one looks the same as the first one did, while it was actively running the first task. This one has 14 files in the train directory - I think the first only had 4. This slot also has a stderr file, which ends with multiple repetitions of
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "D:\BOINCdata\slots\5\lib\multiprocessing\spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "D:\BOINCdata\slots\5\lib\multiprocessing\spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
File "D:\BOINCdata\slots\5\lib\site-packages\pytorchrl\agent\env\__init__.py", line 1, in <module>
from pytorchrl.agent.env.vec_env import VecEnv
File "D:\BOINCdata\slots\5\lib\site-packages\pytorchrl\agent\env\vec_env.py", line 1, in <module>
import torch
File "D:\BOINCdata\slots\5\lib\site-packages\torch\__init__.py", line 126, in <module>
raise err
OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "D:\BOINCdata\slots\5\lib\site-packages\torch\lib\shm.dll" or one of its dependencies.
I'm going to try variations on a theme of
- clear the old slot manually
- pause and restart the task
- stop and restart BOINC
- stop and restart Windows
I'll report back what works and what doesn't. |
|
|
|
Well, that was interesting. The files in slot 0 couldn't be deleted - they were locked by a running app 'python' - which is presumably why BOINC hadn't cleaned the folder when the first task finished.
So I stopped the second task, and used Windows Task Manager to see what was running. Sure enough, there was still a Python image, and I still couldn't delete the old files. So I force-stopped that python image, and then I could - and did - delete them.
I restarted the second task, but nothing much happened. The wrapper app posted in stderr that it was restarting python, but nothing else.
So then I restarted BOINC, and all hell broke loose. In quick succession, I got
Then Windows crashed a browser tab and two Einstein@Home tasks on the other GPU.
When I'd closed the Python app from the Windows error box, the BOINC task closed cleanly, uploaded some files, and reported a successful finish. It even validated!
Things all seem to be running quietly now, so I think I'll leave this machine alone for a while and think. At the moment, the take-home theory is that the whole sequence was triggered by the failure of the python app to close at the end of the first task's run. That might be the next thing to look at. |
|
|
|
Well this beta WU was a weird one:
https://www.gpugrid.net/workunit.php?wuid=27211744
It ran to 50% completion and hung there for 3.5 days so I aborted it. Boinc properties showed it running slot 10 except slot 10 was empty. Top (Fedora35) showed no activity with any GPUGrid WU. Some wrapper or something must have been kept alive and running in the background when the WU quit because the ET counter was incrementing time normally. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Interesting that sometimes jobs work and sometimes get stuck in the same machine.
It also seems to me, based on your info, that something remains running at the end of the job and causes the next job to get stuck. Presumably some python thread.
I will see if I can add some code at the end of the task to make sure all python processes are killed and the main program exits correctly. And send another testing round.
Another observation is that this problem does not seem to be OS-dependent, since it happened to STARBASEn on a Linux machine and to Richard on Windows.
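A minimal sketch of that cleanup idea, using only the standard library (the project's actual fix may look different):

```python
# Hedged sketch: before the main program exits, terminate any surviving
# multiprocessing children so no python process outlives the job.
import atexit
import multiprocessing as mp

def reap_children(grace=5.0):
    """Terminate (then join) any live multiprocessing children; return names."""
    reaped = []
    for child in mp.active_children():
        child.terminate()
        child.join(grace)
        if child.is_alive():
            child.kill()       # escalate if terminate() was ignored
            child.join()
        reaped.append(child.name)
    return reaped

# Registering it means it also runs on normal interpreter shutdown.
atexit.register(reap_children)
```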
____________
|
|
|
|
I've just had task 32876361 fail on a different, but identical, Windows machine. This time, it seems to be explicitly, and simply, a "not enough memory" error - these machines only have 8 GB, which was fine when I bought them. I've suspended the beta programme for the time being, and I'll try to upgrade them. |
|
|
|
Another "Disk usage limit exceeded" error:
https://www.gpugrid.net/result.php?resultid=32876568
And a successful one yesterday:
https://www.gpugrid.net/result.php?resultid=32876288
|
|
|
roundup Send message
Joined: 11 May 10 Posts: 63 Credit: 9,096,655,193 RAC: 54,351,602 Level
Scientific publications
|
After having some errors with recent python app betas, task 32876819 ran without error on a RTX3070 Mobile under Win 11.
A few observations:
- GPU load only between 4% and 8% with a peak between 50% and 70% every 12 seconds.
- The indicated time remaining in the BOINC Client was way off. It started with >7000 (seven thousand) days.
- 15,000 BOINC credits for 102,296 sec runtime. I assume that will be corrected once the python app goes productive. EDIT: The runtime indicated on the GPUGrid site is not correct; it was actually less. |
|
|
|
These tasks seem to run much better on my machines if I allocate 6 CPUs (threads) to each task. I managed to run one by itself and watched the performance monitor for CPU usage. During the initiation phase (about 5 minutes), the task used ~6 CPUs (threads). After the initiation phase, the CPU usage oscillated between ~2 and ~5 threads. The task ran very quickly and has been validated. Please let me know if you have questions. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Thanks a lot for the feedback:
- Cyclical GPU load is expected in reinforcement learning algorithms. Whenever GPU load is lower, CPU usage should increase. This is correct.
- Incorrect time-remaining prediction is an issue... it will only be fixed with time, once the tasks become stable in duration... maybe it will even be necessary to create a new app and use this one only for debugging.
- Also, credits will be corrected, yes; for now we will have something similar to what we have in the PythonGPU app.
Starting today I will start sending longer jobs, instead of the super short test jobs I was using just to check that the code was working in all OSs and machines.
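The cyclical load pattern can be sketched as a toy loop (pure-Python stand-ins, no real GPU work; the phase names are illustrative):

```python
# Hedged toy model of why RL GPU load spikes periodically: each iteration is a
# CPU-bound rollout-collection phase followed by one batched (GPU-bound) update.
def train_iteration(step_env, update_policy, rollout_len=4):
    rollout = [step_env(t) for t in range(rollout_len)]  # CPU phase
    return update_policy(rollout)                        # GPU phase

phases = []
step = lambda t: phases.append("cpu") or t       # stand-in env step
update = lambda r: phases.append("gpu") or sum(r)  # stand-in GPU update

for _ in range(2):
    train_iteration(step, update)
# phases now records the repeating cycle: cpu,cpu,cpu,cpu,gpu,...
```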
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Last batches seem to be working successfully both in Linux and Windows, and also for GPUs with cuda 10 and cuda 11.
My main worry now is whether or not the problem of some jobs getting "stuck" and never being completed persists. It was reported that the reason was that Python was not finishing correctly between jobs, so I added a few changes in the code to try to solve this issue.
Please let me know if you detect this problem in one of your tasks, that would be very helpful!
Incidentally, once the PythonGPUBeta app is stable enough, it will replace the current PythonGPU app, which only works on Linux.
____________
|
|
|
|
It was reported that the reason was that Python was not finishing correctly between jobs, so I added a few changes in the code to try to solve this issue.
Well, that was one report of one task on one machine with limited memory. It seemed to be a case where, if it happened, it caused problems for the following task. It's certainly worth looking at, and if it prevents some tasks failing - great. But I'd be cautious about assuming that it was the problem in all cases. |
|
|
|
I will see if I can add some code at the end of the task to make sure all python processes are killed and the main program exits correctly. And send another testing round.
Another observation is that this problem does not seem to be OS-dependent, since it happened to STARBASEn on a Linux machine and to Richard on Windows.
I haven't gotten a new beta yet so I will shut off all GPU work with other projects to hopefully get some and help resolve this issue. |
|
|
|
One other afterthought re that WU. I had checked my status page here prior to aborting the task. It indicated the task was still in progress, so no disposition was assigned to the files that I presume were sent back sometime in the past (since the slot was empty). Wonder where they went? |
|
|
|
Can anybody explain credits policy please.
My CPUs have been running the Python app relentlessly for up to 7 days for only 50,000 credits, yet I received 360,000 credits for ACEMD 3 after only 42,000 secs (11.6 hrs). Bit skew-whiff... see below:
https://www.gpugrid.net/results.php?userid=562496
Task | Work unit | Computer | Sent | Time reported or deadline | Status | Run time (sec) | CPU time (sec) | Credit | Application
32877811 27214361 590351 1 Apr 2022 | 9:34:34 UTC 3 Apr 2022 | 9:57:48 UTC Completed and validated 309,332.50 309,332.50 50,000.00 Python apps for GPU hosts beta v1.10 (cuda1131)
32877804 27214354 581235 1 Apr 2022 | 9:38:33 UTC 3 Apr 2022 | 19:38:13 UTC Completed and validated 628,304.20 628,304.20 50,000.00 Python apps for GPU hosts beta v1.10 (cuda1131)
32876508 27207895 581235 29 Mar 2022 | 9:50:08 UTC 1 Apr 2022 | 4:52:45 UTC Completed and validated 101,951.50 100,984.90 360,000.00 ACEMD 3: molecular dynamics simulations for GPUs v2.19 (cuda1121)
32876455 27213533 581235 29 Mar 2022 | 9:17:17 UTC 29 Mar 2022 | 9:49:31 UTC Completed and validated 12,109.13 12,109.13 3,000.00 Python apps for GPU hosts beta v1.09 (cuda1131)
32876341 27213457 590351 29 Mar 2022 | 4:33:52 UTC 31 Mar 2022 | 6:41:54 UTC Completed and validated 42,830.17 41,435.17 360,000.00 ACEMD 3: molecular dynamics simulations for GPUs v2.19 (cuda1121)
32875459 27212897 581235 27 Mar 2022 | 2:32:46 UTC 29 Mar 2022 | 9:06:58 UTC Completed and validated 96,228.49 95,544.64 360,000.00 ACEMD 3: molecular dynamics simulations for GPUs v2.19 (cuda1121)
PS: How do I paste a neat image of the above?? |
|
|
|
Please note that other users can't see your entire task list by userid - that's a privacy policy common to all BOINC projects.
The ones you're worried about seem to be Results for host 581235
The one you're specifically asking about - the Python GPU beta v1.10 - was issued on Friday morning and returned on Sunday evening: it was only on your machine for about 58 hours. The run time of 628,304 seconds is misleading (a duplicate of the CPU time) and an error on this website.
Runtime and credit are still being adjusted, and errors are a common feature of beta testing. Sometimes you win, others (like this one) you lose. I'm sure your comments will be noted before testing is complete. |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
For some reason I haven't been able to snag any of the Python beta tasks lately.
Just the old stock Python tasks.
A couple of them failed at 30 minutes with the "no progress downloading the Python environment after 1800 seconds" error.
That's one of the reasons I would like to get the new beta tasks, which overcome that issue.
Also found a task at 5 hours and counting at 100% completion and not reporting. Suspended the task and resumed in the hope that would nudge it to report but it just restarted at 10% progress.
[Edit] Looks like the suspend/resume was the trick after all. Uploading now. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
The credits system is proportional to the amount of compute required to complete each task, like in acemd3.
In acemd3, it is proportional to the complexity of the simulation. In Python tasks, which train artificial intelligence reinforcement learning agents, it is proportional to the number of interactions between the agent and its simulated environment required for the agent to learn how to behave in it.
At the moment, we give 2,000 credits per 1M interactions, and most tasks require 25M training interactions (except test tasks, which are shorter, normally just 1M). Therefore, completing a task gives 50,000 credits, or 75,000 if completed especially fast.
Note that we are in beta phase, and while the credit difference between acemd and pythonGPU jobs should not be huge, we might need to adjust the credits given per 1M interactions to make them equivalent.
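As a rough sketch of that credit formula (the per-million rate and task sizes are the figures stated above; the 50% fast-completion bonus is inferred from the 75,000 figure, so treat it as an assumption):

```python
# Credit model as described: 2,000 credits per 1M agent-environment
# interactions, with a bonus for especially fast completion.
CREDITS_PER_MILLION = 2_000

def task_credits(interactions_millions, fast_completion=False):
    base = CREDITS_PER_MILLION * interactions_millions
    # 75,000 vs 50,000 implies a 1.5x bonus (inferred, not confirmed)
    return int(base * 1.5) if fast_completion else base

print(task_credits(25))        # standard 25M-interaction task -> 50000
print(task_credits(25, True))  # completed especially fast -> 75000
print(task_credits(1))         # short 1M test task -> 2000
```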
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Batches of both pythonGPU and pythonGPUBeta are being sent out this week. Hopefully the pythonGPUBeta tasks will run without issues.
We want to wait a bit more in case more bugs are detected, but we will soon update the pythonGPU app with the code from PythonGPUBeta, which seems to work well now. As mentioned, it does not have the problem of installing conda every time (it downloads the packed environment only the first time). It also works on both Linux and Windows.
At that point we will keep PythonGPUBeta only for testing.
____________
|
|
|
|
So far some run well, while others ran for 2 and 3 days.
I did abort the ones that were still running after 3 days.
I will pick back up in the Fall, and I hope to see good running tasks on my GPUs.
For now I am waiting for new 3 & 4 on two of my hosts; it is a real bummer that our hosts have to sit for days on end without getting any tasks. |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Looks like the standard BOINC mechanism: complain in a post on the forums about some topic, and the BOINC genies grant your wish.
Been getting nothing but solid Python beta tasks now for the past couple of days. |
|
|
WR-HW95Send message
Joined: 16 Dec 08 Posts: 7 Credit: 1,436,322,786 RAC: 338,160 Level
Scientific publications
|
I have serious problems with my other machine running a 1080 Ti.
So far, of 20 tasks over the past 2 weeks, the best one ran for around 38 secs before erroring.
I tried underpowering + underclocking core and mem; still the same result at around the same time.
This is the result of the last one:
"<core_client_version>7.16.20</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
10:11:26 (15136): wrapper (7.9.26016): starting
10:11:26 (15136): wrapper: running bin/acemd3.exe (--boinc --device 0)
10:11:29 (15136): bin/acemd3.exe exited; CPU time 0.000000
10:11:29 (15136): app exit status: 0xc0000135
10:11:29 (15136): called boinc_finish(195)"
Is there something wrong with newer NVIDIA drivers?
The only difference between the machine that works and the one that doesn't, besides the CPU (3900X vs 5900X), is the graphics driver version.
The machine that runs tasks has driver 496.49.
The machine that fails tasks has driver 511.79. |
|
|
|
I have serious problems with my other machine running a 1080 Ti.
So far, of 20 tasks over the past 2 weeks, the best one ran for around 38 secs before erroring.
I tried underpowering + underclocking core and mem; still the same result at around the same time.
This is the result of the last one:
"<core_client_version>7.16.20</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
10:11:26 (15136): wrapper (7.9.26016): starting
10:11:26 (15136): wrapper: running bin/acemd3.exe (--boinc --device 0)
10:11:29 (15136): bin/acemd3.exe exited; CPU time 0.000000
10:11:29 (15136): app exit status: 0xc0000135
10:11:29 (15136): called boinc_finish(195)"
Is there something wrong with newer NVIDIA drivers?
The only difference between the machine that works and the one that doesn't, besides the CPU (3900X vs 5900X), is the graphics driver version.
The machine that runs tasks has driver 496.49.
The machine that fails tasks has driver 511.79.
You can try changing the driver back and see; that's an easy troubleshooting step. It could definitely be the driver.
But you seem to be having an issue with the ACEMD3 tasks; this thread is about the Python tasks.
____________
|
|
|
WR-HW95Send message
Joined: 16 Dec 08 Posts: 7 Credit: 1,436,322,786 RAC: 338,160 Level
Scientific publications
|
Sorry for posting in the wrong thread.
I changed the drivers to 496.49 on the other machine too... now I just have to wait to get some work to see whether it works.
Personally, when new things were coming, I was really hoping that this project would ditch CUDA at last and move to OpenCL.
No project that I have crunched OpenCL on has had extended issues like this. And most of those projects run on AMD cards too. |
|
|
|
I've had no problems with their CUDA ACEMD3 app; it's been very stable across many data sets. All of the issues raised in this thread are in regard to the Python app that's still in testing/beta. Problems are to be expected.
CUDA outperforms OpenCL. Even with identical code (as much as it can be), there is always the added overhead of needing to compile the OpenCL code at runtime, whereas CUDA runs natively on Nvidia. Most projects run OpenCL because it lets them more easily port the code to different devices, expanding their user base at the expense of some performance overhead.
There have been many problems with the 500+ series drivers, though. If you still have issues with the older drivers, then something else is wrong with your setup. If you didn't totally purge the old drivers with DDU from Safe Mode and re-install from a fresh Nvidia package, that's a good first step. Sometimes driver corruption can linger across many driver removals and upgrades, and it needs to be more forcefully removed.
____________
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
bcavnaugh wrote:
... For now I am waiting for new 3 & 4 on two of my hosts; it is a real bummer that our hosts have to sit for days on end without getting any tasks.
you say it, indeed :-(
Obviously, ACEMD has very low priority at GPUGRID these days :-(
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Beta is still having issues with establishing the correct Python environment.
Threw away around 27 tasks today with errors because of:
TypeError: object of type 'int' has no len()
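For context, that TypeError is Python's generic complaint when len() is called on an integer; a minimal reproduction, unrelated to the task's actual code path (which we can't see), plus the usual defensive fix:

```python
# len() only works on sized objects (lists, strings, dicts, ...).
# Calling it on a bare int raises exactly the error reported above,
# e.g. when a config field expected to be a list arrives as a number.
value = 5  # hypothetical config field; something like [5] was expected
try:
    len(value)
except TypeError as err:
    print(err)  # object of type 'int' has no len()

# Defensive normalization: wrap scalars into one-element lists.
values = value if isinstance(value, list) else [value]
print(len(values))  # 1
```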
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Thanks, this is solved now. A new batch is running without this issue.
____________
|
|
|
|
There are still a few old tasks around. I got the _9 (and hopefully final) issue of WU 27184379 from 19 March. It registered the 51% mark but hasn't moved on in over 3 hours: I'm afraid it's going the same way as all previous attempts. |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Yes, I am still getting the bad work unit resends.
Too bad they couldn't be purged before hitting the _9 timeout. |
|
|
|
New tasks today.
But: "ModuleNotFoundError: No module named 'yaml'" |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Same here today. |
|
|
AzmodesSend message
Joined: 7 Jan 17 Posts: 34 Credit: 1,371,429,518 RAC: 0 Level
Scientific publications
|
Same. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Thanks for the feedback. I will look into it today.
In which OS?
____________
|
|
|
|
In which OS?
These were "Python apps for GPU hosts v4.01 (cuda1121)", which is Linux only.
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Right, I just saw it browsing through the failed jobs. It seems that it is in the PythonGPU app, not in PythonGPUBeta.
This is what I think happened: since in PythonGPU the conda environment is created every time, it could be that the dependencies of one or more required packages have changed recently. Therefore, the yaml package was not installed in the environments and was missing during execution.
This is one more reason to switch to the new approach (currently in beta). The conda environment is created, packed and sent to the volunteer machine when executing the first job. There, the environment is simply unpacked, and there is no need to send a new one unless some fix is required.
We will move the PythonGPUBeta app to PythonGPU. PythonGPUBeta is quite stable now, and its approach avoids this kind of problem. I expect we can do it today, but I will post to confirm it.
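The pack-once / unpack-once idea can be sketched in a few lines (a stub tree stands in for the real multi-GB conda environment, which in practice would be produced by a tool like conda-pack; paths and names here are illustrative):

```python
import pathlib
import tarfile

# --- Server side (done once): pack the ready-made environment tree.
env_bin = pathlib.Path("pythongpu_env/bin")
env_bin.mkdir(parents=True, exist_ok=True)
(env_bin / "python").write_text("stub interpreter\n")
with tarfile.open("env.tar.gz", "w:gz") as tar:
    tar.add("pythongpu_env")

# --- Volunteer side (every task): unpack only if not already deployed,
# so the expensive download/extract cost is paid a single time.
deployed = pathlib.Path("deployed/pythongpu_env")
if not deployed.exists():
    deployed.parent.mkdir(exist_ok=True)
    with tarfile.open("env.tar.gz") as tar:
        tar.extractall("deployed")

print((deployed / "bin" / "python").exists())  # True
```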
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
The current version of PythonGPUBeta has been copied to PythonGPU.
It seems the task DISK_LIMIT needs to be increased; I have seen some EXIT_DISK_LIMIT_EXCEEDED errors. We will adjust it.
____________
|
|
|
Greg _BESend message
Joined: 30 Jun 14 Posts: 135 Credit: 121,356,939 RAC: 28,544 Level
Scientific publications
|
Well, this is interesting to read.
Over at RAH they are using Python (CPU), and the tasks are memory and disk space hogs.
I suggest that once you get your GPU tasks working, you make a FAQ on the minimum memory and disk space needed to run these tasks.
One CPU task uses 7.8 GB compressed, 8.4 GB actual space on the drive.
Memory-wise it uses 2,861 MB of physical RAM and 55 to 58 MB of virtual.
If your tasks for GPU are anything like these... well, we will need a bit of free space.
Looking forward to reading about your success getting python running on GPU. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
The sizes of all the app files (including the compressed environment) are:
2.0 GB for Windows with cuda102
2.7 GB for Windows with cuda1131
1.8 GB for Linux with cuda102
2.6 GB for Linux with cuda1131
The additional task-specific data ranges from a few KB to a few MB. I did not expect 7.8 GB compressed (not even after unpacking the environment). Is that the case for all PythonGPU tasks now?
Regarding CPU/GPU usage, this app actually uses a combination of both due to the nature of the problem we are tackling (training an AI agent to develop intelligent behaviour in a simulated environment with reinforcement learning techniques). Interactions with the agent's environment happen on the CPU; learning happens on the GPU.
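As an illustration of that CPU/GPU split, here is a toy version of the training loop (purely a sketch; the environment and the "learning" step are trivial stand-ins, not the project's code):

```python
import random

class ToyEnv:
    """Trivial stand-in for the simulated environment (CPU-bound in the real app)."""
    def reset(self):
        self.state = 0.0
        return self.state

    def step(self, action):
        self.state += action
        reward = 1.0 if self.state > 0 else 0.0
        return self.state, reward

def collect_rollout(env, policy, steps):
    # CPU phase: agent-environment interactions (GPU mostly idle here).
    batch = []
    obs = env.reset()
    for _ in range(steps):
        action = policy(obs)
        obs, reward = env.step(action)
        batch.append((obs, action, reward))
    return batch

def learn(batch):
    # GPU phase in the real app: gradient updates on the whole batch.
    # Here just a summary statistic so the sketch stays runnable.
    return sum(r for _, _, r in batch) / len(batch)

env = ToyEnv()
policy = lambda obs: random.choice([-1.0, 1.0])
for _ in range(3):  # this alternation is why the GPU load looks cyclical
    avg_reward = learn(collect_rollout(env, policy, steps=100))
```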
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Also, the PythonGPU app version used in the new jobs should be 402 (or 4.02).
It should be used automatically; if that is not the case, there is probably some problem, and resetting the app should help.
____________
|
|
|
|
I have e1a46-ABOU_rnd_ppod_avoid_cnn_3-0-1-RND3588_4 running under Linux. I can confirm that my task (and its four predecessors) are running with the v4.02 app.
Small point: can you apply a "weight" to the sub-tasks in job.xml, please? At the moment, the 'decompress' stage is estimated to take 50% of the runtime under Linux, and 66% under Windows. That throws out the estimate for the rest of the run.
Under Linux, my slot directory is occupying 9.8 GB, against an allowed limit of 10,000,000,000 bytes: that's tight, especially when you consider the divergence of binary and decimal representations for bigger files.
All my predecessors for this workunit were running Windows. Three failed on disk limits, and one on memory limits. If every Windows version is using the 7-zip decompressor, there's the extra 'de-archived, but still compressed' step to allow for in the disk limit.
Still awaiting the final hurdle - the upload file size limit. In about 4 hours' time, I reckon - currently at 85% after 10 hours. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Thanks a lot for the info Richard!
You are right, I should adjust the weights of the subtasks in job.xml to 10% for 'decompress' and 90% for executing the Python script. That maybe also explains why jobs were getting stuck at 50% when Python was not closed properly between jobs: the new job could decompress the environment (50%), but the Python script could not be executed.
I have increased the allowed limit to 30,000,000,000 bytes. This should affect all new jobs (to be confirmed) and should solve the DISK_LIMIT problems.
Finally, I was also thinking about sending the compressed environment as a tar.bz2 file instead of a tar.gz to make it smaller. But I have to test that 7-zip handles it correctly.
I will probably deploy these changes first in PythonGPUBeta; that is what it is for.
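For reference, the BOINC wrapper reads an optional per-task weight in job.xml; a sketch of what the adjusted file might look like (application names and the archive name are illustrative, not the project's actual ones):

```xml
<job_desc>
    <task>
        <application>7za.exe</application>
        <command_line>x pythongpu_env.tar -y</command_line>
        <weight>10</weight>  <!-- 'decompress' stage: ~10% of runtime -->
    </task>
    <task>
        <application>python.exe</application>
        <command_line>run.py</command_line>
        <weight>90</weight>  <!-- training script: ~90% of runtime -->
    </task>
</job_desc>
```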
____________
|
|
|
|
I'd say 1%::99%, but thanks. |
|
|
|
Uploaded and reported with no problem at all. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Has the allowed limit changed to 30,000,000,000 bytes?
____________
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Appears so.
<rsc_disk_bound>30000000000.000000</rsc_disk_bound> |
|
|
Greg _BESend message
Joined: 30 Jun 14 Posts: 135 Credit: 121,356,939 RAC: 28,544 Level
Scientific publications
|
The sizes of all the app files (including the compressed environment) are:
2.0 GB for Windows with cuda102
2.7 GB for Windows with cuda1131
1.8 GB for Linux with cuda102
2.6 GB for Linux with cuda1131
The additional task-specific data ranges from a few KB to a few MB. I did not expect 7.8 GB compressed (not even after unpacking the environment). Is that the case for all PythonGPU tasks now?
Regarding CPU/GPU usage, this app actually uses a combination of both due to the nature of the problem we are tackling (training an AI agent to develop intelligent behaviour in a simulated environment with reinforcement learning techniques). Interactions with the agent's environment happen on the CPU; learning happens on the GPU.
Note: I was commenting on the Rosetta@home CPU Pythons.
What yours do, I don't know. I guess I had better add your project and see what happens.
I re-added your project to my system, so if I am home when a task is sent out, I'll have a look. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Thank you!
I have added the subtask weights to the PythonGPUbeta app. Currently testing it with a small batch of tasks.
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Testing was successful, so we can add the weights to the PythonGPU app's job.xml file.
____________
|
|
|
Greg _BESend message
Joined: 30 Jun 14 Posts: 135 Credit: 121,356,939 RAC: 28,544 Level
Scientific publications
|
abouh,
can you have a look at my comments in a thread I created?
The 4.0 task was not increasing in percentage done after I watched it for 10 minutes. Time to completion kept jumping around, 1 second up, 1 second down.
40 minutes run time vs CPU time? That's a hell of a lot of setup time!
Here are the local host task details
Application Python apps for GPU hosts 4.03 (cuda1131)
Workunit name e2a18-ABOU_rnd_ppod_avoid_cnn_4-0-1-RND3898
State Running
Received 4/15/2022 12:06:46 PM
Report deadline 4/20/2022 12:06:46 PM
Estimated app speed 53.74 GFLOPs/sec
Estimated task size 1,000,000,000 GFLOPs
Resources 0.987 CPUs + 1 NVIDIA GPU (GTX 1050)
CPU time at last checkpoint 06:44:35
CPU time 06:47:39
Elapsed time 06:05:04
Estimated time remaining 198d,09:49:25
Fraction done 7.880%
Virtual memory size 7,230.02 MB
Working set size 2,057.87 MB |
|
|
mikeySend message
Joined: 2 Jan 09 Posts: 297 Credit: 6,132,781,625 RAC: 30,286,919 Level
Scientific publications
|
You can delete the previous post about ACEMD3. I posted it incorrectly here.
Some forums let you put a double space or a double period to delete your own post, but you must still do it within the editing time. |
|
|
Greg _BESend message
Joined: 30 Jun 14 Posts: 135 Credit: 121,356,939 RAC: 28,544 Level
Scientific publications
|
Mikey, I know. But the time limit to edit that post expired. I came back days later, not within the 30-60 minutes allowed.
|
|
|
|
I am now running a Python task. It has very low GPU usage, most often around 5 to 10%, occasionally getting up to 20%. Is this normal? Should I wait until I move my GPU from an old 3770K to a 12500 computer for better CPU capabilities to do these tasks? |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
This is normal for the Python on GPU tasks. The tasks run on both the CPU and GPU during different parts of the computation, for the inferencing and machine learning segments.
Read the posts by the admin developer explaining what the process involves:
- Cyclical GPU load is expected in reinforcement learning algorithms. Whenever GPU load is lower, CPU usage should increase. It is correct. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Sorry for the late reply, Greg _BE; I hid the ACEMD3 posts.
I checked your job e2a18-ABOU_rnd_ppod_avoid_cnn_4-0-1-RND3898. Did the progress get stuck, or was it just increasing slowly?
The job was finally completed by another Windows 10 host, but the CPU time is wrong, because it says 668,566.9 seconds.
I am not sure, but maybe one problem is that we ask for only 0.987 CPUs, since that was ideal for ACEMD jobs. In reality, Python tasks use more. I will look into it.
____________
|
|
|
|
New tasks being issued this morning, allocated to the old Linux v4.01 'Python app for GPU hosts' issued in October 2021.
All are failing with "ModuleNotFoundError: No module named 'yaml'". |
|
|
|
I am not sure, but maybe one problem is that we ask only for 0.987 CPUs, since that was ideal for ACEMD jobs. In reality Python tasks use more. I will look into it.
Asking for 1.00 CPUs (or above) would make a significant difference, because that would prompt the BOINC client to reduce the number of tasks being run for other projects.
It would be problematic to increase the CPU demand above 1.00, because the CPU loading is dynamic - BOINC has no provision for allowing another project to utilise the cycles available during periods when the GPUGrid app is quiescent. Normally, a GPU app is given a higher process priority for CPU usage than a pure CPU app, so the operating system should allocate resources to your advantage, but that can be problematic when the wrapper app is in use. That was changed recently: I'll look into the situation with your server version and our current client versions. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Definitely only the latest version 403 should be sent. Thanks for letting us know.
____________
|
|
|
|
BOINC GPU apps, wrapper apps, and process priority
The basic rule for BOINC applications (originally CPU only) has been to run applications at idle priority, to avoid interfering with foreground use of the computer.
Since the introduction of GPU apps into BOINC around 2008, the CPU portion of a GPU app has been automatically run at a slightly higher process priority (below normal) - an attempt to avoid highly-productive GPU work being throttled by competition for CPU resources.
Normally, the BOINC client manages these two different process priorities directly. But when a wrapper app is interpolated between the client and a worker app, it's the wrapper which sets the priority for the worker app. It was a user on this project who first noticed (Issue 3764 - May 2020) that the process priority of a GPU app wasn't being set correctly when it was executing under the control of a wrapper app.
Many false starts later (PRs 3826, 3948, 3988, 3999), a fully consistent set of process priority tools was developed, effective from about 25 September 2020.
But in order for these tools to be useful, compatible versions of both the BOINC client and the wrapper application have to be used. So far as I can tell, BOINC client for Windows v7.16.20 (current) is compliant; Wrapper version 26203 is compliant; but no full public release versions of the BOINC client for Linux are yet compliant (Gianfranco Costamagna's prototyping PPA client should be).
This project appears to be using wrapper code 26016 for Windows, and wrapper code 26198 for Linux. Unless these have been patched locally, neither wrapper will yet allow full process control management.
It's not urgent, but with the new Python apps running in a mixed CPU/GPU environment, it might be helpful to update the project's wrapper codebase. Fortunately, the basic server platform is unaffected by all this. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
We have deprecated v4.01.
Hopefully, if everything went fine, the error
All are failing with "ModuleNotFoundError: No module named 'yaml'".
should not happen any more, and all jobs should use v4.03.
____________
|
|
|
Greg _BESend message
Joined: 30 Jun 14 Posts: 135 Credit: 121,356,939 RAC: 28,544 Level
Scientific publications
|
abouh,
I got another python finally.
But here is something interesting, the CPU value according to BOINC Tasks is 221%!
How can you get more than 100% of a single core?
Another observation, elapsed time vs CPU time. The two are off by about 5 hours.
4:01 vs 8:54 currently
Progress is not moving very fast. In the time it has taken me to write this, it is stuck at 7.88%.
Now 4:16 vs 9:24 and still 7.88%!! 15 mins and no progress? If this hasn't changed in the next hour, I am also aborting this task.
BTW, 46 checkpoints in the 4 hrs of run time.
https://www.gpugrid.net/workunit.php?wuid=27219917
Exit status 195 (0xc3) EXIT_CHILD_FAILED
Computer ID 589200
Exception: The wandb backend process has shutdown
GeForce GTX 1050 (2047MB) driver: 512.15
Exit status 203 (0xcb) EXIT_ABORTED_VIA_GUI
Computer ID 590211
Run time 241,306.00
CPU time 1,471.50
GeForce RTX 3080 Ti (4095MB) driver: 497.
The point of this information is:
1) I have a GTX 1050 and a 1080. The previous Python failed with the same exit error as the first person on this Python task. What is EXIT_CHILD_FAILED? Something on your end or on our end?
2) Person 2 probably aborted because of the way BOINC reads the data to determine the time. I killed my first Python because it showed 160+ days to completion.
***I give up. No progress in 30 minutes since I started this post***
Computer: DESKTOP-LFM92VN
Project GPUGRID
Name e5a13-ABOU_rnd_ppod_avoid_cnn_4-0-1-RND0256_2
Application Python apps for GPU hosts 4.03 (cuda1131)
Workunit name e5a13-ABOU_rnd_ppod_avoid_cnn_4-0-1-RND0256
State Running
Received 4/27/2022 4:35:18 PM
Report deadline 5/2/2022 4:35:18 PM
Estimated app speed 3,171.20 GFLOPs/sec
Estimated task size 1,000,000,000 GFLOPs
Resources 0.987 CPUs + 1 NVIDIA GPU (device 1)
CPU time at last checkpoint 09:58:18
CPU time 10:08:59
Elapsed time 04:37:57
Estimated time remaining 161d,06:23:41
Fraction done 7.880%
Virtual memory size 6,429.20 MB
Working set size 1,072.13 MB
Directory slots/12
Process ID 16828
Debug State: 2 - Scheduler: 2
That's 4:01 to 4:38 and still at 7.88%
Checkpoints count up. CPU is 219%
This is all messed up.
I join the abort team.
------------
Something about the other task that failed with exit child.
A few extracts:
wandb: Network error (ReadTimeout), entering retry loop.
Exception in thread StatsThr:
Traceback (most recent call last):
File "D:\data\slots\13\lib\site-packages\psutil\_common.py", line 449, in wrapper
ret = self._cache[fun]
AttributeError: 'Process' object has no attribute '_cache'
During handling of the above exception, another exception occurred:
(followed by line this and line that, etc)
And then this:
OSError: [WinError 1455] The paging file is too small for this operation to complete
But the next person who got it has this kind of setup:
CPU type AuthenticAMD
AMD Ryzen 5 5600X 6-Core Processor [Family 25 Model 33 Stepping 0]
Number of processors 12
Coprocessors NVIDIA NVIDIA GeForce RTX 3080 (4095MB) driver: 512.15
Operating System Microsoft Windows 11
x64 Edition, (10.00.22000.00
I run GTX and Win10 with a Ryzen 7 2800 and 7.16.20 BOINC |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
But here is something interesting, the CPU value according to BOINC Tasks is 221%!
How can you get more than 100% of a single core?
Because the task was actually using a little more than two cores to process the work.
That's why I have set Python tasks to allocate 3 CPU threads for BOINC scheduling. |
|
|
Greg _BESend message
Joined: 30 Jun 14 Posts: 135 Credit: 121,356,939 RAC: 28,544 Level
Scientific publications
|
But here is something interesting, the CPU value according to BOINC Tasks is 221%!
How can you get more than 100% of a single core?
Because the task was actually using a little more than two cores to process the work.
That's why I have set Python tasks to allocate 3 CPU threads for BOINC scheduling.
Ok... interesting, but what accounts for the lack of progress in 30 mins on the task that I just killed, and the exit child error and blow-up on the previous Python?
I mean really... stuck at 7.88%, to two decimal places, for more than 30 minutes?
I don't know of any project that can't advance even 1/100th of a percent in 30 minutes.
I've seen my share of slow tasks in other projects, but this one... wow...
And how do you go about setting just Python to 3 CPU cores? That's beyond my knowledge level. |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
You use an app_config.xml file in the project like this:
<app_config>
<app>
<name>acemd3</name>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>1.0</cpu_usage>
</gpu_versions>
</app>
<app>
<name>acemd4</name>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>1.0</cpu_usage>
</gpu_versions>
</app>
<app>
<name>PythonGPU</name>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>3.0</cpu_usage>
</gpu_versions>
</app>
<app>
<name>PythonGPUbeta</name>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>3.0</cpu_usage>
</gpu_versions>
</app>
</app_config> |
|
|
Greg _BESend message
Joined: 30 Jun 14 Posts: 135 Credit: 121,356,939 RAC: 28,544 Level
Scientific publications
|
You use an app_config.xml file in the project like this:
<app_config>
<app>
<name>acemd3</name>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>1.0</cpu_usage>
</gpu_versions>
</app>
<app>
<name>acemd4</name>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>1.0</cpu_usage>
</gpu_versions>
</app>
<app>
<name>PythonGPU</name>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>3.0</cpu_usage>
</gpu_versions>
</app>
<app>
<name>PythonGPUbeta</name>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>3.0</cpu_usage>
</gpu_versions>
</app>
</app_config>
Ok, thanks. I will make that file tomorrow or this weekend. Too tired to try it tonight.
|
|
|
|
We have deprecated v4.01
Hopefully, if everything went fine, the error
All are failing with "ModuleNotFoundError: No module named 'yaml'". should not happen any more. And all jobs should use v4.03
I've recently reset the GPUGrid project on every one of my hosts, but I've still received v4.01 at several of them, and the tasks failed with the mentioned error.
Some subsequent v4.03 resends of the same tasks have eventually succeeded on other hosts. |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Unfortunately the admins never yanked the malformed tasks from distribution.
They will only disappear when they hit the 7th (_6) resend and it fails. Then the workunit will be pulled from distribution. (Too many errors (may have bug))
I've had a lot of the bad Python 4.01 tasks too, but thankfully a lot of them were at the tail end of distribution. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Sorry for the late reply, Greg _BE; I was away for the last 5 days. Thank you very much for the detailed report.
----------
1. Regarding this error:
Exit status 195 (0xc3) EXIT_CHILD_FAILED
Computer ID 589200
Exception: The wandb backend process has shutdown
GeForce GTX 1050 (2047MB) driver: 512.15
It seems the process failed after raising the exception "The wandb backend process has shutdown". wandb is the Python package we use to send out logs about the agent training process; it provides useful information for better understanding the task results. It seems the process failed and then the whole task got stuck, which is why no progress was being made. Since it reached 7.88% progress, I assume it worked well until then. I need to review other jobs to see why this could be happening and whether it happened on other machines. We had not detected this issue before. Thanks for bringing it up.
----------
2. The time estimate is not right for now due to the way BOINC computes it; Richard provided a very complete explanation in a previous post. We hope it will improve over time... for now, be aware that it is completely wrong.
----------
3. Regarding this error:
OSError: [WinError 1455] The paging file is too small for this operation to complete
It is related to using pytorch in windows. It is explained here: https://stackoverflow.com/questions/64837376/how-to-efficiently-run-multiple-pytorch-processes-models-at-once-traceback
We are applying this solution to mitigate the error, but for now it cannot be eliminated completely.
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Seems like deprecating the version v4.01 did not work then... I will check if there is anything else we can do to enforce usage of v4.03 over the old one.
____________
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
You need to send a message to all hosts when they connect to the scheduler to physically delete the 4.01 application from the host and to delete the entry in the client_state.xml file |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
I sent a batch which will fail with
yaml.constructor.ConstructorError: could not determine a constructor for the tag 'tag:yaml.org,2002:python/object/apply:numpy.core.multiarray.scalar'
It is just an error in the experiment configuration. I immediately cancelled the experiment and fixed the configuration, but the tasks had already been sent.
I am very sorry for the inconvenience. Fortunately the jobs fail right after starting, so there is no need to kill them. The next batch contains jobs with the fixed configuration.
____________
|
|
|
|
I was not getting too many of the python work units, but I recently received/completed one. I know they take... a while to complete.
Specifically, I am looking at task 32892659, work unit 27222901.
I am glad it completed, but it was a long haul.
It was mentioned that "completing a task gives 50000 credits and 75000 if completed specially fast"
How fast do these need to complete for 75000? I am not saying I have the fastest processors, but they are definitely not slow (they run at ~3GHz with boost), and the GPUs are definitely not slow.
Thanks! |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
I get the full "quick" credits for my Python tasks because I normally crunch them in 5-8 hours.
You took more than 2 days to report yours. You get a 50% boost in credit if the task is returned within 1 day and a 25% boost if returned within 2 days. |
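Keith's bonus rules can be sketched as a small calculation. The 50,000 base and the one/two-day thresholds are taken from this thread; the exact server-side formula is an assumption, so treat this as illustrative only:

```python
# Hedged sketch of the credit bonus rules described in this thread:
# base 50,000 credits, +50% if the task is returned within 1 day,
# +25% if returned within 2 days. The real server-side formula may differ.

def task_credit(turnaround_hours, base=50000):
    """Estimate credit from wall-clock turnaround (sent -> reported)."""
    if turnaround_hours <= 24:
        return int(base * 1.50)   # 75,000 for a same-day return
    if turnaround_hours <= 48:
        return int(base * 1.25)   # 62,500 within two days
    return base                   # no bonus after that

print(task_credit(8), task_credit(30), task_credit(60))
```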
|
|
|
I get the full "quick" credits for my Python tasks because I normally crunch them in 5-8 hours.
You took more than 2 days to report yours. You get a 50% boost in credit if the task is returned within 1 day and a 25% boost if returned within 2 days.
Got it. Thanks! I think I am confused why this task took so long to report. What is usually the "bottleneck" when running these tasks? |
|
|
|
I get the full "quick" credits for my Python tasks because I normally crunch them in 5-8 hours.
You took more than 2 days to report yours. You get a 50% boost in credit if the task is returned within 1 day and a 25% boost if returned within 2 days.
Got it. Thanks! I think I am confused why this task took so long to report. What is usually the "bottleneck" when running these tasks?
these tasks are multi-core tasks. they will use a lot of cores (maybe up to 32 threads?). are you running CPU work from other projects? if you are, then it's probably starved of CPU resources trying to run the Python task.
____________
|
|
|
|
these tasks are multi-core tasks. they will use a lot of cores (maybe up to 32 threads?). are you running CPU work from other projects? if you are, then it's probably starved of CPU resources trying to run the Python task.
The critical point being that they aren't declared to BOINC as needing multiple cores, so BOINC doesn't automatically clear extra CPU space for them to run in. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Right, I wish there was a way to specify that to BOINC on our side... does adjusting the app_config.xml help? I guess that has to be done on the user side
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
yes, the tasks run 32 agent environments in parallel python processes. The bottleneck could definitely be the CPU, because BOINC is not aware of it.
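A rough, torch-free sketch of that pattern (the worker body and counts here are placeholders, not the actual GPUGrid code): each environment runs as its own Python process, which is why the task loads many more cores than BOINC accounts for.

```python
# Hedged sketch: N environment workers stepping in parallel, the pattern
# abouh describes. The real tasks use PyTorch and real RL environments;
# the CPU-bound worker body here is a stand-in.
from multiprocessing import Pool

def step_environment(worker_id):
    # Placeholder for one agent/environment rollout (CPU-bound work).
    return sum(i * i for i in range(10_000)) + worker_id

def run_workers(n_workers):
    # Fan the per-agent work out across n_workers processes.
    with Pool(n_workers) as pool:
        return pool.map(step_environment, range(n_workers))

if __name__ == "__main__":
    print(len(run_workers(8)))  # the GPUGrid tasks reportedly use 32
```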
____________
|
|
|
|
Thank you all for the replies- this was exactly the issue. I will keep that in mind if I receive another one of these work units. Theoretically, is it possible to run several of these tasks in parallel on the same GPU, since it really is not too GPU intensive and I have enough cores/memory? |
|
|
|
Thank you all for the replies- this was exactly the issue. I will keep that in mind if I receive another one of these work units. Theoretically, is it possible to run several of these tasks in parallel on the same GPU, since it really is not too GPU intensive and I have enough cores/memory?
Only if you have more than 64 threads per GPU available and you stop processing of any existing CPU work.
____________
|
|
|
|
abouh asked Right, I wish there was a way to specify that to BOINC on our side... does adjusting the app_config.xml help? I guess that has to be done of the user side
I tried that, but the BOINC manager on my PC will overallocate CPUs. I am currently running multicore ATLAS CPU tasks from LHC alongside the python tasks from GPUGrid. The ATLAS tasks are set to use 8 CPUs and the python tasks are set to use 10 CPUs. The example for this response is on an AMD CPU with 8 cores/16 threads. BOINC is set to use 15 threads. It will run one GPUGrid python 10-thread task and one LHC 8-thread task at the same time. That is 18 threads running on a 15-thread CPU.
Here is my app_config for gpugrid:
<app_config>
<app>
<name>acemd3</name>
<gpu_versions>
<gpu_usage>1</gpu_usage>
<cpu_usage>1</cpu_usage>
</gpu_versions>
</app>
<app>
<name>PythonGPU</name>
<cpu_usage>10</cpu_usage>
<gpu_versions>
<gpu_usage>1</gpu_usage>
<cpu_usage>10</cpu_usage>
</gpu_versions>
<app_version>
<app_name>PythonGPU</app_name>
<plan_class>cuda1121</plan_class>
<avg_ncpus>10</avg_ncpus>
<ngpus>1</ngpus>
<cmdline>--nthreads 10</cmdline>
</app_version>
</app>
<app>
<name>PythonGPUbeta</name>
<cpu_usage>10</cpu_usage>
<gpu_versions>
<gpu_usage>1</gpu_usage>
<cpu_usage>10</cpu_usage>
</gpu_versions>
<app_version>
<app_name>PythonGPU</app_name>
<plan_class>cuda1121</plan_class>
<avg_ncpus>10</avg_ncpus>
<ngpus>1</ngpus>
<cmdline>--nthreads 10</cmdline>
</app_version>
</app>
<app>
<name>Python</name>
<cpu_usage>10</cpu_usage>
<gpu_versions>
<gpu_usage>1</gpu_usage>
<cpu_usage>10</cpu_usage>
</gpu_versions>
<app_version>
<app_name>PythonGPU</app_name>
<plan_class>cuda1121</plan_class>
<avg_ncpus>10</avg_ncpus>
<ngpus>1</ngpus>
<cmdline>--nthreads 10</cmdline>
</app_version>
</app>
<app>
<name>acemd4</name>
<gpu_versions>
<gpu_usage>1</gpu_usage>
<cpu_usage>1</cpu_usage>
</gpu_versions>
</app>
</app_config>
And here is my app_config for lhc:
<app_config>
<app>
<name>ATLAS</name>
<cpu_usage>8</cpu_usage>
</app>
<app_version>
<app_name>ATLAS</app_name>
<plan_class>vbox64_mt_mcore_atlas</plan_class>
<avg_ncpus>8</avg_ncpus>
<cmdline>--nthreads 8</cmdline>
</app_version>
</app_config>
If anyone has any suggestions for changes to the app_config files, please let me know.
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
I can run 2 jobs in parallel manually on my machine with 12 CPUs. They are slower than a single job, but much faster than running them sequentially, especially since the jobs alternate between using the CPU and using the GPU. The 2 jobs won't be completely in sync, so it works as long as the GPU has enough memory.
However, I think GPUGrid currently assigns one job per GPU automatically, with the environment variable GPU_DEVICE_NUM.
____________
|
|
|
|
However, I think currently GPUGrid automatically assigns one job per GPU, with the environment variable GPU_DEVICE_NUM.
Normally, the user's BOINC client will assign the GPU device number, and this will be conveyed to the job by the wrapper.
You can easily run two jobs per GPU (both with the same device number), and give them both two full CPU cores each, by using an app_config.xml file including
...
<gpu_versions>
<gpu_usage>0.5</gpu_usage>
<cpu_usage>2.0</cpu_usage>
</gpu_versions>
...
(full details in the user manual) |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
I see, thanks for the clarification
____________
|
|
|
Greg _BESend message
Joined: 30 Jun 14 Posts: 135 Credit: 121,356,939 RAC: 28,544 Level
Scientific publications
|
I guess I am going to have to give up on this project.
All I get is exit child errors. Every single task.
For example: https://www.gpugrid.net/result.php?resultid=32894080 |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
This task is from a batch of wrongly configured jobs. It is an error on our side. It was immediately corrected, but the jobs were already sent and could not be cancelled. They crash right after starting to run, but it is just this batch. The following batches work normally.
I mentioned it in a previous post, sorry for the problems... this specific job would have crashed anywhere.
____________
|
|
|
Greg _BESend message
Joined: 30 Jun 14 Posts: 135 Credit: 121,356,939 RAC: 28,544 Level
Scientific publications
|
This task is from a batch of wrongly configured jobs. It is an error on our side. It was immediately corrected, but the jobs were already sent and could not be cancelled. They crash right after starting to run, but it is just this batch. The following batches work normally.
I mentioned it in a previous post, sorry for the problems... this specific job would have crashed anywhere.
ok...waiting in line for the next batch.
|
|
|
|
I am still attempting to diagnose why these tasks take the system so long to complete. I changed the config to "reserve" 32 cores for these tasks. I also made a change so that two of these tasks run simultaneously; I am not clear on how these tasks use multithreading. The system running them has 56 physical cores across two CPUs (112 logical). Are the "32" cores used by one of these tasks physical or logical? Also, I am relatively confident the GPUs can handle this (RTX A6000), but let me know if I am missing something. |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Why do you think the tasks are running abnormally long?
Have you ever looked at the wall clock to see how long they take from start to finish.
You are running and finishing them well within the 5 day deadline.
You are finishing them in two days and get the 25% bonus credits.
Are you being confused by the cpu and gpu runtimes on the task?
That is the accumulated time across all 32 threads you appear to be running them on. It does not indicate the real wall-clock time. If you ran them on fewer threads, the accumulated time would be much less.
You don't really need that much cpu support. The task is configured to run on 1 cpu as delivered.
|
|
|
|
Why do you think the tasks are running abnormally long?
Have you ever looked at the wall clock to see how long they take from start to finish.
You are running and finishing them well within the 5 day deadline.
You are finishing them in two days and get the 25% bonus credits.
Are you being confused by the cpu and gpu runtimes on the task?
That is the accumulated time across all 32 threads you appear to be running them on. It does not indicate the real wall-clock time. If you ran them on fewer threads, the accumulated time would be much less.
You don't really need that much cpu support. The task is configured to run on 1 cpu as delivered.
They should be put back into the beta category. They still have too many bugs and need more work. It looks like someone was in a hurry to leave for summer vacation. I decided to stop crunching them, for now. Of course, there isn't much to crunch here anyway, right now.
There is always next fall to fix this.....................
|
|
|
|
Are you being confused by the cpu and gpu runtimes on the task?
That is the accumulated time across all 32 threads you appear to be running them on. That does not indicate the real walltime calculation. If you ran them on less threads, the accumulated time would be much less.
They are declared to use less than 1 CPU (and that's all BOINC knows about), but in reality they use much more.
This website confuses matters by mis-reporting the elapsed time as the total (summed over all cores) CPU time.
The only way to be exactly sure what has happened is to examine the job_log_[GPUGrid] file on your local machine. The third numeric column ('ct ...') is the total CPU time, summed over all cores: the penultimate column ('et ...') is the elapsed - wall clock - time for the task as a whole.
Locally, ct will be above et for the task as a whole, but on this website, they will be reported as the same.
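Richard's field descriptions can be checked with a short parse of the job_log line format: tokens alternate tag/value after the leading timestamp, with `ct` the summed CPU time and `et` the elapsed wall-clock time. This is a sketch based on that description, using a sample line posted in this thread:

```python
# Hedged sketch of parsing a BOINC client job_log line, using the field
# tags described above: ct = total CPU time (summed over all cores),
# et = elapsed wall-clock time, nm = task name.

def parse_job_log_line(line):
    tokens = line.split()
    fields = {"timestamp": int(tokens[0])}
    # Remaining tokens alternate tag/value: ue, ct, fe, nm, et, es.
    for tag, value in zip(tokens[1::2], tokens[2::2]):
        fields[tag] = value
    return fields

line = ("1653158519 ue 148176.747654 ct 3544023.000000 "
        "fe 1000000000000000000 "
        "nm e5a63-ABOU_rnd_ppod_demo_sharing_large-0-1-RND5179_0 "
        "et 117973.295733 es 0")
rec = parse_job_log_line(line)
elapsed_hours = float(rec["et"]) / 3600   # wall-clock hours (~32.8)
cpu_hours = float(rec["ct"]) / 3600       # summed over all cores
```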
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
I'm not having any issues with them on Linux. I don't know how that compares to Windows hosts.
I get at least a couple a day per host for the past several weeks.
Nothing like a month ago when there were a thousand or so available.
I doubt we ever return to the production of years ago unfortunately. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
The 32 cores are logical: python processes running in parallel. I can run them locally on a 12 CPU machine. The GPU should be fine as well, so you are correct about that.
We have a time estimation problem, discussed previously in the thread. As Keith mentioned, the real wall-clock time should be much less than reported.
It would be very helpful if you could let us know if that is the case. In particular, getting 75000 credits per job means the jobs are receiving the extra credit for returning fast.
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
We decided to remove the beta flag from the current version of the python app when we found it to work without errors on a reasonable number of hosts. We are aware that, even though we test it on our local Linux and Windows machines, there is a vast variety of configurations, versions and resource capabilities among the hosts, and it will not work on all of them.
However, please note that in research at some point we need to start running experiments (I want to talk more about that in my next post). Further testing and fixing is required and we are committed to doing it. This takes a long time, so we need to work on both things in parallel. We will still use the beta app to test new versions.
Please, if you are talking about a recurring specific problem in your machines, let me know and will look into it.
____________
|
|
|
|
I'm away from my machines at the moment, but I can confirm that's the case.
Look at task 32897902. Reported time 108,075.00 seconds (well over a day), but it got 75,000 credits. It was away from the server for about 11 hours. GTX 1660, Linux Mint. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
I am not sure about the acemd tasks, but for python tasks I will increase the number of tasks progressively.
To recap a bit about what we are doing: we are experimenting with populations of machine learning agents, trying to figure out how important social interactions and information sharing are for intelligent agents. More specifically, we train multiple agents for periods of time on different GPUGrid machines, which later return their results to the server. We are researching what kind of information they can share and how to build a common knowledge base, similar to what we humans do. Then, new generations of the populations repeat the process, already equipped with the knowledge distilled by previous generations.
At the moment we have several experiments running with population sizes of 48 agents; that means a batch of 48 agents every 24-48h. We also have one experiment of 64 agents and one of 128. To my knowledge no recent paper has tried more than 80, and we plan to keep increasing the population sizes to figure out how relevant scale is for intelligent agent behavior. Ideally I would like to reach population sizes of 256, 512 and 1024.
____________
|
|
|
|
Thanks for this info. Here is the log file for a recently completed task:
1653158519 ue 148176.747654 ct 3544023.000000 fe 1000000000000000000 nm e5a63-ABOU_rnd_ppod_demo_sharing_large-0-1-RND5179_0 et 117973.295733 es 0
So, the clock time is 117973.295733? Which would be ~32 hours of actual runtime?
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Thanks for this info. Here is the log file for a recently completed task:
1653158519 ue 148176.747654 ct 3544023.000000 fe 1000000000000000000 nm e5a63-ABOU_rnd_ppod_demo_sharing_large-0-1-RND5179_0 et 117973.295733 es 0
So, the clock time is 117973.295733? Which would be ~32 hours of actual runtime?
No. That is incorrect. You cannot use the clocktime reported in the task. That will accumulate over however many cpu threads the task is allowed to show to BOINC. Blame BOINC for this issue not the application.
Look at the sent time and the returned time to calculate how long the task actually took to process. Returned time minus the sent time = length of time to process. |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
BOINC just does not know how to account for these Python tasks, which act "sorta" like an MT task.
But BOINC does not handle MT tasks correctly either, for that matter.
Blame it on the BOINC code, which is old. It knows how to handle a task on a single cpu core, and that is about all it gets right. |
|
|
|
1653158519 ue 148176.747654 ct 3544023.000000 fe 1000000000000000000 nm e5a63-ABOU_rnd_ppod_demo_sharing_large-0-1-RND5179_0 et 117973.295733 es 0
No. That is incorrect. You cannot use the clocktime reported in the task. That will accumulate over however many cpu threads the task is allowed to show to BOINC. Blame BOINC for this issue not the application.
Actually, that line (from the client job log) is a useful source of information. It contains both
ct 3544023.000000
which is the CPU or core time - as you say, it dates back to the days when CPUs only had one core. But now it comprises the sum over however many cores are used.
and et 117973.295733
That's the elapsed time (wall-clock measure), which was added when GPU computing was first introduced and CPU time was no longer a reliable indicator of work done.
I agree that many outdated legacy assumptions remain active in BOINC, but I think it's got beyond the point where mere tinkering could fix it - we really need a full Mark 2 rewrite. But that seems unlikely under the current management. |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
OK, so here is a back of the napkin calculation on how long the task actually took to crunch
Take the et time from the job_log entry for the task and divide by 32 since the tasks spawn 32 processes on the cpu to account for the way that BOINC calculates cpu_time accumulated across all cores crunching the task.
So 117973.295733 / 32 = 3686.665491656 seconds
or in reality a little over an hour to crunch.
That agrees with the wall clock time (reported - sent) times I have been observing for the shorty demo tasks that are currently being propagated to hosts.
|
|
|
|
Well, since there's also a 'nm' (name) field in the client job log, we can find the rest:
Task 32897743, run on host 588658.
Because it's a Windows task, there's a lot to digest in the std_err log, but it includes
04:44:21 (34948): .\7za.exe exited; CPU time 9.890625
04:44:21 (34948): wrapper: running python.exe (run.py)
13:32:28 (7456): wrapper (7.9.26016): starting
13:32:28 (7456): wrapper: running python.exe (run.py) (that looks like a restart)
Then some more of the same, and finally
14:41:51 (28304): python.exe exited; CPU time 2816214.046875
14:41:56 (28304): called boinc_finish(0) |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
14:41:51 (28304): python.exe exited; CPU time 2816214.046875
14:41:56 (28304): called boinc_finish(0)
So 2816214 / 32 = 88006 seconds
88006 / 3600 seconds = 24.44 hours
That is close to matching the received time minus the sent time of a little over a day.
The task didn't get the full 50% credit bonus for returning within 24 hours, but did get the 25% bonus.
I'm very surprised that the card is that slow when working with a CPU clocked to 2.7GHz in Windows.
|
|
|
|
I'm very surprised that that card is so slow or that the card is that slow when working with a cpu clocked to 2.7Ghz in Windows.
That is what I am confused about. I can tell you that these calculations of time seem accurate- it was somewhere around 24 hours that it was actually running. Also, the CPU was running closer to 3.1Ghz (boost). It barely pushed the GPU when running. Nothing changed with time when I reserved 32 cores for these tasks. I really can't nail down the issue. |
|
|
|
As abouh has posted previously, the two resource types are used alternately - "cyclical GPU load is expected in Reinforcement Learning algorithms. Whenever GPU load in lower, CPU usage should increase." (message 58590). Any instantaneous observation won't reveal the full situation: either CPU will be high, and GPU low, or vice-versa. |
|
|
|
Yep- I observe the alternation. When I suspend all other work units, I can see that just one of these tasks will use a little more than half of the logical processors. I know it has been discussed that although it says it uses 1 processor (or 0.996, to be exact), it actually uses more. I am running E@H work units and I think that running both is choking the CPU. Is there a way to limit the processor count that these python tasks use? In the past, I changed the app_config to use 32, but it did not seem to speed anything up, even though the cores were reserved for the work unit.
I am not sure there is a way to post images, but here are some links to show CPU and GPU usage when only running one python task. Is it supposed to use that much of the CPU?
https://i.postimg.cc/Kv8zcMGQ/CPU-Usage1.jpg
https://i.postimg.cc/LX4dkj0b/GPU-Usage-1.jpg
https://i.postimg.cc/tRM0PZdB/GPU-Usage-2.jpg
I am sorry for all of the questions.... just trying my best to understand.
|
|
|
|
As abouh has posted previously, the two resource types are used alternately - "cyclical GPU load is expected in Reinforcement Learning algorithms. Whenever GPU load in lower, CPU usage should increase.
This can be seen very clearly in the following two images.
Higher CPU - Lower GPU usage cycle:
Higher GPU - Lower CPU usage cycle:
CPU and GPU usage graphs follow an anti-cyclical pattern. |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Is there a way to limit the processor count that these python tasks use? In the past, I changed the app config to use 32, but it did not seem to speed anything up, even though they were reserved for the work unit.
I am sorry for all of the questions.... just trying my best to understand.
No, there isn't, as a user. These are not real MT tasks, nor any form that BOINC recognizes and provides configuration options for.
Your only solution is to run one at a time via a max_concurrent statement in an app_config.xml file, and then also restrict the number of cores your other projects are allowed to use.
That said, I don't know why you are having such difficulties. Maybe chalk it up to Windows, I don't know.
I run 3 other cpu projects at the same time as the GPUGrid Python on GPU tasks, with 28-46 cpu cores occupied by Universe, TN-Grid or yoyo depending on the host. Every host primarily runs Universe as the major cpu project.
There is no impact on the python tasks while running the other cpu apps.
|
|
|
|
No impact on the python tasks while running the other cpu apps.
Conversely, I notice a performance loss on other CPU tasks when python tasks are executing.
Yesterday I processed python task e7a30-ABOU_rnd_ppod_demo_sharing_large-0-1-RND2847_2 on my host #186626.
It was received at 11:33 UTC, and the result was returned at 22:50 UTC.
At the same period, PrimeGrid PPS-MEGA CPU tasks were also being processed.
The average processing time for eighteen (18) PPS-MEGA CPU tasks was 3098.81 seconds.
The average processing time for 18 other PPS-MEGA CPU tasks processed outside that period was 2699.11 seconds.
This represents an extra processing time of about 400 seconds per task, or about a 12.9% performance loss.
There is not such a noticeable difference when running Gpugrid ACEMD tasks. |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
I also notice an impact on my running Universe tasks. It generally adds 300 seconds to their normal computation times when they run in conjunction with a python task. |
|
|
|
Windows 10 machine running task 32899765. Had a power outage. When the power came back on, task was restarted but just sat there doing nothing. The stderr.txt file showed the following error:
file pythongpu_windows_x86_64__cuda102.tar
already exists. Overwrite with
pythongpu_windows_x86_64__cuda102.tar?
(Y)es / (N)o / (A)lways / (S)kip all / A(u)to rename all / (Q)uit?
Task was stalled waiting on a response.
BOINC was stopped and the pythongpu_windows_x86_64__cuda102.tar file was removed from the slots folder.
Computer was restarted then the task was restarted. Then the following error message appeared several times in the stderr.txt file.
OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "C:\ProgramData\BOINC\slots\0\lib\site-packages\torch\lib\caffe2_detectron_ops_gpu.dll" or one of its dependencies.
Detected memory leaks!
Page file size was increased to 64000MB and rebooted.
Started the task again and still got the error message about the page file size being too small. Then the task abended.
If you need more info about this task, please let me know. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Thank you captainjack for the info.
1.
Interesting that the job gets stuck with:
(Y)es / (N)o / (A)lways / (S)kip all / A(u)to rename all / (Q)uit?
The job command line is the following:
7za.exe pythongpu_windows_x86_64__cuda102.tar -y
and I got from the application documentation (https://info.nrao.edu/computing/guide/file-access-and-archiving/7zip/7z-7za-command-line-guide):
7-Zip will prompt the user before overwriting existing files unless the user specifies the -y
So essentially -y assumes "Yes" on all Queries. Honestly I am confused by this behaviour, thanks for pointing it out. Maybe I am missing the x, as in
7za.exe x pythongpu_windows_x86_64__cuda102.tar -y
I will test it on the beta app.
2.
Regarding the other error
OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "C:\ProgramData\BOINC\slots\0\lib\site-packages\torch\lib\caffe2_detectron_ops_gpu.dll" or one of its dependencies.
Detected memory leaks!
is related to pytorch and nvidia and it only affects some windows machines. It is explained here: https://stackoverflow.com/questions/64837376/how-to-efficiently-run-multiple-pytorch-processes-models-at-once-traceback
TL;DR: Windows and Linux treat multiprocessing in python differently, and on Windows each process commits much more memory, especially when using pytorch.
We use the script suggested in the link to mitigate the problem, but it could be that for some machines memory is still insufficient. Does that make sense in your case?
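The Windows/Linux difference comes down to the multiprocessing start method: Linux defaults to "fork" (children share the parent's memory copy-on-write), while Windows only supports "spawn" (each child is a fresh interpreter that re-imports everything, so each PyTorch worker commits its own copy of the libraries). A minimal, torch-free sketch of the two methods:

```python
# Hedged sketch of the start-method difference behind the paging-file
# error: 'fork' (Linux default) shares parent memory copy-on-write;
# 'spawn' (the only option on Windows) starts a fresh interpreter per
# worker, so every worker re-commits its own copy of the imports.
import multiprocessing as mp
import os

def worker(_):
    # In a 'fork' child this shares the parent's memory copy-on-write;
    # in a 'spawn' child the interpreter and all imports start from scratch.
    return os.getpid()

def run(method):
    ctx = mp.get_context(method)   # raises ValueError if unsupported
    with ctx.Pool(2) as pool:
        return pool.map(worker, range(2))

if __name__ == "__main__":
    # Windows reports only ['spawn']; Linux also offers 'fork'.
    print(mp.get_all_start_methods())
    if "fork" in mp.get_all_start_methods():
        print(run("fork"))
```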
____________
|
|
|
|
Thank you abouh for responding,
I looked through my saved messages from the task to see if there was anything else I could find that might be of value and couldn't find anything.
In regard to the "out of memory" error, I tried to read through the stackoverflow link about the memory error. It is way above my level of technical expertise at this point, but it seemed like the amount of nvidia memory might have something to do with it. I am using an antique GTX970 card. It's old but still works.
Good luck coming up with a solution. If you want me to do any more testing, please let me know. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Seems like there are some possible workarounds:
https://github.com/Spandan-Madan/Pytorch_fine_tuning_Tutorial/issues/10
basically, two users mentioned
I think I managed to solve it (so far). Steps were:
1)- Windows + pause key
2)- Advanced system settings
3)- Advanced tab
4)- Performance - Settings button
5)- Advanced tab - Change button
6)- Uncheck the "Automatically... BLA BLA" checkbox
7)- Select the System managed size option box.
and
If it's of any value, I ended up setting the values into manual and some ridiculous amount of 360GB as the minimum and 512GB for the maximum. I also added an extra SSD and allocated all of it to Virtual memory. This solved the problem and now I can run up to 128 processes using pytorch and CUDA.
I did find out that every launch of Python and pytorch, loads some ridiculous amount of memory to the RAM and then when not used often goes into the virtual memory.
Maybe it can be helpful for someone
____________
|
|
|
bibiSend message
Joined: 4 May 17 Posts: 14 Credit: 14,957,460,267 RAC: 38,617,332 Level
Scientific publications
|
Hi abouh,
is there a commandline like
7za.exe pythongpu_windows_x86_64__cuda102.tar.gz
without -y to get pythongpu_windows_x86_64__cuda102.tar ?
|
|
|
WR-HW95Send message
Joined: 16 Dec 08 Posts: 7 Credit: 1,436,322,786 RAC: 338,160 Level
Scientific publications
|
So what's going on here?
https://www.gpugrid.net/workunit.php?wuid=27228431
RuntimeError: CUDA out of memory. Tried to allocate 446.00 MiB (GPU 0; 11.00 GiB total capacity; 470.54 MiB already allocated; 8.97 GiB free; 492.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
22:40:37 (12736): python.exe exited; CPU time 3346.203125
All kinds of errors on other tasks, from "card too old" (1080 Ti) to out of RAM.
Atm commit charge is 70 GB and RAM usage is 22 GB of 64 GB.
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
The command line
7za.exe pythongpu_windows_x86_64__cuda102.tar.gz
works fine if the job is executed without interruptions.
However, if the job is interrupted and restarted later, the command is executed again, and 7za needs to know whether to replace the already existing files with the new ones.
The -y flag just makes sure the script does not get stuck at that prompt waiting for an answer.
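As a sketch of how the flag fits in (the wrapper function here is hypothetical; only the 7za flag and archive name come from the thread):

```python
import subprocess

ARCHIVE = "pythongpu_windows_x86_64__cuda102.tar.gz"

def extract(archive):
    # Build the extraction command. "-y" pre-answers "yes" to the
    # "Replace existing file?" prompt, so a restarted job that re-runs
    # this step cannot hang waiting for console input.
    cmd = ["7za.exe", "x", "-y", archive]
    # subprocess.run(cmd, check=True)  # executed by the job wrapper on Windows
    return cmd
```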
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Unfortunately, recent versions of PyTorch do not support all GPUs; older cards might not be compatible...
Regarding this error
RuntimeError: CUDA out of memory. Tried to allocate 446.00 MiB (GPU 0; 11.00 GiB total capacity; 470.54 MiB already allocated; 8.97 GiB free; 492.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
22:40:37 (12736): python.exe exited; CPU time 3346.203125
does it happen recurrently on the same machine, or does it depend on the job?
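For anyone hitting this, the allocator hint mentioned in the error message is set through an environment variable; a minimal sketch (the 128 MiB value is an arbitrary starting point, not a recommendation from the project):

```python
import os

# Must be set before PyTorch initialises CUDA for the first time.
# max_split_size_mb limits how large a cached block the allocator may
# split, which can reduce the fragmentation the error message hints at.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```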
____________
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
So what's going on here?
https://www.gpugrid.net/workunit.php?wuid=27228431
All kinds of errors on other tasks, from "card too old" (1080 Ti) to out of RAM.
Atm commit charge is 70 GB and RAM usage is 22 GB of 64 GB.
The problem is not with the card but with the Windows environment.
I have no issues running the Python on GPU tasks in Linux on my 1080 Ti card.
https://www.gpugrid.net/results.php?hostid=456812 |
|
|
|
Well so far, these new python WU's have been consistently completing and even surviving multiple reboots, OS kernel upgrades, and OS upgrades:
Kernels --> 5.17.13
OS Fedora35 --> Fedora36
3 machines w/GTX-1060 510.73.05 |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Yes, one nice thing about the Python gpu tasks is that they survive a reboot and can be restarted on a different gpu without erroring.
Very nice compared to the acemd3/4 tasks, which will error out under similar circumstances.
The Python tasks create and reread checkpoints very well. Upon restart the task will show 1% completion, but after a while it jumps forward to the point where the task was stopped, exited or suspended, and continues on till the finish. |
|
|
|
Yes, one nice thing about the Python gpu tasks is that they survive a reboot and can be restarted on a different gpu without erroring.
Good to know as I did not try a driver update or using a different GPU on a WU in progress.
I do think BOINC needs to patch their estimated time to completion. "XXX days remaining" makes it impossible to have any in a cache.
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
I haven't had any reason to carry a cache. I have my cache level set at only one task for each host as I don't want GPUGrid to monopolize my hosts and compete with my other projects.
That said, I haven't gone 12 hours without a Python task on every host at all times. |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Yes, one nice thing about the Python gpu tasks is that they survive a reboot and can be restarted on a different gpu without erroring.
Good to know as I did not try a driver update or using a different GPU on a WU in progress.
I do think BOINC needs to patch their estimated time to completion. XXXdays remaining makes it impossible to have any in a cache.
BOINC would have to completely rewrite that part of the code. The fact that these tasks run on both the cpu and gpu makes them impossible for BOINC to decipher.
The closest mechanism is the MT or multi-task category but that only knows about cpu tasks which run solely on the cpu. |
|
|
|
BOINC would have to completely rewrite that part of the code. The fact that these tasks run on both the cpu and gpu makes them impossible for BOINC to decipher.
The closest mechanism is the MT or multi-task category but that only knows about cpu tasks which run solely on the cpu.
I think BOINC uses the CPU exclusively in its Estimated Time to Completion algorithm for all WUs, including those using a GPU. That makes sense, since the job cannot complete until both processors' work is complete. Observing GPU work with E@H, it appears that the GPU finishes first and the CPU continues for a period of time to do what is necessary to wrap the job up for return, and those BOINC ETCs are fairly accurate.
It is the multi-thread WUs mentioned that appear to be throwing a monkey wrench at the ETC, like these python jobs. From my observations, the python WUs use 32 processes regardless of the actual system configuration; I have 2 Ryzen 16-core machines and my old FX-8350 8-core, and they each run 32 processes per WU. It seems to me that the existing algorithm could be used in a modular fashion: assume a single-thread CPU job for the MT WU, calculate the estimated time, and then, knowing the number of processes the WU requests compared with those available on the system, perform a simple division to produce a more accurate result for MT WUs as well. Don't know for sure, just speculating, but I do have the BOINC source code and might take a look and see if I can find the ETC stuff. Might be interesting. |
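The back-of-the-envelope scheme suggested above could be sketched like this (purely speculative, not how BOINC actually computes ETC; all names are made up):

```python
def naive_mt_etc(single_thread_secs, processes_requested, cores_available):
    # Estimate the task as if it were a single-thread CPU job, then
    # divide by the parallelism the host can actually provide.
    effective = min(processes_requested, cores_available)
    return single_thread_secs / effective

# A task requesting 32 processes on a 16-core host:
# naive_mt_etc(57_600, 32, 16) -> 3600.0 seconds
```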
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
The server code for determining the ETC for MT tasks also has to account for task scheduling.
If it was adjusted as you suggest, then anytime a Python task ran on the host, the server would proclaim it severely overcommitted and prevent any other work from running; or worse, it would actually prevent the Python task from running, since it blocks other projects' work in accordance with the resource share and round-robin scheduling algorithms in the server and client.
It is a mess already with MT work, I believe it would be even worse accounting for these mixed platform cpu-gpu Python tasks.
But go ahead and look at the code. Also you should raise an issue on BOINC's Github repository so that the problem is logged and can be tracked for progress. |
|
|
|
You make a good point regarding the server-side issues. Perhaps the projects themselves, if not already, could submit desired resources to allow the server to compare with those available on clients, similar to submitting in-house cluster jobs. I also agree that it is probably best to go through BOINC's git and raise a request for a potential fix, but I also want to see their ETC algorithms just out of curiosity, both server and client. Nice interesting discussion. |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
You need to review the code in the /client/work_fetch.cpp module and any of the old closed issues pertaining to use of max_concurrent statements in app_config.xml.
I have posted many conversations on this issue and collaborated with David Anderson and Richard Haselgrove to understand it, and I have seen at least six attempts to fix the issue once and for all.
A very complicated part of the code. You might also want to review many of the client emulator bug-fix runs done on this topic.
https://boinc.berkeley.edu/sim_web.php
The meat of the issue was in PRs #2918, #3001, #3065, #3076, #4117 and #4592
https://github.com/BOINC/boinc/pull/2918
Focus on the round-robin scheduling part of the code. |
|
|
|
Thank you Keith, much appreciated background and starting points. |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
I need advice with regard to running Python on one of my Windows machines:
On one of the Windows systems, with a GTX980Ti, CPU Intel i7-4930K and 32GB RAM, Python runs well.
GPU memory usage is almost constant at 2,679 MB; system memory usage varies between ~1,300 MB and ~5,000 MB. Task runtime is between ~510,000 and ~530,000 secs.
The other Windows system has two RTX3070, CPU Intel i9-10900KF and 64GB RAM, out of which 32GB are used for a Ramdisk, leaving 32GB of system RAM.
When trying to download Python tasks, the BOINC event log says that some 22GB more RAM are needed.
How come?
From what I see on the other machine, Python uses between 1.3GB and 5GB RAM.
What can I do in order to get the machine with the two RTX3070 to download and crunch Python tasks? |
|
|
|
BOINC event log says that some 22GB more RAM are needed.
Could you post the exact text of the log message and a few lines either side for context? We might be able to decode it. |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
BOINC event log says that some 22GB more RAM are needed.
Could you post the exact text of the log message and a few lines either side for context? We might be able to decode it.
here is the text of the log message:
26.06.2022 09:20:35 | GPUGRID | Requesting new tasks for CPU and NVIDIA GPU
26.06.2022 09:20:37 | GPUGRID | Scheduler request completed: got 0 new tasks
26.06.2022 09:20:37 | GPUGRID | No tasks sent
26.06.2022 09:20:37 | GPUGRID | No tasks are available for ACEMD 3: molecular dynamics simulations for GPUs
26.06.2022 09:20:37 | GPUGRID | Message from server: Python apps for GPU hosts needs 22396.05MB more disk space. You currently have 10982.55 MB available and it needs 33378.60 MB.
26.06.2022 09:20:37 | GPUGRID | Project requested delay of 31 seconds
The reason why at this point it says I have 10,982 MB available is because I currently have some LHC projects running which use some RAM.
However, it also says I need 33,378 MB RAM; so my 32GB RAM are not enough anyway (as seen on the other machine, on which I also have 32GB RAM, and there is no problem with downloading and crunching Python).
What I am surprised about is that the project requests so much free RAM, although while in operation it uses only between 1.3 and 5GB.
|
|
|
|
26.06.2022 09:20:37 | GPUGRID | Message from server: Python apps for GPU hosts needs 22396.05MB more disk space. You currently have 10982.55 MB available and it needs 33378.60 MB.
Disk, not RAM. Probably one or other of your disk settings is blocking it. |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
26.06.2022 09:20:37 | GPUGRID | Message from server: Python apps for GPU hosts needs 22396.05MB more disk space. You currently have 10982.55 MB available and it needs 33378.60 MB.
Disk, not RAM. Probably one or other of your disk settings is blocking it.
Oh sorry, you are perfectly right. My mistake, how dumb :-(
So, with my 32GB Ramdisk it does not work, when it says that it needs 33,378 MB.
What I could do, theoretically, is shift BOINC from the Ramdisk to the 1 GB SSD. However, the reason I installed BOINC on the Ramdisk was that the LHC ATLAS tasks which I crunch permanently have an enormous disk usage, and I don't want ATLAS to kill the SSD too early.
I guess there might be ways to install a second instance of BOINC on the SSD - I tried this on another PC years ago, but somehow I did not get it done properly :-( |
|
|
|
You'll need to decide which copy of BOINC is going to be your 'primary' installation (default settings, autorun stuff in the registry, etc.), and which is going to be the 'secondary'.
The primary one can be exactly what is set up by the installer, with one change. The easiest way is to add the line
<allow_multiple_clients>1</allow_multiple_clients>
to the options section of cc_config.xml (or set the value to 1 if the line is already present). That needs a client restart if BOINC's already running.
Then, these two batch files work for me. Adapt program and data locations as needed.
To run the client:
D:\BOINC\rh_boinc_test --allow_multiple_clients --allow_remote_gui_rpc --redirectio --detach_console --gui_rpc_port 31418 --dir D:\BOINCdata2\
To run a Manager to control the second client:
start D:\BOINC\boincmgr.exe /m /n 127.0.0.1 /g 31418 /p password
Note that I've set this up to run test clients alongside my main working installation - you can probably ignore that bit. |
|
|
Jim1348Send message
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0 Level
Scientific publications
|
We have a time estimation problem, discussed previously in the thread. As Keith mentioned, the real walltime calculation should be much less than reported.
It would be very helpful if you could let us know if that is the case. In particular, if you are getting 75,000 credits per job, it means the jobs are getting 25% extra credit for returning fast.
Are you still in need of that? My first Python ran for 12 hours 55 minutes according to BoincTasks, but the website reported 156,269.60 seconds (over 43 hours). It got 75,000 credits.
http://www.gpugrid.net/results.php?hostid=593715 |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Thanks for the feedback Jim1348! It is useful for us to confirm that jobs run in a reasonable time despite the wrong estimation issue. Maybe that can be solved somehow in the future. It seems that at least it did not estimate dozens of days, as I have seen on other occasions.
____________
|
|
|
|
It's because the app is using the CPU time instead of runtime. Since it uses so many threads, it adds up the time spent on all the threads: 2 threads working for 1hr would total 2hrs of reported CPU time. You need to track wall clock time; the app seems to have this capability, since it reports timestamps of start and stop in the stderr.txt file.
Also, the credit reward is static, and should be a more dynamic scheme like the acemd3 tasks. Look at Jim's tasks: you have tasks with 2,000 - 150,000 seconds (reported), all with the same 75,000 credit reward. Good reward for the 2,000s runs, but painfully low for the longer ones (the majority).
____________
|
|
|
|
There are two separate problems with timing.
There's the display of CPU time instead of elapsed time on the website - that's purely cosmetic, as we report the correct elapsed time for the finished tasks.
And there's the estimation of anticipated runtime when a task is first issued, before it's even started to run. I would have thought that would have started to correct itself by now: with the steady supply of work recently, we will have got well past all the trigger points for the server averaging algorithms.
Next time I see a task waiting to run, I'll trap the numbers and try to make sense of them. |
|
|
|
There's the display of CPU time instead of elapsed time on the website - that's purely cosmetic, as we report the correct elapsed time for the finished tasks.
That may be true now. However, if they move to a dynamic credit scheme (as they should) that awards credit based on flops and runtime (like ACEMD3 does), then the runtime will not be just cosmetic.
____________
|
|
|
|
OK, I got one on host 508381. Initial estimate is 752d 05:26:18, task is 32940037
Size:
<rsc_fpops_est>1000000000000000000.000000</rsc_fpops_est>
Speed:
<flops>707646935000.048218</flops>
DCF:
<duration_correction_factor>45.991658</duration_correction_factor>
App_ver:
<app_name>PythonGPU</app_name>
<version_num>403</version_num>
Host details:
Number of tasks completed 80
Average processing rate 13025.358204684
Calculated time estimate (size / speed):
1413134.079355548 [seconds]
16.355718511 [days - raw]
752.226612105 [days - adjusted by DCF]
So my client is doing the calculations right.
The glaring difference is between flops and APR.
Re-doing the {size / speed} calculation with APR gives
76773.320494203 [seconds]
21.32592236 [hours]
which is a little high for this machine, but not bad. The last 'normal length' tasks ran in about 14 hours.
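The arithmetic above can be checked directly; plugging in the numbers quoted from the client files reproduces both the absurd estimate and the sensible one:

```python
fpops_est = 1e18                 # <rsc_fpops_est>, flops
flops = 707_646_935_000.048218   # server-assumed host speed, flops/s
dcf = 45.991658                  # duration correction factor
apr = 13_025.358204684e9         # average processing rate, flops/s

raw_secs = fpops_est / flops              # ~1,413,134 s (~16.36 days)
adjusted_days = raw_secs * dcf / 86_400   # ~752.2 days: the absurd estimate
apr_secs = fpops_est / apr                # ~76,773 s (~21.3 h): realistic
```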
So, the question is: why is the server tracking APR, but not using it in the <app_version> blocks sent to our machines? |
|
|
|
Yesterday's task is just in the final stages - it'll finish after about 13 hours - and the next is ready to start. So here are the figures for the next in the cycle.
Initial estimate: 737d 06:19:25
<flops>707646935000.048218</flops>
<duration_correction_factor>45.076802</duration_correction_factor>
Average processing rate 13072.709605774
So, APR and DCF have both made a tiny movement in the right direction, but flops has remained stubbornly unchanged. And that's the one that controls the initial estimates.
(actually, a little short one crept in between the two I'm watching, so it's two cycles - but that doesn't change the principle) |
|
|
roundup Send message
Joined: 11 May 10 Posts: 63 Credit: 9,096,655,193 RAC: 54,351,602 Level
Scientific publications
|
The credits per runtime for cuda1131 really look strange sometimes:
Task 27246643: 2 Jul 2022 | 8:13:32 UTC - 3 Jul 2022 | 8:20:56 UTC
Runtimes 445,161.60 / 445,161.60, Credits 62,500.00
Compare to this one:
Task 27246622: 2 Jul 2022 | 7:55:03 UTC - 2 Jul 2022 | 8:05:39 UTC, Runtimes 2,770.92 / 2,770.92, Credits 75,000.00 |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Yes, you are right about that. There are 2 types of experiments I run now:
a) Normal experiments have tasks with a fixed target number of agent-environment interactions to process. The tasks finish once this number of interactions is reached. All tasks require the same amount of compute, so it makes sense (at least to me) to reward them with the same amount of credits, even if some tasks are completed in less time due to faster hardware.
b) I have recently introduced an "early stopping" mechanism in some experiments. The upper bound is the same as in the other type of experiment: a fixed amount of agent-environment interactions. However, if the agent discovers interesting results before that, it returns early so this information can be shared with the other agents in the population of AI agents. Which agents will finish earlier, and how much earlier, is random, so it would be interesting to adjust the credits dynamically, yes. I will ask the acemd3 people how to do it.
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
The credit system gives 50,000 credits per task. However, completion before a certain amount of time multiplies this value by 1.5, then by 1.25 for a while, and finally by 1.0 indefinitely. That explains why sometimes you see 75,000 and sometimes 62,500 credits.
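That tiering can be written out as a tiny sketch (the actual cutoff times are not stated here, so the two thresholds are hypothetical parameters):

```python
def credit_award(elapsed_hours, fast_cutoff, mid_cutoff, base=50_000):
    # 1.5x for very fast returns, 1.25x for a while, then 1.0x indefinitely.
    if elapsed_hours <= fast_cutoff:
        return base * 1.5    # -> 75,000
    if elapsed_hours <= mid_cutoff:
        return base * 1.25   # -> 62,500
    return base * 1.0        # -> 50,000
```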
____________
|
|
|
Toby BroomSend message
Joined: 11 Dec 08 Posts: 25 Credit: 431,137,443 RAC: 5,557,343 Level
Scientific publications
|
I had an idea after reading some of the posts about utilisation of resources.
The power users here tend to have high-end hardware on the project, so would it be possible to support our hardware fully? E.g. I imagine that if you have 10-24 GB of VRAM, the whole simulation could be loaded into VRAM, giving additional performance to the project.
Additionally, the more modern cards have more ML-focused hardware-accelerated features; are they well utilised?
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
The reason Reinforcement Learning agents do not currently use the full potential of the cards is that the interactions between the AI agent and the simulated environment are performed on the CPU, while the agent's "learning" process is the one that uses the GPU, intermittently.
There are, however, environments that run only on the GPU. They are becoming more and more common, so I see it as a real possibility that in the future the most popular benchmarks of the field will use only the GPU. Then the jobs will be much more efficient, since pretty much only the GPU will be used. Unfortunately we are not there yet...
I am not sure if I am answering your question, please let me know if I am not.
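To illustrate the split described above, here is a toy sketch of one training iteration: collection runs on the CPU, learning would run on the GPU in the real app. Everything here (ToyEnv, ToyAgent, the step count) is made up for illustration, not GPUGrid's actual code:

```python
import random

class ToyEnv:
    """Stand-in for a CPU-bound simulated environment."""
    def reset(self):
        self.t = 0
        return 0.0
    def step(self, action):
        self.t += 1
        obs = random.random()
        reward = 1.0 if action > 0 else 0.0
        done = self.t >= 50
        return obs, reward, done

class ToyAgent:
    """Stand-in for the learner."""
    def act(self, obs):
        return 1 if obs < 0.5 else -1
    def update(self, rollout):
        # In the real app this is where batched gradient steps
        # would run on the GPU, intermittently.
        return len(rollout)

def collect_and_learn(env, agent, steps=128):
    # Phase 1 (CPU): agent-environment interactions.
    obs = env.reset()
    rollout = []
    for _ in range(steps):
        action = agent.act(obs)
        obs, reward, done = env.step(action)
        rollout.append((obs, action, reward))
        if done:
            obs = env.reset()
    # Phase 2 (GPU in the real app): learn from the collected batch.
    return agent.update(rollout)
```

The GPU sits idle during phase 1, which is why utilisation looks low with CPU-bound environments.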
____________
|
|
|
Toby BroomSend message
Joined: 11 Dec 08 Posts: 25 Credit: 431,137,443 RAC: 5,557,343 Level
Scientific publications
|
Thanks for the comments. What about using a large quantity of VRAM if available? The latest BOINC finally allows correct reporting of VRAM on NVIDIA cards, so you could tailor the WUs based on VRAM to protect the contributions from users with lower-specification computers. |
|
|
FritzBSend message
Joined: 7 Apr 15 Posts: 12 Credit: 2,769,441,100 RAC: 3,409,357 Level
Scientific publications
|
Sorry for the OT, but some people need admin help and I've seen one being active here :)
Password reset doesn't work, and there seems to have been an alternative method some years ago. Maybe this can be done again?
Please have a look in this thread: http://www.gpugrid.net/forum_thread.php?id=2587&nowrap=true#58958
Thanks!
Fritz |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Hi Fritz! Apparently the problem is that sending emails from server no longer works. I will mention the problem to the server admin.
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
I talked to the server admin and he explained to me the problem in more detail.
The issue comes from the fact that the GPUGrid server uses a public IP from the Universitat Pompeu Fabra, so we have to comply with the data protection and security policies of the university. Among other things this implies that we can not send emails from our web server.
Therefore, unfortunately that prevents us from fixing the password recovery problem.
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Hello Toby,
For the python app, do you mean executing a script that automatically detects how much memory the GPU to which the task has been assigned has, and then flexibly defines an agent that uses it all (or most of it)? In other words, flexibly adapting to the host machine's capacity.
The experiments we are running at the moment require training AI agents in a sequence of jobs (i.e. starting to train an agent in a GPUGrid job, then sending it back to the server to evaluate its capabilities, then sending another job that loads the same agent and continues its training, evaluating again, etc.).
Consequently, current jobs are designed to work with a fixed amount of GPU memory, and we cannot set it too high, since we want a high percentage of hosts to be able to run them.
However, it is true that by doing that we are sacrificing resources on GPUs with larger amounts of memory. You gave me something to think about; there could be situations in which this approach could make sense and would indeed be a more efficient use of resources.
____________
|
|
|
Toby BroomSend message
Joined: 11 Dec 08 Posts: 25 Credit: 431,137,443 RAC: 5,557,343 Level
Scientific publications
|
BOINC can detect the quantity of GPU memory. It was bugged in older BOINC versions for NVIDIA cards, but in 7.20 it's fixed, so there would be no need to detect it in Python as it's already in the project database.
A variable job size, yes.
It's more work for you, but I can imagine there could be a performance boost. To keep it simple you could have S, M and L with, say, <4, 4-8 and >8 GB? The jobs for GPUs with more than 8 GB could be larger in general, as only the top-tier GPUs have this much VRAM.
It seems BOINC knows how to allocate to suitable computers. Worst case, you could make it opt-in.
|
|
|
JohnMD Send message
Joined: 4 Dec 10 Posts: 5 Credit: 26,860,106 RAC: 0 Level
Scientific publications
|
Even video cards with 6GiB crash with insufficient VRAM.
The app is apparently not aware of available resources.
This ought to be the first priority before sending tasks to the world. |
|
|
jjchSend message
Joined: 10 Nov 13 Posts: 101 Credit: 15,569,300,388 RAC: 3,786,488 Level
Scientific publications
|
From what we are finding right now, the 6GB GPUs would have sufficient VRAM to run the current Python tasks. Refer to this thread, noting between 2.5 and 3.2 GB being used: https://www.gpugrid.net/forum_thread.php?id=5327
If jobs running on GPUs with 4GB or more are crashing, then there is a different problem. We'd have to look at the logs to see what's going on.
It's more likely they are running out of system memory or swap space, but there are a few that are failing from an unknown cause.
I took a quick look at the jobs you have which errored, and I found the MX150 and MX350 GPUs only have 2GB VRAM. That is not sufficient to run the Python app.
Unfortunately, I would suggest you use these GPUs for another project they are better suited for.
|
|
|
|
New generic error on multiple tasks this morning:
TypeError: create_factory() got an unexpected keyword argument 'recurrent_nets'
Seems to affect the entire batch currently being generated. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Thanks for letting us know Richard. It is a minor error, sorry for the inconvenience, I am fixing it right now. Unfortunately the remaining jobs of the batch will crash but then will be replaced with correct ones.
____________
|
|
|
|
No worries - these things happen. The machine which alerted me to the problem now has a task 'created 28 Jul 2022 | 10:33:04 UTC' which seems to be running normally.
The earlier tasks will hang around until each of them has gone through 8 separate hosts, before your server will accept that there may have been a bug. But at least they don't waste much time. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Yes exactly, it has to fail 8 times... the only good part is that the bugged tasks fail at the beginning of the script so almost no computation is wasted. I have checked and some of the tasks in the newest batch have already finished successfully.
____________
|
|
|
|
A peculiarity of Python apps for GPU hosts 4.03 (cuda1131):
If BOINC is shut down while such a task is in progress, then restarted, the task will show 2% progress at first, even if it was well past this before the shutdown.
However, the progress may then jump past 98% at the next time a checkpoint is written, which looks like the hidden progress is recovered.
Not a definite problem, but you should be aware of it. |
|
|
|
I've been monitoring and playing with the initial runtime estimates for these tasks.
The Y-axis has been scaled by various factors of 10 to make the changes legible.
The initial estimates (750 days to 230 days) are clearly dominated by the DCF (real numbers, unscaled).
The <flops> - the speed of processing, 707 or 704 GigaFlops, assumed by the server. There's a tiny jump midway through the month, which correlates with a machine software update, including a new version of BOINC, and reboot. That will have triggered a CPU benchmark run.
The DCF (client controlled) has been falling very, very slowly. It's so far distant from reality that BOINC moves it at an ultra-cautious 1% of the difference at the conclusion of each successful run. The changes in slope come about because of the varying mixture of short-running (early exit) tasks and full-length tasks.
The APR has been wobbling about, again because of the varying mixture of tasks, but seems to be tracking the real world reasonably well. The values range from 13,000 to nearly 17,000 GigaFlops.
Conclusion:
The server seems to be estimating the speed of the client using some derivative of the reported benchmark for the machine. That's absurd for a GPU-based project: the variation in GPU speeds is far greater than the variation of CPU speeds. It would be far better to use the APR, but with some safeguards and greater regard to the actual numbers involved.
The chart was derived from host 508381, which has a measured CPU speed of 7.256 GigaFlops (roughly one-tenth of the speed assumed by the server), and all tasks were run on the same GTX 1660 Ti GPU, with a theoretical ('peak') speed of 5,530 GigaFlops. Congratulations to the GPUGrid programmers - you've exceeded three times the speed of light (according to APR)!
More seriously, that suggests that the 'size' setting for these tasks (fpops_est) - the only value the project actually has to supply manually - is set too low. This may have been the point at which the estimates started to go wrong.
One further wrinkle: BOINC servers can't fully allow for varying runtimes and early task exits. Old hands will remember the problems we had with 'dash-9' (overflow) tasks at SETI@home. We overcame that one by adding an 'outlier' pathway to the server code: if the project validator marks the task as an outlier, its runtime is disregarded when tracking APR - that keeps things a lot more stable. Details at https://boinc.berkeley.edu/trac/wiki/ValidationSimple#Runtimeoutliers |
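The ultra-cautious DCF behaviour described above amounts to moving 1% of the way toward the observed ratio per validated task; as a sketch (this mirrors the description in the post, not the exact client source):

```python
def update_dcf(dcf, observed_ratio, step=0.01):
    # observed_ratio = actual_runtime / estimated_runtime for the task.
    # With DCF near 46 and a true ratio near 1, each success shaves off
    # only ~1% of the gap, hence the very slow fall seen on the chart.
    return dcf + step * (observed_ratio - dcf)

# One validated task starting from the quoted DCF of 45.99:
# update_dcf(45.99, 1.0) -> ~45.5401
```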
|
|
|
Or just use the flops reported by BOINC for the GPU, since it is recorded and communicated to the project, and from my experience (with ACEMD tasks) it does get used in the credit reward for the non-static award scheme. So the project is certainly getting it and able to use that value.
____________
|
|
|
|
Except:
1) A machine with two distinct GPUs only reports the peak flops of one of them. (The 'better' card, which is usually - but not always - the faster card).
2) Just as a GPU doesn't run at 10x the speed of the host CPU, it doesn't run realistic work at peak speed, either. That would involve yet another semi-realistic fiddle factor. And Ian will no doubt tell me that fancy modern cards, like Turing and Ampere, run closer to peak speed than earlier generations.
We need to avoid having too many moving parts - too many things to get wrong when the next rotation of researchers takes over. |
|
|
|
Personally I'm a big fan of just standardizing the task computational size and assigning static credit, no matter the device used or how long it takes. Just take flops out of the equation completely; that way faster devices get more credit/RAC based on the rate at which valid tasks are returned.
The only caveat is the need to make all the tasks roughly the same "size" computationally. But that seems easier than all the hoops to jump through to accommodate all the idiosyncrasies of BOINC, various systems, and task differences.
____________
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
The latest Python tasks I've done today have awarded 105,000 credits as compared to all the previous tasks at 75,000 credits.
Looking back through the job_log, the estimated computation size has been at 1B GFLOPS for quite a while now.
Nothing has changed in the current task parameters as far as I can tell.
Estimated computation size
1,000,000,000 GFLOPs
So I assume that Abouh has decided to award more credits for the work done.
Anyone notice this new award level?
They are generally taking longer to crunch than the previous ones, so maybe it is just scaling. |
|
|
Jim1348Send message
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0 Level
Scientific publications
|
Anyone notice this new award level?
I just got my first one.
http://www.gpugrid.net/workunit.php?wuid=27270757
But not all the new ones receive that. A subsequent one received the usual 75,000 credit. |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Thanks for your report. It doesn't really track with scaling now that I examine my tasks.
Some are getting the new higher reward for 2 hours of computation but some are still getting the lower reward for 8 hours of computation.
I was getting what was the standard reward for tasks taking as little as 20 minutes of computation time. So the 75K was a little excessive in my opinion.
These new ones are trending at 2-3 hours of computation time. But I also had one take 11 hours and was still rewarded with only the 105K.
Maybe we are finally getting into the meat of the AI/ML investigation after all the initial training we have been doing.
Still sitting on 3 new acemd3 tasks that haven't been looked at for two days and will only get the standard reward, since the client scheduler feels no need to push them to the front: their APR and estimated completion times are correct and reasonable. I'd really like the Python tasks to get realistic APRs and estimated completion times. But since they are predominantly a cpu task with a little bit of gpu computation, BOINC has no clue how to handle them.
Maybe Abouh can post some insight as to what the current investigation is doing. |
|
|
|
My first 'high rate' task (105K credits) was a workunit created at 10 Aug 2022 | 2:03:51 UTC.
Since then, I've only received one 75K task: my copy was issued to me at 10 Aug 2022 | 21:15:47 UTC, but the underlying workunit was created at 9 Aug 2022 | 13:44:09 UTC - I got a resend after two previous failures by other crunchers.
My take is that the 'tariff' for GPUGrid tasks is set when the underlying workunit is created, and all subsequent tasks issued from that workunit inherit the same value. |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
That implies the current release candidates are being assigned 105K credit based, I assume, on harder-to-crunch datasets.
I don't think it depends on a recent release date either. I just had a task created on 12 August (_0) and it awarded only 75K after passing through one other host before I got it. |
|
|
Aurum Send message
Joined: 12 Jul 17 Posts: 401 Credit: 16,755,010,632 RAC: 220,113 Level
Scientific publications
|
Which apps are running these days? The apps page is missing the column that shows how much is running: https://www.gpugrid.net/apps.php
How many CPU threads do I need to run to finish Python WUs in a reasonable time for say an i9-9980XE?
Trying to update my app_config to give it a go. The last one I found was pretty old. Here's what I've cobbled together. Suggestions welcome.
<app_config>
<!-- i9-10980XE 18c36t 32 GB L3 Cache 24.75 MB -->
<app>
<name>acemd3</name>
<plan_class>cuda1121</plan_class>
<gpu_versions>
<cpu_usage>1.0</cpu_usage>
<gpu_usage>1.0</gpu_usage>
</gpu_versions>
<fraction_done_exact/>
</app>
<app>
<name>acemd4</name>
<plan_class>cuda1121</plan_class>
<gpu_versions>
<cpu_usage>1.0</cpu_usage>
<gpu_usage>1.0</gpu_usage>
</gpu_versions>
<fraction_done_exact/>
</app>
<app>
<name>PythonGPU</name>
<plan_class>cuda1121</plan_class>
<gpu_versions>
<cpu_usage>4.0</cpu_usage>
<gpu_usage>1.0</gpu_usage>
</gpu_versions>
<app_version>
<app_name>PythonGPU</app_name>
<avg_ncpus>4</avg_ncpus>
<ngpus>1</ngpus>
<cmdline>--nthreads 4</cmdline>
</app_version>
<fraction_done_exact/>
<max_concurrent>1</max_concurrent>
</app>
<app>
<name>PythonGPUbeta</name>
<plan_class>cuda1121</plan_class>
<gpu_versions>
<cpu_usage>4.0</cpu_usage>
<gpu_usage>1.0</gpu_usage>
</gpu_versions>
<app_version>
<app_name>PythonGPU</app_name>
<avg_ncpus>4</avg_ncpus>
<ngpus>1</ngpus>
<cmdline>--nthreads 4</cmdline>
</app_version>
<fraction_done_exact/>
<max_concurrent>1</max_concurrent>
</app>
<app>
<name>Python</name>
<plan_class>cuda1121</plan_class>
<cpu_usage>4</cpu_usage>
<gpu_versions>
<cpu_usage>4</cpu_usage>
<gpu_usage>1</gpu_usage>
</gpu_versions>
<app_version>
<app_name>PythonGPU</app_name>
<avg_ncpus>4</avg_ncpus>
<ngpus>1</ngpus>
<cmdline>--nthreads 4</cmdline>
</app_version>
<fraction_done_exact/>
<max_concurrent>1</max_concurrent>
</app>
</app_config> |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
I get away with reserving only 3 cpu threads. That does not affect what the actual task does when it runs - just BOINC's cpu scheduling for other projects.
It will always spawn 32 independent python processes when running.
And you really should update or remove the plan class statements for Python on GPU since your plan_class is incorrect.
Current plan_class is cuda1131 NOT cuda1121
You can also clean up your app_config, as there is only the PythonGPU application - no Python or PythonGPUbeta application. |
|
|
|
Hi, guys!
I have not particularly followed Python GPU app (for Windows) and this thread, so perhaps this issue has already been discussed somewhere on the forum.
It seems I only tried once before, and all the tasks I received crashed almost immediately after starting.
I was surprised that when a WU started, the system's limit on virtual memory (commit charge) was reached.
Today I tried to understand the problem in more detail and was surprised again to find that the application addresses ~42 GiB of virtual memory in total!
At the same time, the total consumption of physical memory is about 4 times less (~10 GiB).
For example
So the question is - is that intended?..
I had to create a 30 GiB swap file to cover this difference so that I could run something else on the system besides one WU of Python GPU -_- |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Hi, guys!
I have not particularly followed Python GPU app (for Windows) and this thread, so perhaps this issue has already been discussed somewhere on the forum.
It seems I only tried once before, and all the tasks I received crashed almost immediately after starting.
I was surprised that when a WU started, the system's limit on virtual memory (commit charge) was reached.
Today I tried to understand the problem in more detail and was surprised again to find that the application addresses ~42 GiB of virtual memory in total!
At the same time, the total consumption of physical memory is about 4 times less (~10 GiB).
For example
So the question is - is that intended?..
I had to create a 30 GiB swap file to cover this difference so that I could run something else on the system besides one WU of Python GPU -_-
Yes, because of flaws in Windows memory management, that effect cannot be gotten around. You need to increase the size of your pagefile to the 50GB range to be safe.
Linux does not have the problem and no changes are necessary to run the tasks.
The project primarily develops Linux applications first as the development process is simpler. Then they tackle the difficulties of developing a Windows application with all the necessary workarounds.
Just the way it is. For the reason why read this post.
https://www.gpugrid.net/forum_thread.php?id=5322&nowrap=true#58908 |
|
|
|
Thank you for clarification.
I was not familiar with subtleties of the memory allocation mechanism in Windows.
That was useful.
And I already increased swap to the RAM value (64GB) to be sure ;)
Upd.
And the reward system for this app clearly begs for revision... : /
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Task credits are fixed. Pay no attention to the running times. BOINC completely mishandles that since it has no recognition of the dual nature of these cpu-gpu application tasks.
They should be thought of as primarily a cpu application with a little gpu use thrown in occasionally.
[Edit] Look at the delta between sent time and returned time to determine the actual runtime that the task took.
In your example, the first listed task took only 20 minutes to finish, the second took 4 1/2 hours and the last took 4 hours. It all depends on the different parameter sets for each task, which set the criteria for the reinforcement learning on the gpu. |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
Can anyone tell me what happened to this task:
https://www.gpugrid.net/result.php?resultid=32997605
which failed after 301.281 seconds :-((( |
|
|
|
RuntimeError: [enforce fail at ..\c10\core\CPUAllocator.cpp:76] data. DefaultCPUAllocator: not enough memory: you tried to allocate 3612672 bytes.
It's possibly the Windows swap file settings, again. |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
RuntimeError: [enforce fail at ..\c10\core\CPUAllocator.cpp:76] data. DefaultCPUAllocator: not enough memory: you tried to allocate 3612672 bytes.
It's possibly the Windows swap file settings, again.
thanks Richard for the quick reply.
I now changed the page file size to max. 65GB.
I did it on both drives: system drive C:/ and drive F:/ (on a separate SSD) on which BOINC is running.
Probably changing it on only one drive would have been okay, right? If so, which one?
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
The Windows one. |
|
|
Jim1348Send message
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0 Level
Scientific publications
|
I am a bit surprised that I am able to run the pythons without problem under Ubuntu 20.04.4 on a GTX 1060. It has 3GB of video memory, and uses 2.8GB thus far. And the CPU is currently running two cores (down from the previous four cores), using about 3.7GB of memory, though reserving 19 GB.
Even on Win10, my GTX 1650 Super has had no problems, though it has 4GB of memory and uses 3.6GB. But I have 32GB system memory, and for once I let Windows manage the virtual memory itself. It is reserving 42GB. I usually set it to a lower value. |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
The Windows one.
thx :-) |
|
|
Toby BroomSend message
Joined: 11 Dec 08 Posts: 25 Credit: 431,137,443 RAC: 5,557,343 Level
Scientific publications
|
Can the CPU usage be adjusted correctly? It's fine to use a number of cores, but currently it says less than one and uses more than one. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Hello! sorry for the late reply
I adjusted the maximum length of some of the tasks and consequently also adjusted the credits for completing them. What I mean by that is that each one of my tasks contains an agent interacting with its environment and learning from a fixed number of total interaction steps. Previously I set that number to 25M steps. Now I increased it to 35M for some tasks and consequently also increased the reward.
This increase in the number of steps does not necessarily increase the completion time of the task, because if an agent discovers something relevant before reaching the maximum number of steps, the task ends and the “new information” is sent back to be shared with the other agents in the population. Whether that happens or not is random, but on average the task completion time will increase a bit due to the ones that reach 35M steps, so the reward has to increase as well. This change does not affect hardware requirements.
This randomness also explains why some tasks are shorter but still receive the same reward (credits per task are fixed). However, the average credit reward should be similar for all hosts as they solve more and more tasks. Also the average task completion time should remain stable.
As I have mentioned, I work with populations of AI agents that try to cooperatively solve a single complex problem. Note that as more things are discovered by agents in a population the harder it becomes to keep discovering new ones. In general, early tasks in an experiment return quite fast, while as the experiment progresses the 35M steps mark gets hit more and more often (and tasks take longer to complete).
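A minimal sketch of that early-exit behaviour (all names and numbers below are hypothetical illustrations, not the actual GPUGrid job code):

```python
# Hypothetical sketch of the early-exit training loop described above:
# an agent runs for at most max_steps interaction steps, but the task
# ends early if a "relevant discovery" is made along the way.
import random

def run_task(max_steps=35_000_000, discovery_prob=1e-7):
    """Return (steps_used, discovered)."""
    for step in range(1, max_steps + 1):
        # ... one environment interaction + learning update would go here ...
        if random.random() < discovery_prob:  # stand-in for a real discovery test
            return step, True                 # report back to the server early
    return max_steps, False                   # hit the step cap

steps, discovered = run_task(max_steps=1000, discovery_prob=0.01)
```

Because the exit point is random, two tasks with identical settings can have very different runtimes, which matches the fixed-credit-per-task reasoning above.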
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
The current value of rsc_fpops_est is 1e18, with 10e18 as the limit. I remember we had to increase it because otherwise it produced false "task aborted by host" errors on some users' side. Do you think we should change it again?
Regarding cpu_usage, I remember having this discussion with Toni and I think the reason why we set the number of cores to that number is because with a single core the jobs can actually be executed. Even if they create 32 threads. Definitely do not require 32 cores. Is there an advantage of setting it to an arbitrary number higher than 1? Couldn't that cause some allocation problems? sorry it is a bit outside of my knowledge zone...
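For readers wondering what those numbers mean in practice: BOINC's initial runtime estimate is roughly rsc_fpops_est divided by the speed the client attributes to the device (later corrected per-host by APR). A rough sketch, where the 10 TFLOPS device rating is a made-up example value:

```python
# Simplified illustration of how BOINC turns rsc_fpops_est into an
# initial runtime estimate. The real client also applies per-host
# corrections (APR / duration correction factor).
rsc_fpops_est = 1e18      # current value mentioned in the post above
device_flops = 10e12      # hypothetical: a device BOINC rates at 10 TFLOPS

est_seconds = rsc_fpops_est / device_flops
est_hours = est_seconds / 3600
print(f"{est_seconds:.0f} s, about {est_hours:.1f} h")  # → 100000 s, about 27.8 h
```

If the estimate is far too low, actual runtimes can blow past the rsc_fpops bound and the client kills the task, which is presumably the mechanism behind the false "task aborted by host" errors mentioned above.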
____________
|
|
|
|
Regarding cpu_usage, I remember having this discussion with Toni and I think the reason why we set the number of cores to that number is because with a single core the jobs can actually be executed. Even if they create 32 threads. Definitely do not require 32 cores. Is there an advantage of setting it to an arbitrary number higher than 1? Couldn't that cause some allocation problems? sorry it is a bit outside of my knowledge zone...
This is a consequence of the handling of GPU plan_classes in the released BOINC server code. In the raw BOINC code, the cpu_usage value is calculated by some obscure (and, in all honesty, irrelevant and meaningless) calculation of the ratio of the number of flops that will be performed on the CPU and the GPU - the GPU, in particular, being assumed to be processing at an arbitrary fraction of the theoretical peak speed. In short, it's useless.
I don't think the raw BOINC code expects you to make manual alterations to the calculated value. If you've found a way of over-riding and fixing it - great. More power to your elbow.
The current issue arises because the Python app is neither a pure GPU app, nor a pure multi-threaded CPU app. It operates in both modes - and the BOINC developers didn't think of that.
I think you need to create a special, new, plan_class name for this application, and experiment on that. Don't meddle with the existing plan_classes - that will mess up the other GPUGrid lines of research.
I'm running with a manual override which devotes the whole GPU power, plus 3 CPUs, to the Python tasks. That seems to work reasonably well: it keeps enough work from other BOINC projects off the CPU while Python is running. |
|
|
KAMasudSend message
Joined: 27 Jul 11 Posts: 137 Credit: 523,901,354 RAC: 0 Level
Scientific publications
|
Regarding cpu_usage, I remember having this discussion with Toni and I think the reason why we set the number of cores to that number is because with a single core the jobs can actually be executed. Even if they create 32 threads. Definitely do not require 32 cores. Is there an advantage of setting it to an arbitrary number higher than 1? Couldn't that cause some allocation problems? sorry it is a bit outside of my knowledge zone...
This is a consequence of the handling of GPU plan_classes in the released BOINC server code. In the raw BOINC code, the cpu_usage value is calculated by some obscure (and, in all honesty, irrelevant and meaningless) calculation of the ratio of the number of flops that will be performed on the CPU and the GPU - the GPU, in particular, being assumed to be processing at an arbitrary fraction of the theoretical peak speed. In short, it's useless.
I don't think the raw BOINC code expects you to make manual alterations to the calculated value. If you've found a way of over-riding and fixing it - great. More power to your elbow.
The current issue arises because the Python app is neither a pure GPU app, nor a pure multi-threaded CPU app. It operates in both modes - and the BOINC developers didn't think of that.
I think you need to create a special, new, plan_class name for this application, and experiment on that. Don't meddle with the existing plan_classes - that will mess up the other GPUGrid lines of research.
I'm running with a manual override which devotes the whole GPU power, plus 3 CPUs, to the Python tasks. That seems to work reasonably well: it keeps enough work from other BOINC projects off the CPU while Python is running.
Could you tell us a bit more about this manual override? Just now it is sprawled over five cores, ten threads. If it sees the sixth core free, it grabs that one also. |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
If you run other projects concurrently, then it is advisable to limit the number of cores the Python tasks occupy for scheduling. I am not talking about the number of threads each task uses, since that is fixed.
Just create an app_config.xml file and place it into the GPUGrid projects directory and either re-read config files from the Manager or just restart BOINC.
The file minimally just needs this:
<app_config>
<app>
<name>PythonGPU</name>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>3.0</cpu_usage>
</gpu_versions>
</app>
</app_config>
This tells the BOINC client not to overcommit the cpu with other projects' work, as the Python app gets 3 cores reserved for its use.
I have found that to be plenty even when running 95% of all cpu cores on 3 other cpu projects along with running 2 other gpu projects which also use some or all of a cpu core to process the gpu task. |
|
|
KAMasudSend message
Joined: 27 Jul 11 Posts: 137 Credit: 523,901,354 RAC: 0 Level
Scientific publications
|
If you run other projects concurrently, then it is adviseable to limit the number of cores the Python tasks occupies for scheduling. I am not talking about the number of threads each task uses since that is fixed.
Just create an app_config.xml file and place it into the GPUGrid projects directory and either re-read config files from the Manager or just restart BOINC.
The file minimally just needs this:
<app_config>
<app>
<name>PythonGPU</name>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>3.0</cpu_usage>
</gpu_versions>
</app>
</app_config>
This will tell the BOINC client not to overcommit other projects cpu usage as the Python app gets 3 cores reserved for its use.
I have found that to be plenty even when running 95% of all cpu cores on 3 other cpu projects along with running 2 other gpu projects which also use some or all of a cpu core to process the gpu task.
Thank you Keith. Why is it using so many cores plus is it something like OpenIFS on CPDN? |
|
|
|
Thank you Keith. Why is it using so many cores plus is it something like OpenIFS on CPDN?
Yes - or nbody at MilkyWay. This Python task shares characteristics of a cuda (GPU) plan class, and a MT (multithreaded) plan class, and works best if treated as such. |
|
|
|
Possible bad workunit: 27278732
ValueError: Expected value argument (Tensor of shape (1024,)) to be within the support (IntegerInterval(lower_bound=0, upper_bound=17)) of the distribution Categorical(logits: torch.Size([1024, 18])), but found invalid values:
tensor([ 7, 9, 7, ..., 10, 9, 3], device='cuda:0') |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Interesting, I had never seen this error before. Thank you!
____________
|
|
|
Toby BroomSend message
Joined: 11 Dec 08 Posts: 25 Credit: 431,137,443 RAC: 5,557,343 Level
Scientific publications
|
Thanks Richard, is 3 CPU cores enough to not slow down the GPU? |
|
|
|
I'm noticing an interesting difference in application behavior between different systems. abouh, can you help explain the reason?
I can see that each running task will spawn 32x processes (multiprocessing.spawn) as well as [number of cores]x processes for the main run.py application.
so on my 8-core/16-thread Intel system, a single running task spawns 8x run.py processes, and 32x multiprocessing.spawn threads.
and on my 24-core/48-thread AMD EPYC system, a single running task spawns 24x run.py processes, and 32x multiprocessing.spawn threads.
What is confusing is the utilization of each thread between these systems.
the EPYC system uses ~600-800% CPU for the run.py process (~20-40% per thread)
whereas the Intel system uses ~120% CPU for the run.py process (~2-5% per thread)
I replicated the same high CPU use on another EPYC system (in a VM) where I've constrained it to the same 8 cores/16 threads, and again it's using a much larger share of the CPU than the Intel system.
Is the application coded in some way that forces more work to be done on more modern processors? As far as I can tell, the increased CPU use isn't making the overall task run any faster; the Intel system is just as productive with far less CPU use.
I was trying to run some python tasks on my Plex VM to let them use the GPU, since Plex doesn't use it very much, but the CPU use is making it troublesome.
____________
|
|
|
|
Or perhaps the Broadwell-based Intel CPU is able to hardware-accelerate some operations that the EPYC has to do in software, leading to higher CPU use?
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
The application is not coded in any specific way to force more work to be done on more modern processors.
Maybe python handles it under the hood somehow?
____________
|
|
|
|
Maybe python handles it under the hood somehow?
it might be related to pytorch actually. I did some more digging and it seems like AMD has worse performance due to some kind of CPU detection issue in the MKL (or maybe deliberate by Intel). do you know what version of MKL your package uses?
and are you able to set specific env variables in your package? if your MKL is version <=2020.0, setting MKL_DEBUG_CPU_TYPE=5 might help this issue on AMD CPUs. but it looks like this will not be effective if you are on a newer version of the MKL as Intel has since removed this variable.
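One subtlety worth flagging for anyone testing this: MKL reads its environment variables when the library is first loaded, so from inside Python the variable has to be set before the first import of an MKL-backed package (numpy, torch). A minimal sketch, assuming an MKL <= 2020.0 as discussed:

```python
# The env var must be in place BEFORE anything that loads MKL is
# imported; setting it after "import torch" has no effect.
import os
os.environ["MKL_DEBUG_CPU_TYPE"] = "5"  # only honoured by MKL <= 2020.0

# Only import MKL-backed libraries after this point, e.g.:
# import numpy as np
# import torch
```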
____________
|
|
|
|
and are you able to set specific env variables in your package? if your MKL is version <=2020.0, setting MKL_DEBUG_CPU_TYPE=5 might help this issue on AMD CPUs. but it looks like this will not be effective if you are on a newer version of the MKL as Intel has since removed this variable.
to add: I was able to inspect your MKL version as 2019.0.4, and I tried setting the env variable by adding
os.environ["MKL_DEBUG_CPU_TYPE"] = "5"
to the run.py main program, but it had no effect. either I didn't put the command in the right place (I inserted it below line 433 in the run.py script), or the issue is something else entirely.
edit: you also might consider compiling your scripts into binaries to prevent inquisitive minds from messing about in your program ;)
____________
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Should the environment variable for fixing AMD computation in the MKL library be in the task package or just in the host environment? Or both?
I would have thought the latter as the system calls the MKL library is using eventually have to be passed through to the cpu.
export MKL_DEBUG_CPU_TYPE=5
and add to your .bashrc script.
So you need to set the OS environment variable up first, then pass it through to the Python code with os.environ["MKL_DEBUG_CPU_TYPE"].
Of course, if the embedded MKL package is a later version where the variable is now ignored, using the variable to work around the intentional hamstringing of AMD processors is a moot point.
[Edit]
Looks like there is a workaround for the Intel MKL check whether it is running on an Intel processor. https://danieldk.eu/Posts/2020-08-31-MKL-Zen.html
So build the fake shared library and use LD_PRELOAD to load it ahead of MKL.
That might be the easiest method to get the math libraries to use the advanced SIMD instructions like AVX2. |
|
|
|
I didn’t explicitly state it in my previous reply. But I tried all that already and it didn’t make any difference. I even ran run.py standalone outside of BOINC to be sure that the env variable was set. Neither the env variable being set nor the fake Intel library made any difference at all.
But the embedded MKL version is actually an old one. It’s from 2019 as I mentioned before. So it should accept the debug variable. I just think now that it’s probably not the reason.
____________
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Ohh . . . . OK. Didn't know you had tried all the previous existing fixes.
So must be something else going on in the code I guess.
Just thought I would throw it out there in case you hadn't seen the other fixes. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
I could definitely set the env variable depending on package version in my scripts if that made AI agents train faster.
No need to create binaries. I am fine with any user that feels like it tinkering with the code, it always provides useful information. :)
____________
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Don't know if the math functions being used by the Python libraries are any higher than SSE2 or not.
But if they are, the MKL library functions default to SSE2 whenever MKL is called and detects any NON-Intel cpu.
Probably the only way to know for sure is to examine the code and see whether it tries to run any SIMD instructions higher than SSE2, then implement the fix and see if the computations on the cpu are sped up.
Depending on the math function being called, the fix can make things orders of magnitude faster.
Based on Ian's experiment running on his Intel host, the lower cpu usage didn't make the tasks run any faster.
But less cpu usage per task (when the tasks run the same with either high or low cpu usage) would be beneficial when also running other cpu tasks, since the Python tasks aren't taking resources away from those processes.
|
|
|
|
I could definitely set the env variable depending on package version in my scripts if that made AI agents train faster.
No need to create binaries. I am fine with any user that feels like it tinkering with the code, it always provides useful information. :)
Was my placement of the variable in the script appropriate? I inserted it below line 433. Does the script inherit the OS variables already? Just wanted to make sure I had it set properly. I figured the script runs in its own environment outside of BOINC (in Python); that's why I tried adding it to the script.
____________
|
|
|
|
Based on Ian's experiment running on his Intel host, the lower cpu usage didn't make the tasks run any faster.
But less cpu usage per task (when the tasks run the same with either hi or lo cpu usage) would be beneficial when also running other cpu tasks and aren't taking resources away from those processes.
It's hard to say whether it's faster or not, since it's not a true apples-to-apples comparison. So far it feels no faster, but that's comparing different CPUs and different GPUs. Maybe my EPYC system seems similarly fast because the EPYC is just brute-forcing it; it has much higher IPC than the old Broadwell-based Intel.
____________
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
One of my machines started a Python task yesterday evening and finished it after about 24-1/2 hours.
How come a runtime (and CPU time) of 1,354,433.00 secs (=376 hrs) is shown:
https://www.gpugrid.net/result.php?resultid=33030599
As a side effect, I did not get any credit bonus (in this case the one for finishing within 48 hrs). |
|
|
|
One of my machines started a Python task yesterday evening and finished it after about 24-1/ 2hours.
How come that a runtime (and CPU time) of 1,354,433.00 secs (=376 hrs) is shown:
https://www.gpugrid.net/result.php?resultid=33030599
As a side effect, I did not get any credit bonus (in this case the one for finishing within 48 hrs).
The calculated runtime uses the cpu time - this has been mentioned many times. It's because more than one core was being used, so the sum of each core's cpu time is what's shown.
You did get the 48-hr bonus of 25%: base credit is 70,000, and you got 87,500 (+25%). Less than 24 hrs gets +50%, for 105,000.
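For reference, the tiers stated above work out as:

```python
# Credit tiers as described: base 70,000, +25% if returned within
# 48 hours, +50% if returned within 24 hours.
base_credit = 70_000

within_48h = base_credit * 1.25  # 87,500
within_24h = base_credit * 1.50  # 105,000
```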
____________
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
GPUGRID seems to have problems with figures, at least where Python is concerned :-(
I just wanted to download a new Python task. On my Ramdisk there is about 59GB free disk space, but the BOINC event log tells me that Python needs some 532MB more disk space. How come? |
|
|
|
GPUGRID seems to have problems with figures, at least what concerns Python :-(
I just wanted to download a new Python task. On my Ramdisk there is about 59GB free disk space, but the BOINC event log tells me that Python needs some 532MB more disk space. How come?
Probably due to your allocation of disk usage in BOINC. Go into the compute preferences and allow BOINC to use more disk space; by default I think it is set to 50% of the disk drive, so you might need to increase that.
Options-> Computing Preferences...
Disk and Memory tab
and set whatever limits you think are appropriate. it will use the most restrictive of the 3 types of limits. The Python tasks take up a lot of space.
____________
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
probably due to your allocation of disk usage in BOINC. go into the compute preferences and allow BOINC to use more disk space. by default I think it is set to 50% of the disk drive. you might need to increase that.
Options-> Computing Preferences...
Disk and Memory tab
and set whatever limits you think are appropriate. it will use the most restrictive of the 3 types of limits. The Python tasks take up a lot of space.
no, it isn't that.
I am aware of these settings. Since nothing but BOINC is being run on this computer, disk and RAM usage are set to 90% for BOINC.
So, when I have some 58GB free on a 128GB RAM disk (with some 60GB of free system RAM), it should normally be no problem for Python to download and be processed.
On another machine, I have far fewer resources, and it works.
So no idea what the problem is in this case ... :-(
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Or BOINC doesn't consider a RAM Disk a "real" drive and ignores the available storage there.
Could be BOINC only considers physical storage to be valid. |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
Or BOINC doesn't consider a RAM Disk a "real" drive and ignores the available storage there.
Could be BOINC only considers physical storage to be valid.
no, I have BOINC running on another PC with Ramdisk - in that case a much smaller one: 32GB
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
another question -
I think I read something concerning this topic somewhere here, but I cannot find the posting any more (maybe though I am mistaken):
Is there the possibility to limit (by app_config.xml) the number of CPU cores Python is using?
The reason why I am asking is that on that machine onto which Python can be downloaded, I have also another project (not GPU) running, and when Python fills up the number of available cores, the CPU is busy with 100% which slows things down, and also heats up the CPU much more. |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
No. You cannot alter the task configuration. It will always create 32 spawned processes for each task during computation.
If the task is interfering with your other cpu tasks then you have a choice, either stop the Python tasks or reduce your other cpu tasks.
All you can do for making the Python task run reasonably well is assign 3-5 cpu cores for BOINC scheduling to keep other cpu work off the host.
You can do that through a app_config.xml file in the project directory.
Like this:
<app_config>
<app>
<name>PythonGPU</name>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>3.0</cpu_usage>
</gpu_versions>
</app>
</app_config> |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
...
All you can do for making the Python task run reasonably well is assign 3-5 cpu cores for BOINC scheduling to keep other cpu work off the host.
You can do that through a app_config.xml file in the project directory.
Like this: ...
thanks, Keith, for your explanation.
Well, I actually would not need this app_config.xml in my case; the other BOINC tasks don't just assign any number of CPU cores by themselves. I tell each of these projects via a separate app_config.xml how many cores to use (which is, in fact, what I was also hoping to do for Python).
So I have no other choice than to live with the situation as is :-(
What is too bad though is that obviously there are no longer any ACEMD tasks being sent out (where it is basically clear: 1 task = 1 CPU core [unless changed by an app_config.xml]). |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
Or BOINC doesn't consider a RAM Disk a "real" drive and ignores the available storage there.
Could be BOINC only considers physical storage to be valid.
no, I have BOINC running on another PC with Ramdisk - in that case a much smaller one: 32GB
Now I tried once more to download a Python on my system with a 128GB Ramdisk (plus 128GB system RAM).
The BOINC event log says:
Python apps for GPU hosts needs 4590.46MB more disk space. You currently have 28788.14 MB available and it needs 33378.60 MB.
Somehow though all this does not fit together: in reality, the Ramdisk is filled with 73GB and has 55GB available.
Further, I am questioning whether Python indeed needs 33,378 MB of free disk space for downloading?
I am really frustrated that this does not work :-(
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
...
All you can do for making the Python task run reasonably well is assign 3-5 cpu cores for BOINC scheduling to keep other cpu work off the host.
You can do that through a app_config.xml file in the project directory.
Like this: ...
thanks, Keith, for your explanation.
Well, I actually would not need to put in this app_config.xml as in my case; the other BOINC tasks don't just asign any number of CPU cores by themselves. I tell each of these projects by a seperate app_config.xml how many cores to use (which I was, in fact, also hoping for Python).
So I have no other choice than to live with the situation as is :-(
What is too bad though is that obviously there are no longer any ACEMD tasks being sent out (where it is basically clear: 1 task = 1 CPU core [unless changed by an app_config.xml]).
You are not understanding the nature of the Python tasks. They are not using all your cores. They are not using 32 cores. They are using 32 spawned processes.
A process is NOT a core.
The Python tasks use from 100-300% of a cpu core depending on the speed of the host and the number of cores in the host.
That is why I offered the app_config.xml file to allot 3 cpu cores to each Python task for BOINC scheduling purposes. And you can have many app_config.xml files in play among all your projects, as an app_config file is specific to each project and is placed into that project's folder. You certainly can use one for scheduling help for GPUGrid.
An app_config file does not control the number of cores a task uses. That is dependent solely on the science application. A task will use as many or as few cores as needed.
The only exception to that fact is in the special case of plan_class MT like the cpu tasks at Milkyway. Then BOINC has an actual control parameter --nthreads that can specifically set the number of cores allowed in the MT plan_class task.
That cannot be used here because the Python tasks are not a simple cpu only MT type task. They are something completely different and something that BOINC does not know how to handle. They are a dual cpu-gpu combination task where the majority of computation is done on a cpu with bursts of activity on a gpu and then computation repeats that action.
It would take a major rewrite of core BOINC code to properly handle this type of machine-learning, reinforcement learning combo tasks. Unless BOINC attracts new developers that are willing to tackle this major development hurdle, the best we can do is just accommodate these tasks through other host controls. |
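Roughly, the alternation Keith describes looks like this in a bare pseudocode-style sketch of an on-policy RL loop (illustrative only, not the project's actual code): long CPU-bound environment rollouts, punctuated by short GPU-bound policy updates.

```python
# Sketch of the dual cpu-gpu work pattern: the environment rollout runs
# on CPU, then a burst of GPU work updates the policy, and it repeats.
def train(n_updates, rollout_len):
    timeline = []
    for _ in range(n_updates):
        for _ in range(rollout_len):
            timeline.append("cpu_env_step")    # env interaction on CPU
        timeline.append("gpu_policy_update")   # learning burst on GPU
    return timeline
```

with a long rollout, GPU utilization averaged over the run stays low even though the GPU is essential, which matches the intermittent spikes people see.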
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Make sure there are NO checkmarks on any selection in the Disk and memory tab of the BOINC Manager Options >> Computing Preferences page.
That is what is limiting your Downloads. |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
Make sure there are NO checkmarks on any selection in the Disk and memory tab of the BOINC Manager Options >> Computing Preferences page.
That is what is limiting your Downloads.
I had removed these checkmarks already before.
What I did now was to stop new Rosetta tasks (which also need a lot of disk space for their VM files), so the free disk space climbed up to about 80GB - only then the Python download worked. Strange, isn't it? |
|
|
|
The reason Reinforcement Learning agents do not currently use the whole potential of the cards is that the interactions between the AI agent and the simulated environment are performed on CPU, while the agent's "learning" process is the one that uses the GPU intermittently.
There are, however, environments that run entirely on GPU. They are becoming more and more common, so I see it as a real possibility that in the future the field's most popular benchmarks will be GPU-only. Then the jobs will be much more efficient, since pretty much only the GPU will be used. Unfortunately we are not there yet...
a suggestion for whenever you're able to move to pure GPU work: PLEASE look into and enable "automatic mixed precision" in your code.
https://pytorch.org/docs/stable/notes/amp_examples.html
this should greatly benefit devices with Tensor cores and speed things up.
____________
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Make sure there are NO checkmarks on any selection in the Disk and memory tab of the BOINC Manager Options >> Computing Preferences page.
That is what is limiting your Downloads.
I had removed these checkmarks already before.
What I did now was to stop new Rosetta tasks (which also need a lot of disk space for their VM files), so the free disk space climbed up to about 80GB - only then the Python download worked. Strange, isn't it?
I think your issue is your use of a fixed ram disk size instead of a dynamic pagefile that is allowed to grow larger as needed. |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
Make sure there are NO checkmarks on any selection in the Disk and memory tab of the BOINC Manager Options >> Computing Preferences page.
That is what is limiting your Downloads.
I had removed these checkmarks already before.
What I did now was to stop new Rosetta tasks (which also need a lot of disk space for their VM files), so the free disk space climbed up to about 80GB - only then the Python download worked. Strange, isn't it?
I think your issue is your use of a fixed ram disk size instead of a dynamic pagefile that is allowed to grow larger as needed.
I just noticed the same problem with Rosetta Python tasks. So this may be related in some way to the Python architecture.
Also in the Rosetta case, the actual disk space available was significantly higher than Rosetta said it would need.
So I don't believe that this has anything to do with the fixed ram disk size. What is the logic behind your assumption?
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
If you read the through the various posts, including mine, or investigate the issues with Pytorch on Windows, it is because of the nature of how Windows handles reservation of memory addresses compared to how Linux handles that.
The Pytorch libraries, when downloaded and expanded, ask for many gigabytes of memory. Windows has to set aside every bit of memory space that the application asks for, whether it will be needed or not. Linux does not have to, since it allocates memory dynamically (overcommit).
And since every Python task is likely different, there is likely no reuse of the previous Pytorch libraries, so every task needs to get all of its configured resources every time a new task is executed.
So the best method to satisfy this fact on Windows is to start with a 35GB minimum size pagefile with a 50GB maximum size and allow the pagefile to size dynamically between that range. Your fixed ram disk size just isn't flexible enough or large enough apparently. That pagefile size seems to be sufficient for the other Windows users I have assisted with these tasks.
Read this explanation please for the actual particulars of the problem with Windows. https://www.gpugrid.net/forum_thread.php?id=5322&nowrap=true#58908
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
So the best method to satisfy this fact on Windows is to start with a 35GB minimum size pagefile with a 50GB maximum size and allow the pagefile to size dynamically between that range. Your fixed ram disk size just isn't flexible enough or large enough apparently. That pagefile size seems to be sufficient for the other Windows users I have assisted with these tasks.
thanks for the hint, I will adapt the page file size accordingly and see what happens.
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Not sure if it would have made a difference, but I would have placed your code before line 433, only after importing os and sys
"""
if __name__ == "__main__":
import sys
sys.stderr.write("Starting!!\n")
import os
os.environ["MKL_DEBUG_CPU_TYPE"] = "5"
import platform
"""
____________
|
|
|
|
Not sure if it would have made a difference, but I would have placed your code before line 433, only after importing os and sys
"""
if __name__ == "__main__":
import sys
sys.stderr.write("Starting!!\n")
import os
os.environ["MKL_DEBUG_CPU_TYPE"] = "5"
import platform
"""
thanks :) I'll try anyway
edit - nope, no different.
____________
|
|
|
|
really unfortunate that these use so many more resources on AMD than Intel. It's something about the multithreaded nature of the main run.py process itself. on intel each of its threads uses about 2-5%, and more run.py threads spin up the more cores you have. with AMD, it's like 20-40% per thread, so with high core count CPUs, that makes total CPU utilization crazy high.
here is what it looks like running 4x python tasks (2 GPUs, 2 tasks each) on an intel 8-core, 16-thread system. what you're seeing is the 4 main run.py processes and their multithreaded components. notice that the total CPU used by each main process is a little more than 100%, which equates to a full thread for each process.
now here is what it looks like running only 2x python tasks (1 GPU, 2 tasks each) on an AMD EPYC system with 24 cores, 48 threads. you can see the main run.py multithreaded components each using 20-40%, and each process cumulatively using 600-800% CPU, EACH. that's 6-8 whole threads occupied for a single process, making it roughly 6-8x more resource intensive to run on AMD than Intel.
I even swapped my 8c/16t intel CPU for a 16c/32t one, and while it spun up more multithreaded components for the main run.py, each one was still only 2-5% used, making it only about 150% CPU used by each main process. something definitely weird going on with these tasks between AMD and Intel.
the CPU used by the 32x multiprocessing.spawns is about the same between intel and AMD. it's only the threads that stem from the main run.py process that show this huge difference.
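for anyone who wants to experiment: the usual knob for runaway math thread pools is the family of threading environment variables, set before the numeric libraries are imported. whether the GPUGRID wrapper actually lets you inject these into run.py is an open question, so treat this as a sketch, not a fix:

```python
# Sketch: cap the thread pools that OpenMP/MKL/OpenBLAS spawn.
# These env vars are honored by most BLAS/OpenMP builds, but they must
# be set BEFORE numpy/torch are imported to take effect.
import os

def cap_math_threads(n):
    caps = {
        "OMP_NUM_THREADS": str(n),       # OpenMP worker threads
        "MKL_NUM_THREADS": str(n),       # Intel MKL pool
        "OPENBLAS_NUM_THREADS": str(n),  # OpenBLAS pool
    }
    os.environ.update(caps)
    return caps
```

if the AMD overhead really is MKL/OpenMP thread spinning, capping these could tame it; if it's something in run.py itself, it won't help.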
____________
|
|
|
DiplomatSend message
Joined: 1 Sep 10 Posts: 15 Credit: 853,849,648 RAC: 5,979,954 Level
Scientific publications
|
No. You cannot alter the task configuration. It will always create 32 spawned processes for each task during computation.
If the task is interfering with your other cpu tasks then you have a choice, either stop the Python tasks or reduce your other cpu tasks.
All you can do for making the Python task run reasonably well is assign 3-5 cpu cores for BOINC scheduling to keep other cpu work off the host.
You can do that through a app_config.xml file in the project directory.
Like this:
<app_config>
<app>
<name>PythonGPU</name>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>3.0</cpu_usage>
</gpu_versions>
</app>
</app_config>
does it improve GPU utilization? on average I see barely 20% with seldom spikes up to 35% |
|
|
|
does it improve GPU utilization? on average I see barely 20% with seldom spikes up to 35%
not directly. but if your GPU is being bottlenecked by not enough CPU resources then it could help.
the best configuration so far is to not run ANY other CPU or GPU work. run only these tasks, and run 2 at a time to occupy a little more GPU.
____________
|
|
|
gemini8 Send message
Joined: 3 Jul 16 Posts: 31 Credit: 2,212,787,676 RAC: 4,959,470 Level
Scientific publications
|
Hi everyone.
the best configuration so far is to not run ANY other CPU or GPU work. run only these tasks, and run 2 at a time to occupy a little more GPU.
I'm thinking about putting every other Boinc CPU work into a VM instead of running it directly on the host.
You could have a VM using only 90 per cent of processing power through the VM settings.
This would leave the rest for the Python stuff, so on a sixteen-thread CPU it could use 160% of one thread's power or 10% of the CPU.
If this wasn't enough the VM could be adjusted to only using eighty per cent (320% of one thread's power or 20% of the CPU for the Python work) and so on.
Repeat [adjust and try] until the machine does fine.
Plus, you could run other GPU stuff on your GPU to have it fully utilized which should prevent high temperature variations which I see as unnecessary stress for a GPU.
MilkyWay has a small VRAM footprint and doesn't use a full GPU, and maybe I'll try WCG OPNG as well.
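the arithmetic above generalizes; a tiny helper (purely illustrative) gives the headroom left for the Python stuff as a percentage of one thread's power:

```python
# On an n-thread CPU, capping the VM at vm_pct % of total CPU leaves
# (100 - vm_pct)% of the machine free, which expressed relative to a
# single thread is n * (100 - vm_pct) percent of one thread's power.
def headroom_pct_of_one_thread(n_threads, vm_pct):
    return n_threads * (100.0 - vm_pct)
```

so on a 16-thread CPU, a 90% VM cap leaves 160% of one thread, an 80% cap leaves 320%, and so on.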
____________
- - - - - - - - - -
Greetings, Jens |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
... and maybe I'll try WCG OPNG as well.
forget about WCG OPNG for the time being. Most of the time no tasks available; and if tasks are available for a short period of time, it's extremely hard to get them downloaded. The downloads get stuck most of the time, and only manual intervention helps.
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
Question: can a running Python task be interrupted for the time a Windows Update takes place (with rebooting of the PC), or does this damage the task? |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Question: can a running Python task be interrupted for the time a Windows Update takes place (with rebooting of the PC), or does this damage the task?
Yes, one great thing about the Python tasks compared to acemd3 tasks is that they can be interrupted and continue on later with no penalty.
They save checkpoints well which are replayed to get the task back to the point in progress it was at before interruption.
Just be advised, that the replay process takes a few minutes after restart. The task will show 2% completion percentage upon restart but will eventually jump back to the progress point it was at and continue calculation until end.
Just be patient and let the task run. |
|
|
Jim1348Send message
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0 Level
Scientific publications
|
Yes, one great thing about the Python tasks compared to acemd3 tasks is that they can be interrupted and continue on later with no penalty.
I have a problem that they fail on reboot however. Is that common?
http://www.gpugrid.net/results.php?hostid=583702
That is only on Windows though. I have not seen it yet on Linux, but I don't reboot often there.
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Yes, one great thing about the Python tasks compared to acemd3 tasks is that they can be interrupted and continue on later with no penalty.
I have a problem that they fail on reboot however. Is that common?
http://www.gpugrid.net/results.php?hostid=583702
That is only on Windows though. I have not seen it yet on Linux, but I don't reboot often there.
Guess it must be only on Windows. No problem restarting a task after a reboot on Ubuntu. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
The restart is supposed to work fine on Windows as well. Could you provide more information about when this error happens please? Does it happen systematically every time you interrupt and try to resume a task?
Is there anyone for which the Windows checkpointing works fine? I tested locally and it worked.
____________
|
|
|
Jim1348Send message
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0 Level
Scientific publications
|
Could you provide more information about when this error happens please? Does it happen systematically every time you interrupt and try to resume a task?
I can pause and restart them with no problem. The error occurred only on reboot.
But I think I have found it. I was using a large write cache, PrimoCache, set with an 8 GB cache size and 1 hour latency. By disabling that, I am able to reboot without a problem. So there was probably a delay in flushing the cache on reboot that caused the error.
But I used the write cache to protect my SSD, since I was seeing writes of around 370 GB a day, too much for me. But this time I am seeing only 200 GB/day. That is still a lot, but not fatal for some time. It seems that the work units vary in how much they will write. I will monitor it.
I use SsdReady to monitor the writes to disk; the free version is OK.
PS - I can set PrimoCache to only a 1 GB write-cache size with a 5 minute latency, and it reboots without a problem. Whether that is good enough to protect the SSD will have to be determined by monitoring the actual writes to disk. PrimoCache gives a measure of that. (SsdReady gives the OS writes, but not the actual writes to disk.)
PPS: I should point out that the reason a write cache can cut down on the writes to disk is because of the nature of scientific algorithms. They invariably read from a location, do a calculation, and then write back to the same location much of the time. Then, the cache can store that, and only write to the disk the changes that occur at the end of the flush period. If you have a large enough cache, and set the write-delay to infinite, you essentially have a ramdisk. But the cache can be good enough, with less memory than a ramdisk would require. (And now it seems that 2 GB and 10 minutes works OK.) |
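The coalescing effect is easy to show with a toy model (this is just the principle, not PrimoCache's implementation): repeated writes to the same block collapse into one disk write per flush interval.

```python
# Toy write-back cache: block -> latest data. Only flush() touches
# "disk", so N rewrites of a hot block cost one disk write per flush.
class WriteCache:
    def __init__(self):
        self.dirty = {}
        self.disk_writes = 0

    def write(self, block, data):
        self.dirty[block] = data   # coalesced in memory

    def flush(self):
        self.disk_writes += len(self.dirty)
        self.dirty.clear()

cache = WriteCache()
for i in range(1000):
    cache.write(i % 10, i)         # 1000 logical writes, 10 hot blocks
cache.flush()                      # only 10 writes reach the disk
```

The same 1000 logical writes would hit the SSD 1000 times without the cache, and only 10 times with it, which is the write reduction SsdReady was showing.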
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
Question for the experts here:
One of my PCs has 2 RTX3070 inside, Pythons are running quite well.
The interesting thing is that VRAM usage of one GPU always is about 3.7GB, usage of the other always is about 4.3GB.
So with one of the GPUs I could (try to) process 2 Pythons simultaneously, with the other not (VRAM of the RTX3070 is 8GB).
Is it possible to arrange for such a setting via app_config.xml?
BTW, I know what the app_config.xml looks like for running 2 Pythons on both GPUs (<gpu_usage>0.5</gpu_usage>), but I have no idea how to configure the xml according to my wishes as outlined above.
Can anyone help? |
|
|
|
Sorry. There is no way to configure an app_config to differentiate between devices.
You can only have different settings for different applications.
The only option, which you might not want to do, is to run two different BOINC clients on the same system, to the project this will look like two different computers each having one GPU. Then you could configure one to run 2x and the other to run 1x.
But the amount of VRAM used by the Python app is likely the same between your cards. The first GPU will always have more VRAM used because it's running your display. A second task won't use 4.3GB again, most likely only another +3.6GB.
____________
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
Sorry. There is no way to configure an app_config to differentiate between devices.
You can only have different settings for different applications.
The only option, which you might not want to do, is to run two different BOINC clients on the same system, to the project this will look like two different computers each having one GPU. Then you could configure one to run 2x and the other to run 1x.
But the amount of VRAM used by the Python app is likely the same between your cards. But the first GPU will always have more vram used because it’s running your display.
In fact, I have 2 BOINC clients on this PC; I had to establish the second one with the BOINC DataDir on the SSD, since the first one is on the 32GB Ramdisk, which would not allow Python tasks to be downloaded ("not enough disk space").
However, next week I will double the RAM on this PC, from 64 to 128GB, and then I will increase the Ramdisk size to at least 64GB; this should make it possible to download Python - at least that's what I hope.
So then I could run 1 Python on each of the 2 GPUs on the SSD client, and a third Python on the Ramdisk client.
The only two questions now are: how do I tell the Ramdisk client to run only 1 Python (although 2 GPUs available)? And how do I tell the Ramdisk client to choose the GPU with the lower amount of VRAM usage (i.e. the one that's NOT running the display)?
In fact, I would prefer to run 2 Pythons on the Ramdisk client and 1 Python on the SSD client; however, the question is whether I could download 2 Pythons on the 64GB Ramdisk - the only thing I could do is to try.
|
|
|
|
please read the BOINC documentation for client configuration. all of the options and what they do are in there.
https://boinc.berkeley.edu/wiki/Client_configuration
you will need to change several things to run multiple clients at the same time. you need to start them on different ports, as well as add several things to cc_config. you will also need to exclude the GPU you don't want to use from each client.
either use the <exclude_gpu> section (where BOINC can see the device but won't use it for a given project)
or use the <ignore_nvidia_dev> tag (where BOINC won't see the device at all, for any project)
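a minimal cc_config.xml sketch for the client that should skip the display GPU (the device number and project URL here are assumptions - check your own event log for the right values):

```xml
<cc_config>
  <options>
    <!-- client still sees device 0 but won't run this project on it -->
    <exclude_gpu>
      <url>https://www.gpugrid.net/</url>
      <device_num>0</device_num>
    </exclude_gpu>
    <!-- alternative: hide device 0 from this client for ALL projects -->
    <!-- <ignore_nvidia_dev>0</ignore_nvidia_dev> -->
  </options>
</cc_config>
```

restart the client (not just the manager) after editing cc_config.xml so the change is picked up.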
____________
|
|
|
|
personally I would stop running the ram disk. it's just extra complication and eats up ram space that the Python tasks crave. your biggest benefit will be moving to linux, it's easily 2x faster, maybe more. I don't know how you have your systems set up, but i see your longest runtimes on your 3070 are like 24hrs. that's crazy long. are you not leaving enough CPU available? are you running other CPU work at the same time?
for comparison, I built a Linux machine dedicated to these tasks. 2x RTX 3060 and a 24-core EPYC CPU and 128GB system ram. I am not running any other work on it, only PythonGPU. to give these tasks the optimum conditions to run as fast as possible.
with 12GB of VRAM, i can run 3x per GPU and it completes tasks in about 13hrs at the longest, for an effective longest completion time of about 1 task every 4.3hrs, which means at minimum, this system with 2x GPUs (6x tasks running) completes about 11 tasks per day (1,155,000 cred) + the bonus of some tasks completing earlier. you can see that my 3060 in this system is 6x more productive than your 3070. that's an insane difference
doing this uses about 80-90% of the CPU, and ~56GB of system ram. I have enough spare VRAM to add another GPU, but maybe not enough CPU power to support more than 1 more task. if I want another GPU i will probably need a more powerful (more cores) CPU.
____________
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
...
either use the <exclude_gpu> section (where BOINC can see the device but wont use it for a given project)
or use the <ignore_nvidia_dev> tag (where BOINC wont see this device at all for any project)
thanks very much for your hints:-)
One other thing that I now noticed when reading the stderr of the 3 Pythons that failed short time after start:
"RuntimeError: [enforce fail at C:\cb\pytorch_1000000000000\work\c10\core\impl\alloc_cpu.cpp:81] data. DefaultCPUAllocator: not enough memory: you tried to allocate 3612672 bytes"
So the reason why the tasks crashed after a few seconds was not the too small VRAM (this would probably have come up a little later), but the lack of system RAM.
In fact, I remember that right after start of the 4 Pythons, the Meminfo tool showed a rapid decrease of free system RAM, and shortly thereafter the free RAM was going up again (i.e. after 3 tasks had crashed thus releasing memory).
Any idea how much system RAM, roughly, a Python task takes?
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
One other thing that I now noticed when reading the stderr of the 3 Pythons that failed short time after start:
"RuntimeError: [enforce fail at C:\cb\pytorch_1000000000000\work\c10\core\impl\alloc_cpu.cpp:81] data. DefaultCPUAllocator: not enough memory: you tried to allocate 3612672 bytes"
So the reason why the tasks crashed after a few seconds was not the too small VRAM (this would probably have come up a little later), but the lack of system RAM.
In fact, I remember that right after start of the 4 Pythons, the Meminfo tool showed a rapid decrease of free system RAM, and shortly thereafter the free RAM was going up again (i.e. after 3 tasks had crashed thus releasing memory).
Any idea how much system RAM, roughly, a Python task takes?
From what I can see in the Windows Task Manager on this PC and on others running Python tasks, RAM usage of a Python task can be anywhere from about 1GB to 6GB (!)
How come it varies that much?
|
|
|
|
you should figure 7-8GB per python task. that's what it seems to use on my linux system. i would imagine it uses a little when the task starts up, then slowly increases once it gets to running full out. that might be the reason for the variance of 1GB in the beginning, and 6+GB by the time it gets to running the main program.
these tasks work in 3 phases from what i've seen
Phase 1: extraction phase. just extracting the compressed package. usually takes about 5 minutes, depending on CPU speed. uses only a single core.
Phase 2: pre-processing and/or pre-loading. uses a large % of CPU power, GPU gets intermittently used, and VRAM preloads to about 60% of what will be eventually used. (in my case, VRAM preloads about 2100MB). this also lasts about 5 mins.
Phase 3: main program. CPU use drops down, and VRAM use loads up to 100% of what is needed (in my case 3600MB per task).
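if anyone wants to watch the phases themselves, polling nvidia-smi is enough. the query flags below are standard nvidia-smi options; the parsing helper is my own sketch:

```python
# Poll per-GPU memory use via nvidia-smi to see the ~60% preload in
# phase 2 and the full allocation in phase 3.
import subprocess

def parse_mib(line):
    # one line of "--query-gpu=memory.used --format=csv,noheader,nounits"
    # is just the number of MiB, e.g. "2100"
    return int(line.strip())

def vram_used_mib(device=0):
    out = subprocess.run(
        ["nvidia-smi", "-i", str(device),
         "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_mib(out.splitlines()[0])
```

run it in a loop every few seconds and log the value; the jump from ~2100MB to ~3600MB marks the phase 2 to phase 3 transition in my case.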
____________
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
Erich56 asked: Question: can a running Python task be interrupted for the time a Windows Update takes place (with rebooting of the PC), or does this damage the task?
I tried it now - the two tasks running on a RTX3070 each - on Windows - did not survive a reboot :-(
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
since yesterday, when I upgraded the RAM of one of my PCs from 64GB to 128GB (so now I have a 64GB Ramdisk plus 64GB system RAM; before it was half each), every GPUGRID Python fails on this PC with 2 RTX3070 inside.
The task starts okay, RAM as well as VRAM is filling up continuously, also the CPU usage is close to 100%, and after a while (a few minutes up to half an hour) the task fails.
The BOINC manager says "aborted by the project", and the task description says "aufgegeben" = abandoned or so.
Interestingly, no times are shown, neither runtime nor CPU time, further there is no stderr.
See this example:
https://www.gpugrid.net/result.php?resultid=33044774
on another machine, I have two tasks running simultaneously on one GPU - no problem at all.
I was of course thinking of a defective RAM module; however, I had 5 LHC ATLAS tasks (3 cores each) running simultaneously all night without any problem. So I guess this was enough of a RAM test.
Also hundreds of WCG GPU tasks were processed this morning for hours, also without any problem.
Anyone any ideas? |
|
|
DiplomatSend message
Joined: 1 Sep 10 Posts: 15 Credit: 853,849,648 RAC: 5,979,954 Level
Scientific publications
|
<app_config>
<app>
<name>PythonGPU</name>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>3.0</cpu_usage>
</gpu_versions>
</app>
</app_config>
I'm new to config editing :) a few more questions
Do I need to be more specific in <name> tag and put full application name like Python apps for GPU hosts 4.03 (cuda1131) from task properties?
Because I don't see 3 CPUs being given to the task after client restart
Application Python apps for GPU hosts 4.03 (cuda1131)
Name e00015a03227-ABOU_rnd_ppod_expand_demos25-0-1-RND8538
State Running
Received Tue 20 Sep 2022 10:48:34 PM +05
Report deadline Sun 25 Sep 2022 10:48:34 PM +05
Resources 0.99 CPUs + 1 NVIDIA GPU
Estimated computation size 1,000,000,000 GFLOPs
CPU time 00:48:32
CPU time since checkpoint 00:00:07
Elapsed time 00:11:37
Estimated time remaining 50d 21:42:09
Fraction done 1.990%
Virtual memory size 18.16 GB
Working set size 5.88 GB
Directory slots/8
Process ID 5555
Progress rate 6.840% per hour
Executable wrapper_26198_x86_64-pc-linux-gnu
|
|
|
KAMasudSend message
Joined: 27 Jul 11 Posts: 137 Credit: 523,901,354 RAC: 0 Level
Scientific publications
|
Yes, one great thing about the Python tasks compared to acemd3 tasks is that they can be interrupted and continued later with no penalty.
I have a problem that they fail on reboot, however. Is that common?
http://www.gpugrid.net/results.php?hostid=583702
That is only on Windows, though. I have not seen it yet on Linux, but I don't reboot often there.
I guess it must be Windows-only. No problem restarting a task after a reboot on Ubuntu.
The restart works fine on Windows. Maybe it is the five-minute break at 2% that is causing the confusion.
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Any ideas?
Get rid of the RAM disk. |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
<app_config>
<app>
<name>PythonGPU</name>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>3.0</cpu_usage>
</gpu_versions>
</app>
</app_config>
I'm new to config editing :) A few more questions:
Do I need to be more specific in the <name> tag and put the full application name, like "Python apps for GPU hosts 4.03 (cuda1131)" from the task properties?
I ask because I don't see 3 CPUs being given to the task after a client restart:
Application Python apps for GPU hosts 4.03 (cuda1131)
Name e00015a03227-ABOU_rnd_ppod_expand_demos25-0-1-RND8538
State Running
Received Tue 20 Sep 2022 10:48:34 PM +05
Report deadline Sun 25 Sep 2022 10:48:34 PM +05
Resources 0.99 CPUs + 1 NVIDIA GPU
Estimated computation size 1,000,000,000 GFLOPs
CPU time 00:48:32
CPU time since checkpoint 00:00:07
Elapsed time 00:11:37
Estimated time remaining 50d 21:42:09
Fraction done 1.990%
Virtual memory size 18.16 GB
Working set size 5.88 GB
Directory slots/8
Process ID 5555
Progress rate 6.840% per hour
Executable wrapper_26198_x86_64-pc-linux-gnu
Any already downloaded task will keep the original CPU-GPU resource assignment.
Any newly downloaded task will show the NEW assignment.
The name for the tasks is PythonGPU, as you show.
You should always refer to the client_state.xml file, as it is the final arbiter of the correct naming and task configuration. |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Yes, one great thing about the Python tasks compared to acemd3 tasks is that they can be interrupted and continued later with no penalty.
I have a problem that they fail on reboot, however. Is that common?
http://www.gpugrid.net/results.php?hostid=583702
That is only on Windows, though. I have not seen it yet on Linux, but I don't reboot often there.
I guess it must be Windows-only. No problem restarting a task after a reboot on Ubuntu.
The restart works fine on Windows. Maybe it is the five-minute break at 2% that is causing the confusion.
If you interrupt the task in its Stage 1 of downloading and unpacking the required support files, it may fail on Windows upon restart.
It normally reports a failure for this reason in the stderr.txt.
It is best to interrupt the task only once it is actually calculating, after its setup is done and it has produced at least one checkpoint. |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
Any ideas?
Get rid of the RAM disk.
On the other hand, the RAM disk works perfectly on this machine:
https://www.gpugrid.net/show_host_detail.php?hostid=599484 |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Any ideas?
Get rid of the RAM disk.
On the other hand, the RAM disk works perfectly on this machine:
https://www.gpugrid.net/show_host_detail.php?hostid=599484
Then you need to investigate the differences between the two hosts.
All I'm stating is that the RAM disk is an unnecessary complication that is not needed to process the tasks.
Basic troubleshooting: reduce to the most basic configuration absolutely needed for the tasks to complete correctly, then add back one superfluous element at a time until the tasks fail again.
Then you have identified why the tasks fail. |
|
|
DiplomatSend message
Joined: 1 Sep 10 Posts: 15 Credit: 853,849,648 RAC: 5,979,954 Level
Scientific publications
|
Keith Myers thanks! |
|
|
DiplomatSend message
Joined: 1 Sep 10 Posts: 15 Credit: 853,849,648 RAC: 5,979,954 Level
Scientific publications
|
In my case, the config didn't want to work until I added <max_concurrent>:
<app_config>
<app>
<name>PythonGPU</name>
<max_concurrent>1</max_concurrent>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>3.0</cpu_usage>
</gpu_versions>
</app>
</app_config>
Now I see the expected status: Running (3 CPUs + 1 NVIDIA GPU).
Unfortunately, it doesn't help to get higher GPU utilization. Completion time looks like it is going to be slightly better, though. |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
In my case, the config didn't want to work until I added <max_concurrent>:
<app_config>
<app>
<name>PythonGPU</name>
<max_concurrent>1</max_concurrent>
<gpu_versions>
<gpu_usage>1.0</gpu_usage>
<cpu_usage>3.0</cpu_usage>
</gpu_versions>
</app>
</app_config>
Now I see the expected status: Running (3 CPUs + 1 NVIDIA GPU).
Unfortunately, it doesn't help to get higher GPU utilization. Completion time looks like it is going to be slightly better, though.
If you have enough CPU for support and enough VRAM on the card, you can get better GPU utilization by moving to 2X tasks on the card. Just change the gpu_usage to 0.5.
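For reference, a sketch of the resulting file (same app_config as quoted above; max_concurrent is raised to 2 here so the second task is actually allowed to start, which is an assumption on my part about the desired setup):

```xml
<app_config>
   <app>
      <name>PythonGPU</name>
      <max_concurrent>2</max_concurrent>
      <gpu_versions>
         <gpu_usage>0.5</gpu_usage>
         <cpu_usage>3.0</cpu_usage>
      </gpu_versions>
   </app>
</app_config>
```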
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
Any ideas?
Get rid of the RAM disk.
On the other hand, the RAM disk works perfectly on this machine:
https://www.gpugrid.net/show_host_detail.php?hostid=599484
Then you need to investigate the differences between the two hosts.
All I'm stating is that the RAM disk is an unnecessary complication that is not needed to process the tasks.
Basic troubleshooting: reduce to the most basic configuration absolutely needed for the tasks to complete correctly, then add back one superfluous element at a time until the tasks fail again.
Then you have identified why the tasks fail.
I installed a RAM disk because quite often I am crunching tasks which write many GB of data to disk, e.g. LHC ATLAS, the GPU tasks from WCG, the Pythons from Rosetta, and last but not least the Pythons from GPUGRID: about 200GB within 24 hours, which is a lot (so for my two RTX 3070s, this would be 400GB/day).
So, if the machines are running 24/7, in my opinion this is simply not good for an SSD's lifetime.
Over the years, my experience with the RAM disk has been good. I have no idea what kind of problem the GPUGRID Pythons have with this particular RAM disk, or vice versa. As said, on another machine with a RAM disk I also have 2 Pythons running concurrently, even on one GPU, and it works fine.
So what I did yesterday evening was let only one of the two RTX 3070s crunch a Python. On the other GPU, I sometimes crunched WCG, or nothing at all.
This evening, after about 22.5 hours, the Python finished successfully :-)
BTW, besides the Python, 3 ATLAS tasks (3 cores each) were also running all the time.
Which means: what I know so far is that I can obviously run Pythons on at least one of the two RTX 3070s, and other projects on the other one.
Still, I will try to further investigate why GPUGRID Pythons don't run on both RTX 3070s. |
|
|
|
I do not know how to properly mention the project administrators in this topic in order to draw their attention to the problem of non-optimal use of disk space by this application.
Only now did I notice what is contained in the slotX directory while a task is running.
I was very surprised to see there, in addition to the unpacked application files, the archive itself from which these files are unpacked. Moreover, the archive is present in two copies at once, apparently due to the suboptimal process of unpacking the tar.gz format.
Here you can see that the application's files themselves occupy only half of the working directory (slotX) volume.
Apparently, when the application starts, the following happens:
1) The source archive of the application (pythongpu_windows_x86_64__cuda1131.tar.gz.1a152f102cdad20f16638f0f269a5a17) is copied from the project directory (\projects\www.gpugrid.net) to the working directory (\slots\X\).
2) The archive is then decompressed (tar.gz >> tar).
3) At the last stage, the application files are unpacked from the tar container.
At the end of the process, the now-unnecessary tar and tar.gz files are (for some reason) not deleted from the working directory.
Thus, not only does the peak amount of space occupied by each instance of this WU reach ~16 GiB, but this volume stays occupied until the WU completes.
The whole process also requires both much more time (copying and unpacking) and a larger amount of written data:
Project tar.gz >> slotX (2.66 GiB) >> tar (5.48 GiB) >> app files (5.46 GiB) = 13.6 GiB
Both parameters can be significantly reduced by unpacking the files directly into the working directory from the source archive, without any of the intermediate stages mentioned.
7za, which is used for unpacking the archives, supports pipelining:
7za x "X:\BOINC\projects\www.gpugrid.net\pythongpu_windows_x86_64__cuda1131.tar.gz.1a152f102cdad20f16638f0f269a5a17" -so | 7za x -aoa -si -ttar -o"X:\BOINC\slots\0\"
Project tar.gz >> app files (5.46 GiB) = 5.46 GiB !
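As a side note, the single-pass idea is easy to demonstrate outside of 7za. For example, Python's standard tarfile module can do the same thing (illustrative only; the project itself uses the 7za.exe pipeline above, and the file names here are placeholders): the gzip decompression is streamed straight into the tar reader, so no intermediate .tar is ever written to disk.

```python
import tarfile

def unpack_single_pass(archive="archive.tar.gz", dest="slot_dir"):
    # Mode "r:gz" decompresses the gzip stream on the fly while reading
    # tar members, so only the final files land in the destination.
    with tarfile.open(archive, mode="r:gz") as tar:
        tar.extractall(path=dest)
```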
Moreover, if you use the 7z format for the archive (LZMA2 with the "5 - Normal" profile, which is the default in recent 7-Zip versions) instead of tar.gz, you can not only seriously reduce the amount of data downloaded by each user (and, as a consequence, the load on the project infrastructure's bandwidth), but also speed up the process of unpacking the data.
The saving is more than one GiB:
On my computer, unpacking via the pipeline mentioned above using the current (12-year-old) 7za version (9.20) takes ~100 seconds.
With the recent version of 7za (22.01), it takes only ~45-50 seconds:
7za x "X:\BOINC\projects\www.gpugrid.net\pythongpu_windows_x86_64__cuda1131.7z" -o"X:\BOINC\slots\0\"
I believe the described changes are worth implementing (even if not all of them, and/or not all at once).
Moreover, all the changes come down to updating one executable file, repacking the archive, and changing the command that unpacks it. |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
I believe the researcher has already been down this road: Windows does not natively support the compression/decompression tools you mention.
It requires each volunteer to add support manually to their hosts.
In the quest for compatibility, a researcher tries to package applications so that all attached hosts can run them natively without jumping through hoops, so that everyone can run the tasks.
|
|
|
|
It requires each volunteer to add support manually to their hosts.
No.
Unfortunately, you have read what I wrote above inattentively.
It was already mentioned there that the Windows app currently already ships with 7za.exe version 9.20 (you can find it in the project folder).
So nothing changes. |
|
|
Jim1348Send message
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0 Level
Scientific publications
|
Yes, I do have GPUGrid installed on my Win10 machine after all.
And 7za.exe is in the project folder there, just not in the project folder on my Ubuntu machine. |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
It requires each volunteer to add support manually to their hosts.
No
Unfortunately, you have inattentively read what I wrote above.
It has already been mentioned there that is currently Windows app already comes with 7za.exe version 9.20(you can find it in project folder).
So nothing changing.
OK, then you can thank Richard Haselgrove for the application now packaging that utility. Originally, the tasks failed because Windows does not come with that utility, and Richard helped debug the issue with the developer.
If you think the application is not using the utility correctly, you should inform the developer of your analysis and code fix so that other Windows users can benefit. |
|
|
|
you should inform the developer of your analysis and code fix so that other Windows users can benefit.
I have already sent abouh a PM pointing to this thread, just in case. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Hello, thank you very much for your help. I would like to implement the changes if they help optimise the tasks, but let me try to summarise your ideas to see if I got them right:
Change A --> As you say, the original .tar.gz file is first copied to the working directory and then unpacked in a 2-step process (tar.gz to tar, and tar to plain files), and the tar.gz and tar files are left lying around after that. You suggest that these files should be deleted to save space, and I agree, that makes sense. The sequence should probably be:
1) move .tar.gz file from project directory to working directory.
2) unpack .tar.gz to .tar
3) delete .tar.gz file
4) unpack .tar file to plain files
5) delete .tar file
This one is straightforward to implement.
Change B --> Additionally, you suggest replacing the copying and the 2-step unpacking process with a single-step process using the command line you propose. So the sequence would be further simplified to:
1) unpack .tar.gz to plain files
2) delete .tar.gz file
The only problem I see here is that I believe I cannot modify the step of first copying the files from the project directory (\projects\www.gpugrid.net) to the working directory (\slots\X\). It is common to all projects, even those that do not contain files to be unpacked later. So, not to mess with other GPUGRID projects, the sequence should be:
1) move .tar.gz file from project directory to working directory.
2) unpack .tar.gz to plain files
3) delete .tar.gz file
In this case, would the command line simply be this one, without the -o flag?
7za x "X:\BOINC\projects\www.gpugrid.net\pythongpu_windows_x86_64__cuda1131.tar.gz.1a152f102cdad20f16638f0f269a5a17" -so | 7za x -aoa -si -ttar
Change C --> Finally, you suggest using .7z compression instead of .tar.gz, to save space and unpacking time with a more recent version of 7za.
Is all the above correct?
I believe these changes are worth implementing, thank you very much. I will start with Change A and Change B and roll them out to PythonGPUbeta first to test them this week.
____________
|
|
|
|
Looks good to me. Just one question - are there any 'minimum Windows version' constraints on the later versions of 7za? I think it's unlikely to affect us, but it would be good to check, just in case.
I mention it, because the original trial runs used native Windows tar decompression (the same as the Linux implementation): but that was only introduced in later versions of Windows 10 and 11. Some of us (myself included) still use Windows 7, which supports 7z but not tar. A reasonable degree of backwards compatibility is desirable! |
|
|
|
Hi, abouh!
Change A:
You are correct.
Change B
You are correct.
2) If this can't be changed, or is too hard or would take too long to implement, it's no big deal.
In any case, pipelining still saves some time and space : )
In this case, would the command line simply be this one, without the -o flag?
Of course, if you launch 7za from the working directory (/slots/X), then the output flag is not necessary.
Change C
You are correct.
Using the 7z format (LZMA2 compression) significantly reduces the archive size, saving your bandwidth and some time in the unpacking process ; )
As I wrote above, the 7za command will be simpler, since the pipelining process will no longer be required.
NB! It is important to update the supplied 7za to the current version: since version 9.20, a lot of optimizations have been made for compression/decompression of 7z (LZMA) archives.
Just one question - are there any 'minimum Windows version' constraints on the later versions of 7za?
As mentioned on the 7-Zip homepage, the app supports all Windows versions since Windows 2000:
7-Zip works in Windows 10 / 8 / 7 / Vista / XP / 2019 / 2016 / 2012 / 2008 / 2003 / 2000.
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
As a very first step, I am trying to remove the .tar.gz file, and I am encountering a first issue. The steps of the jobs are specified in the job.xml file in the following way:
<job_desc>
<task>
<application>.\7za.exe</application>
<command_line>x .\pythongpu_windows_x86_64__cuda1131.tar.gz.1a152f102cdad20f16638f0f269a5a17 -y</command_line>
</task>
<task>
<application>.\7za.exe</application>
<command_line>x .\pythongpu_windows_x86_64__cuda1131.tar -y</command_line>
</task>
....
</job_desc>
Essentially, I need to execute a task that removes the pythongpu_windows_x86_64__cuda1131.tar.gz.1a152f102cdad20f16638f0f269a5a17 file right after the very first task.
When I try in the Windows command prompt:
cmd.exe /C "del pythongpu_windows_x86_64__cuda1131.tar.gz.1a152f102cdad20f16638f0f269a5a17"
it works. However, when I add this to the job.xml file:
<task>
<application>cmd.exe</application>
<command_line>/C "del .\pythongpu_windows_x86_64__cuda1131.tar.gz.1a152f102cdad20f16638f0f269a5a17"</command_line>
</task>
the wrapper seems to ignore it. Doesn't the wrapper have access to cmd.exe? I need to run more tests to figure out the exact command to delete files.
____________
|
|
|
|
<task>
<application>cmd.exe</application>
<command_line>/C "del .\pythongpu_windows_x86_64__cuda1131.tar.gz.1a152f102cdad20f16638f0f269a5a17"</command_line>
</task>
Try the %COMSPEC% variable as an alias for %SystemRoot%\system32\cmd.exe.
If that doesn't work, then I'm sure specifying the full path (C:\Windows\system32\cmd.exe) will work. |
|
|
|
In other news: it looks like we've finally crunched through all the tasks ready to send. All that remains are the ones in progress and the resends that will come from those.
Any more coming soon?
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
True! Specifying the whole path works:
<job_desc>
<task>
<application>C:\Windows\system32\cmd.exe</application>
<command_line>/C "del .\pythongpu_windows_x86_64__cuda1131.tar.gz.1a152f102cdad20f16638f0f269a5a17"</command_line>
</task>
</job_desc>
I have deployed this Change A to the PythonGPUbeta app, just to test whether it works on all Windows machines. I just sent a few (32) jobs. If it works fine, I will move on to introduce the other changes.
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
I will be running new experiments shortly. My idea is to use the whole capacity of the grid. I have noticed that a few months ago it could absorb around 800 tasks, and now it goes up to 1000! Thank you for all the support :)
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
The first batch I sent to PythonGPUbeta yesterday failed, but I figured out the problem this morning. I sent another batch to the PythonGPUbeta app an hour ago, and this time it seems to be working. It has Change A implemented, so disk usage is more optimised.
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Hello Aleksey!
I was looking at how to implement Change C, namely whether we can pack and unpack the task conda-environment files using the 7z format and recent versions of 7za.exe.
We use conda-pack to compress the conda environment, which we later unpack on the GPUGRID Windows machines using 7za.exe.
However, looking at the documentation, it seems 7z is not a format conda-pack can deal with: https://conda.github.io/conda-pack/cli.html
Apparently, the possible formats are: zip, tar.gz, tgz, tar.bz2, tbz2, tar.xz, txz, tar, parcel (?), squashfs (?)
So, if we switch from the current tar.gz, we could only go to one of these. Maybe tbz2 or txz? It seems these can be unpacked in a single step as well, if recent versions of 7za.exe can handle the format.
Any recommendation? :)
For tbz2, the file size is similar, slightly smaller. The txz file is substantially smaller, but took forever (30 mins) to compress:
2.0G pythongpu_windows_x86_64__cuda102.tar.gz
1.9G pythongpu_windows_x86_64__cuda102.tbz2
1.2G pythongpu_windows_x86_64__cuda102.txz
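The ranking of those three files matches what the underlying compressors typically do. A small sketch with Python's standard library illustrates the trend (synthetic, highly repetitive sample data standing in for the environment tarball; real conda environments will compress far less dramatically, but gzip vs. xz points the same way):

```python
import bz2
import gzip
import lzma

# Synthetic repetitive input; a stand-in for an environment tarball.
data = b"site-packages/torch/lib/libtorch_cuda.so\n" * 50_000

# Compress the same input with the algorithms behind tar.gz, tbz2 and txz.
sizes = {
    "gzip (tar.gz)": len(gzip.compress(data, compresslevel=6)),
    "bz2 (tbz2)": len(bz2.compress(data, compresslevel=9)),
    "xz (txz)": len(lzma.compress(data, preset=6)),
}
for fmt, size in sizes.items():
    print(fmt, size)
```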
____________
|
|
|
|
More tasks? I'm running dry ;)
____________
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
More tasks please, also. |
|
|
bibiSend message
Joined: 4 May 17 Posts: 14 Credit: 14,957,460,267 RAC: 38,617,332 Level
Scientific publications
|
Hi,
why not produce a zip file? The BOINC client can unzip such a file directly from the project folder to the slot, like with acemd3.
If that works, 7za.exe and these extra tasks are not necessary.
pythongpu_windows_x86_64__cuda1131.zip is 2.58 GB
pythongpu_windows_x86_64__cuda1131.tar.gz is 2.66 GB
|
|
|
|
Good day, abouh
This time it seems to be working. It has Change A implemented.
It's nice to hear that!
Maybe tbz2 or txz?
As I understand it, tbz2 and txz are just alternative file extensions for tar.bz2 and tar.xz.
So in fact these formats are tar containers compressed with bz2 or xz.
Therefore, this will still require the pipelining process, which, however, has practically no effect on the unpacking speed and only lengthens the command string.
In my test, unpacking a tar.xz completed in ~40 seconds.
It seems these can be unpacked in a single step as well, if recent versions of 7za.exe can handle the format.
The xz format has been supported since version 9.04 beta, but more recent versions support multi-threaded (de)compression, which is crucial for fast unpacking.
The txz file is substantially smaller, but took forever (30 mins) to compress.
This format uses the LZMA2 algorithm, the same one 7z uses by default, so the space saving should be the same with the same settings (--compress-level).
It's highly likely you forgot to use this flag:
--n-threads <n>, -j <n>
which sets the number of threads to use for compression. By default, conda-pack uses only 1 thread!
Also check --compress-level. Levels higher than 5 are not very effective in terms of compression time vs. archive size.
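Putting those flags together, the pack command could look something like this (a sketch only: the environment name and output path are hypothetical, and the flags are the ones from the conda-pack CLI documentation quoted above, with -1 meaning "all cores"):

```shell
# Hypothetical conda-pack invocation: xz output, all CPU threads,
# moderate compression level, per the discussion above.
conda-pack -n pythongpu \
    --format tar.xz \
    --n-threads -1 \
    --compress-level 5 \
    -o pythongpu_windows_x86_64__cuda1131.tar.xz
```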
Considering that, as far as I can tell, the PythonGPU app file rarely changes, it's not a big deal.
As far as I remember, this has (practically) no effect on unpacking speed.
In my test (32 threads / Threadripper 2950X), it took ~2.5 minutes with compress-level 5 (archive size 1.55 GiB). |
|
|
|
why not produce a zip file? The BOINC client can unzip such a file directly from the project folder to the slot, like with acemd3.
You're probably right.
I somehow didn't pay attention to the acemd3 archives in the project directory.
Is there some info on how BOINC works with archives?
I suppose the BOINC client uses its built-in library (zlib?) to work with archives, rather than some OS functions/tools.
There's still a dilemma:
1) On the one hand, using the zip format would simplify the application launch process and reduce the amount of disk space the application requires (no need to copy the archive to the working directory). The amount of data written to disk is reduced accordingly.
2) On the other hand, the xz format reduces the archive size by a whole GiB, which helps save the project's network bandwidth, and the time to download the necessary files when a user first attaches to the project. |
|
|
|
In my test (32 threads / Threadripper 2950X), it took ~2.5 minutes with compress-level 5 (archive size 1.55 GiB).
That was about compression*, to be clear. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
We tried to pack the files with zip at first, but encountered problems on Windows. Not sure if it was some strange quirk in the wrapper or in conda-pack (the tool for creating, packing and unpacking conda environments, https://conda.github.io/conda-pack/), but the process failed for compressed environment files above a certain size.
We then tried to use another format that could compress the files to a smaller size than .zip. We tried .tar, but not all Windows versions have tar.exe (old ones do not).
We finally settled on sending 7za.exe along with the conda-packed environment, to be able to unpack it as part of the job.
I am not 100% sure, but I suspect acemd3 does not use the PyTorch machine learning framework, which increases the size of the packed environment substantially. I believe acemd4 does use PyTorch, and faces the same issue as the PythonGPU tasks.
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
You were absolutely right, I forgot the number of threads! I can now reproduce a much faster compression as well.
I will now test whether I can use the BOINC wrapper and a newer version of 7za.exe to unpack it locally in a reasonable amount of time, and then deploy it to PythonGPUbeta for testing.
Thank you very much!
____________
|
|
|
bibiSend message
Joined: 4 May 17 Posts: 14 Credit: 14,957,460,267 RAC: 38,617,332 Level
Scientific publications
|
Hi abouh,
the provided 7za.exe is version 9.20 from 2010. The latest version on 7-zip.org is 22.01 (now 7z.exe).
If you want to unpack in a pipe or delete the tar file, you need cmd. But the starter used, wrapper_6.1_windows_x86_64.exe (see the project folder), doesn't know about the environment, and the Windows folder isn't necessarily C:\Windows, so you should also provide cmd.exe.
Unpacking in a pipe:
<task>
<application>.\cmd.exe</application>
<command_line>/c .\7za.exe -so x pythongpu_windows_x86_64__cuda1131.tar.xz | .\7za.exe -y -sifile.txt.tar x & exit</command_line>
<weight>1</weight>
</task>
Why conda-pack with the zip format is not working, I don't know.
|
|
|
bibiSend message
Joined: 4 May 17 Posts: 14 Credit: 14,957,460,267 RAC: 38,617,332 Level
Scientific publications
|
7z.exe calls the DLL; 7za.exe is standalone. You can find it in the "7-Zip Extra" package on https://7-zip.org/download.html
But your version works too.
|
|
|
|
the provided 7za.exe is version 9.20 from 2010. The latest version on 7-zip.org is 22.01
7z.exe calls the DLL; 7za.exe is standalone. You can find it in the "7-Zip Extra" package on https://7-zip.org/download.html
All this has already been discussed several posts above.
If only you had read before writing...
so you should also provide cmd.exe.
I think this is not a good idea.
Some antivirus software may perceive an attempt to launch cmd.exe from anywhere other than the system directory as suspicious/malicious activity.
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
I added the discussed changes and deployed them to the PythonGPUbeta app. More specifically:
1. I changed the 7za.exe executable to (I believe) the latest version; a much newer one than the one previously used, in any case.
2. I now compress the conda-environment files to .txz. I use the default --compress-level (4), because I tried 9 and the compressed file size was the same.
As Aleksey mentioned, the unpacking still needs to be done in 2 steps, but at least the sent files are now smaller thanks to a more efficient compression.
Did anyone catch any of the PythonGPUbeta jobs? They seemed to work.
Regarding what bibi mentioned, \Windows\System32\cmd.exe seems to be present on all Windows machines so far, or at least I have not seen any job fail because of it. I have sent 64 test jobs in total.
____________
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
No, I haven't been lucky enough yet to snag any of the beta tasks. |
|
|
|
One of my Linux machines has just crashed two tasks in succession with
UnboundLocalError: local variable 'features' referenced before assignment
https://www.gpugrid.net/results.php?hostid=508381
Edit - make that three. And a fourth looks to be heading in the same direction; many other users have tried it already. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Thanks for the warning, Richard. I have just fixed the error; it should not be present in the jobs starting a few minutes from now.
____________
|
|
|
|
Yes, the next one has got well into the work zone - 1.99%. Thank you. |
|
|
KAMasudSend message
Joined: 27 Jul 11 Posts: 137 Credit: 523,901,354 RAC: 0 Level
Scientific publications
|
Just an observation:
BOINC does not seem to count a GPUGrid task as a task. Yesterday my finger brushed against Moo's "allow new WUs" button and it promptly downloaded 12 WUs. All 12 were running, with the GPUGrid task also running! I have never seen that before. I took remedial action; none errored out. |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
I tried to run 1 Python task in a second BOINC instance.
So far, the Pythons have run in the "regular" instance, one task each on two RTX 3070s, without problems. Runtime was about 22-23 hours.
In the "regular" instance I now run 2 PrimeGrid tasks, the kind with GPU use only, no CPU use.
Hence, running Pythons in addition would be a nice supplement, using a lot of CPU and only part of the GPU.
After I started a Python in the second BOINC instance, all ran normally for a short while: CPU usage climbed to close to 100%, VRAM usage was close to 4GB, system RAM some 8GB.
However, after a few minutes, the Python's CPU usage went down to about 15%. RAM and VRAM usage stayed at the same level as before.
The progress bar in the BOINC manager showed some 2.980% after about 3 hours, so it was clear that something was going wrong, and I aborted the task.
The stderr can be seen here: https://www.gpugrid.net/result.php?resultid=33056430
I then started another task, just to rule out that the problem was a one-timer. However: same problem again.
What's going wrong?
FYI, I recently ran a total of 3 Pythons on 2 RTX 3070s, which means two Pythons were crunched simultaneously on one of the cards. No problem at all; the total runtime for each of the two tasks was just a little longer than for 1 task per GPU.
|
|
|
KAMasudSend message
Joined: 27 Jul 11 Posts: 137 Credit: 523,901,354 RAC: 0 Level
Scientific publications
|
My question is: how can 13 tasks run on a 12-thread machine? Is it a good idea to run other tasks alongside? Also, why was BOINC not taking the GPUGrid task into account? |
|
|
|
If the 13th task is assessed - by the project and BOINC in conjunction - to require less than 1.0000 of a CPU, it will be allowed to run in parallel with a fully occupied CPU. For a GPU task, it will run at a slightly higher CPU priority, so it will steal CPU cycles from the pure CPU tasks - but on a modern multitasking OS, they won't notice the difference. |
|
|
|
Erich56 wrote:
I tried to run 1 Python on a second BOINC instance.
...
What's going wrong?
i think you're trying to do too much at once. 22-24hrs is incredibly slow for a single task on a 3070. my 3060 does them in 13hrs, doing 3 tasks at a time (4.3hrs effective speed).
if you want any kind of reasonable performance, you need to stop processing other projects on the same system. or at the very least, adjust your app_config file to reserve more CPU for your Python task to prevent BOINC from running too much extra work from other projects.
switch to Linux for even better performance.
____________
|
|
|
jjchSend message
Joined: 10 Nov 13 Posts: 101 Credit: 15,569,300,388 RAC: 3,786,488 Level
Scientific publications
|
Erich56
The first two tasks I checked, you didn't let them finish extracting. The others look a bit inconclusive; however, you restarted the tasks, so that could be it.
Leave them alone and let them run. If they stall at 2% for an extended time check the stderr file to see if there is an error that should be addressed.
Look to see if they are actually running or not before you abort. If it's working, it should get to the "Created Learner." step and continue running from there.
There are some jobs that just fail with an unknown cause but these haven't gotten that far yet.
8GB of system memory is on the low side to run the Python apps successfully. It can be done, but you really shouldn't be running anything else.
Also, the Python apps need up to 48GB of swap space configured on Windows systems. If you haven't already done it, I would suggest increasing it.
Simplify your troubleshooting and cut down on the variables. Run only one Boinc instance and one Python task. See how that goes first.
After you confirm that's working you can possibly run an additional Python task or maybe a different GPU project at the same time.
While generally you do want to maximize the usage of your system it's not good to slam it to the ceiling either. |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
Ian&Steve C. wrote:
i think you're trying to do too much at once. 22-24hrs is incredibly slow for a single task on a 3070. my 3060 does them in 13hrs, doing 3 tasks at a time (4.3hrs effective speed).
if you want any kind of reasonable performance, you need to stop processing other projects on the same system. or at the very least, adjust your app_config file to reserve more CPU for your Python task to prevent BOINC from running too much extra work from other projects.
switch to Linux for even better performance.
I agree, at the moment it may be "too much at once" :-)
FYI, I recently bought another PC with 2 CPUs (8-c/8-HT each) and 1 GPU, I upgraded the RAM from 128GB to 256GB and created a 128GB Ramdisk;
and on an existing PC with a 10-c/10-HT CPU plus 2 RTX3070 I upgraded the RAM from 64GB to 128GB (=maximum possible on this MoBo).
So no surprise that now I am just testing what's possible. And by doing this, I keep finding out, of course, that sometimes I am expecting too much.
As for the (low) speed of my two RTX3070s: I have always been very conservative about GPU temperatures, which means I run them at about 60-61°C, not higher.
With two such GPUs inside the same box, heat of course is an issue. Despite good airflow, in order to keep the GPUs at the above-mentioned temperature, I need to throttle them down to about 50-65% (different for each GPU). So this explains the longer runtimes of the Pythons.
If I had two boxes with 1 RTX3070 inside each, I am sure there would be no need for throttling. |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
jjch wrote:
Erich56
The first two tasks I checked you didn't let them finish extracting.
...
While generally you do want to maximize the usage of your system it's not good to slam it to the ceiling either.
thanks for taking the time to deal with my problem.
well, by now it's become clear to me what the cause of the failure was:
obviously, running a Primegrid GPU task and a Python on the same GPU does not work for the Python. After a Primegrid task finished, I started another Python, and it runs well.
Regarding memory, you may have misunderstood: when I mentioned the 8GB, I meant that I could see in the Windows Task Manager that Python was using 8GB. Total RAM on this machine is 64GB, so more than enough.
The same goes for the swap space: I had set it manually to 100GB min. and 150GB max., so also more than enough.
Again - the problem has been found anyway. Whereas I had no problem running two Pythons on the same GPU (even 3 might work), it is NOT possible to run a Python alongside a Primegrid task.
So for me, this was a good learning process :-)
Again, thanks anyway for your time investigating my failed tasks.
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
I just discovered the following problem on the PC which consists of:
2 CPUs Xeon E5 8-core / 16-HT each.
1 GPU Quadro P5000
128 GB Ramdisk
128 GB system memory
until a few days ago, I ran 2 Pythons simultaneously (with a setting in the app_config.xml: 0.5 gpu usage).
Now, while only 1 Python is running and I push the update button on the BOINC manager for fetching another Python, the BOINC event log tells me that no Pythons are available. Which is not the case though, as the server status page shows some 550 tasks for download; besides, I just downloaded one on another PC.
BTW: the Python task uses only some 50% of the processor - which seems logical with 2 CPUs inside.
So I tried to download tasks from other projects, and in all cases the event log says:
not requesting tasks: don't need (CPU; NVIDIA GPU: job cache full).
How can that be the case?
In the BOINC computing preferences, I now set the "store at least work" to 10 days, and under "store up to an additional" also 10 days. However, this did not solve the problem.
There is about 94GB free space on the Ramdisk, and some 150GB free system RAM.
What also catches my eye: the one running Python, which right now shows 45% progress after some 10 hours, shows a remaining runtime of 34 days!
Before, like on my other machines, remaining runtime for Pythons was indicated as 1-2 days.
Could this entry be the cause why nothing else can be downloaded and I get the message "job cache full"?
Can anyone help me to get out of this problem? |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
I just discovered the following problem on the PC which consists of:
...
Can anyone help me to get out of this problem?
Meanwhile, the problem has become even worse:
After downloading 1 Python, it starts, and the BOINC manager shows a remaining runtime of about 60 days (!!!). In reality, the task proceeds at normal speed and will be finished within 24 hours, like all other tasks before on this machine.
Hence, nothing else can be downloaded.
When trying to download tasks from other projects, it shows
not requesting tasks: don't need (CPU; NVIDIA GPU: job cache full).
when I try to download a second Python, it says "no tasks are available for Python apps for GPU hosts", which is not correct; there are some 150 available for download at the moment.
Can anyone give me advice how to get this problem solved? |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
It can't. Due to the dual nature of the python tasks, BOINC has no mechanism to correctly show the estimated time to completion.
The tasks do not take the time shown to complete and can in fact be returned well within the standard 5 day deadline. |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
It can't. Due to the dual nature of the python tasks, BOINC has no mechanism to correctly show the estimated time to completion.
The tasks do not take the time shown to complete and can in fact be returned well within the standard 5 day deadline.
But how come that on three other of my systems, on which I have been running Pythons for a while, the "remaining runtimes" are shown pretty correctly (+/- 24 hours)?
And also on the machine in question, up to recently the time was indicated okay.
Something must have happened yesterday, but I do not know what.
If your explanation were right, no BOINC instance could run more than 1 Python in parallel.
Didn't you say somewhere here in the forum that you are running 3 Pythons in parallel? How can a second and a third task be downloaded if the first one shows a remaining runtime of 30 or 60 days?
What are the remaining runtimes shown for your Pythons once they get started?
|
|
|
kksplaceSend message
Joined: 4 Mar 18 Posts: 53 Credit: 2,591,271,749 RAC: 6,720,230 Level
Scientific publications
|
Let me offer another possible "solution". (I am running two Python tasks on my system.) I found I had to set my Resource Share much, much higher for GPUGrid to effectively share with other projects. I originally had Resource Shares of 160 for GPUGrid vs 10 for Einstein and 40 for TN-Grid. Since the Python tasks 'use' so much CPU time in particular (at least reported CPU time), it seems to affect the Resource Share calculations as well. I had to move my Resource Share for GPUGrid (for example) to 2,000 to get it both to do two at once and to get Boinc to share with Einstein and TN-Grid roughly the way I wanted. (Nothing magic about my Resource Share ratios; just providing an example of how extreme I went to get it to balance the way I wanted.)
Regarding the estimated time to completion, I have not seen them correct on my system yet, though it is getting better. At first, Python tasks were starting at 1338 days (!) and are now down to 23 days at the start. Interesting to hear some of yours are showing correctly! What setup are you using on the hosts showing correct times? |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
No, that was my teammate who is running 3X concurrent on his gpus.
He runs nothing but GPUGrid on those hosts.
I OTOH run multiple projects at the same time on my hosts. So the GPUGrid tasks have to share resources. That is a balancing act.
I run a custom client that allows me to get around the normal BOINC client and project limitations. I can ask for as much or as little amount of work that I want on any host.
Currently, I am running a single task on half a gpu in each host. I tried to run 2X on the gpu but I don't have enough resources to support 2 tasks on the host and run all my other projects at the same time. But the task runs well sharing the gpu with my other gpu projects. Keeps the gpu utilization much higher than if running only the Python task.
The GPUGrid tasks start up with multiple hundreds of days expected before completion. That drops down to only a couple of days once the task gets over 90% completion.
This is what BoincTasks is showing for the 5 tasks I am currently running on my hosts for estimated completion times.
GPUGRID 4.03 Python apps for GPU hosts (cuda1131) e00014a06316-ABOU_rnd_ppod_expand_demos25-0-1-RND9172_3 01:05:30 (02:57:04) 90.11 3.970 157d,17:33:34 10/7/2022 4:27:00 PM 3C + 0.5NV (d1) Running High P. Darksider
GPUGRID 4.03 Python apps for GPU hosts (cuda1131) e00005a00032-ABOU_rnd_ppod_expand_demos25_2-0-1-RND9669_0 13:30:26 (04d,00:21:21) 237.79 34.660 27d,12:31:49 10/7/2022 4:02:16 AM 3C + 0.5NV (d2) Running High P. Numbskull
GPUGRID 4.03 Python apps for GPU hosts (cuda1131) e00012a04847-ABOU_rnd_ppod_expand_demos25-0-1-RND2344_4 13:27:51 (01d,09:45:50) 83.59 48.520 10d,20:41:45 10/7/2022 4:05:00 AM 3C + 0.5NV (d1) Running High P. Pipsqueek
GPUGRID 4.03 Python apps for GPU hosts (cuda1131) e00015a05913-ABOU_rnd_ppod_expand_demos25-0-1-RND9942_0 21:04:49 (05d,14:22:40) 212.49 39.610 28d,03:53:33 10/6/2022 8:04:45 PM 3C + 0.5NV (d2) Running High P. Rocinante
GPUGRID 4.03 Python apps for GPU hosts (cuda1131) e00008a00044-ABOU_rnd_ppod_expand_demos25_2-0-1-RND2891_2 01:23:31 (02:53:39) 69.30 3.970 22d,07:56:42 10/7/2022 4:09:00 PM 3C + 0.5NV (d0) Running High P. Serenity
I'll finish all of the tasks before 24 hours on the high clocked hosts for maximum credit awards. I'll miss out on the 24 hour bonus by a half hour or so on the server hosts because of their slower clocks.
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
Regarding the estimated time to completion, I have not seen them correct on my system yet, though it is getting better. At first, Python tasks were starting at 1338 days (!) and are now down to 23 days at the start. Interesting to hear some of yours are showing correctly! What setup are you using on the hosts showing correct times?
On one of my hosts a new Python started some 25 minutes ago. "Remaining time" is shown as 13 hrs.
No particular setup. In the past years, this host had crunched numerous ACEMD tasks. Since a few weeks ago, it's crunching Pythons. GTX980Ti. Besides, 2 "Theory" tasks from LHC are running. |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
kksplace wrote:
Let me offer another possible "solution". (I am running two Python tasks on my system.) I found I had to change my Resource Share much, much higher for GPUGrid to effectively share other projects. ...
well, my target on this machine is, in fact, not to share Pythons with other projects.
It would simply make me happy if I could run 2 (or perhaps 3) Pythons simultaneously. The hardware should be sufficient.
That said, I guess in this case the resource share would not play any role.
BTW: as mentioned before, until early last week I did run two Pythons simultaneously on this PC. I have no idea though what the indicated remaining runtimes were. Most probably not as high as now, otherwise I could not have downloaded and started two Pythons in parallel.
So, any idea what I can do to make this machine run at least 2 Pythons (if not 3)? |
|
|
kksplaceSend message
Joined: 4 Mar 18 Posts: 53 Credit: 2,591,271,749 RAC: 6,720,230 Level
Scientific publications
|
I am limited on any technical knowledge and can only speak how I got mine to work with 2 tasks. Sorry I can't help anymore. As to getting 3 tasks, my understanding from other posts and my own attempt is that you can't without a custom client or some other behind-the-scenes work. The '2 tasks at one time' limit is a GPUGrid restriction somewhere. |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Yes, the project has a max 2 tasks per gpu limit with project max of 16 tasks.
You normally would just implement a app_config.xml file to get two tasks running concurrently on a gpu.
<app_config>
<app>
<name>PythonGPU</name>
<gpu_versions>
<gpu_usage>0.5</gpu_usage>
<cpu_usage>3.0</cpu_usage>
</gpu_versions>
</app>
</app_config>
That has been the same quota since project inception. The only way to get around it is to spoof the gpu count via locking down the coproc_info.xml file in the BOINC folder. |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
...
<app_config>
<app>
<name>PythonGPU</name>
<gpu_versions>
<gpu_usage>0.5</gpu_usage>
<cpu_usage>3.0</cpu_usage>
</gpu_versions>
</app>
</app_config>
...
Keith, just for my understanding:
what exactly does the entry
<cpu_usage>3.0</cpu_usage>
do?
|
|
|
|
...
<app_config>
<app>
<name>PythonGPU</name>
<gpu_versions>
<gpu_usage>0.5</gpu_usage>
<cpu_usage>3.0</cpu_usage>
</gpu_versions>
</app>
</app_config>
...
Keith, just for my understanding:
what exactly does the entry
<cpu_usage>3.0</cpu_usage>
do?
Exactly what I said in my previous message.
adjust your app_config file to reserve more CPU for your Python task to prevent BOINC from running too much extra work from other projects.
What Keith suggested would tell BOINC to reserve 3 whole CPU threads for each running PythonGPU task.
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Hello!
Today I will deploy the changes tested last week in PythonGPUbeta to the PythonGPU app. The changes only affect Windows machines, and should result in smaller initial downloads and slightly lower memory requirements.
As we discussed, for now the initial data unpacking still needs to be done in two steps, but using a more recent version of 7za.exe.
I did not detect any errors in the PythonGPUbeta tasks, so hopefully this change will not affect jobs in PythonGPU either.
____________
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Keith, just for my understanding:
what exactly does the entry
<cpu_usage>3.0</cpu_usage>
do?
It tells BOINC to take 3 cpus away from the available resources that BOINC thinks it has to work with.
That tells BOINC to not commit resources to other projects that it doesn't have so that you aren't running the cpu overcommitted.
It is only for BOINC scheduling of available resources. It does not impact the running of the Python task in any way directly. Only the scientific application itself determines how much CPU the task and application will use.
You should never run a cpu in overcommitted state because that means that EVERY application including internal housekeeping is constantly fighting for available resources and NONE are running optimally. IOW's . . . . slooooowwwly.
You can check your average cpu loading or utilization with the uptime command in the terminal. You should strive to get numbers that are less than the number of cores available to the operating system.
If you have a cpu that has 16 cores/32 threads available to the OS, you should strive to use only up to 32 threads over the averaging periods.
The uptime command besides printing out how long the system has been up and running also prints out the 1 minute / 5 minute / 15 minute system average loadings.
As an example on this AMD 5950X cpu in this daily driver this is my uptime report.
keith@Pipsqueek:~$ uptime
00:15:16 up 7 days, 14:41, 1 user, load average: 30.16, 31.76, 32.03
The cpu is right at the limit of maximum utilization of its 32 threads.
So I am running it at 100% utilization most of the time.
If the averages were higher than 32, then that shows that the cpu is overcommitted and trying to do too much all the time and not running applications efficiently.
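Keith's rule of thumb can also be checked programmatically. A minimal sketch (Linux/macOS only, since os.getloadavg() is not available on Windows; the threads parameter is just an illustrative override, not anything BOINC provides):

```python
import os

def cpu_overcommitted(threads=None):
    """Return True if the 1-minute load average exceeds the number of
    threads the OS can schedule - the 'overcommitted' state described
    above, where every process is fighting for CPU cycles."""
    if threads is None:
        threads = os.cpu_count()  # e.g. 32 on a 16-core/32-thread CPU
    load_1m, _load_5m, _load_15m = os.getloadavg()  # same numbers uptime prints
    return load_1m > threads
```

On the 5950X example above, load averages around 30-32 against 32 threads would come out just under the limit, matching the "right at maximum utilization" reading.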
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Thanks for the notice, abouh. Should make the Windows users a bit happier with the experience of crunching your work. |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
Keith, just for my understanding:
what exactly does the entry
<cpu_usage>3.0</cpu_usage>
do?
It tells BOINC to take 3 cpus away from the available resources that BOINC thinks it has to work with.
...
You can check your average cpu loading or utilization with the uptime command in the terminal. You should strive to get numbers that are less than the number of cores available to the operating system.
...
thanks, Keith, for the thorough explanation. Now everything is clear to me.
Regarding CPU loading/utilization, so far I have been looking at the Windows Task Manager, which shows a (rough?) percentage at the top of the "CPU" column.
However, for me the question still is how I can get my host with the vast hardware resources (as described here:
https://www.gpugrid.net/forum_thread.php?id=5233&nowrap=true#59383) to run at least 2 Pythons concurrently - as was already the case before.
Isn't there a way to get these much too high "remaining time" figures back to reality?
Or any other way to get more than 1 Python downloaded despite these high figures? |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Isn't there a way to get these much too high "remaining time" figures back to reality?
Or any other way to get more than 1 Python downloaded despite these high figures?
There isn't any way to get the estimated time remaining down to reasonable values as far as we know without a complete rewrite of the BOINC client code.
Or ask @kksplace how he managed to do it.
Try increasing your cache to 10 days' worth of work and see if you pick up the second task.
Are you running with 0.5 gpu_usage via the app_config.xml example I posted?
You can spoof 2 gpus being detected by BOINC which would automatically increase your gpu task allowance to 4 tasks. You need to modify the coproc_info.xml file and then lock it down to immutable state so BOINC can't rewrite it.
Google spoofing gpus in the Seti and BOINC forums on how to do that. |
|
|
|
Try to increase your amount of day's cache to 10 and see if you pick up the second task.
Counterintuitively, this can actually cause the opposite reaction on a lot of projects.
if you ask for "too much" work, some projects will just shut you out and tell you that no work is available, even when it is. I don't know why, I just know it happens. this is probably why he can't download work.
I would actually recommend keeping this value no larger than 2 days.
____________
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
I was assuming that GPUGrid was the only project on his host.
I agree that increasing the value with more than one single project on the host is often deleterious. |
|
|
|
I think GPUGRID is one of the projects that reacts negatively to having the value too high.
but no, based on his daily contributions for this host via FreeDC, he's contributing to several projects.
____________
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
I was assuming that GPUGrid was the only project on his host.
at the time I was trying to download and crunch 2 Pythons: YES - no other projects were running at that time.
Meanwhile, until the problem gets solved, I am running 1 CPU and 1 GPU project on this host.
|
|
|
|
Today I will deploy the changes tested last week in PythonGPUbeta to the PythonGPU app. The changes only affect Windows machines, and should results in downloading smaller initial files, and slightly less memory requirements.
Thank you, abouh!
Let's try the new tasks :)
The disk space requirements for PythonGPU tasks probably need to be adjusted now too, don't they? |
|
|
|
I was assuming that GPUGrid was the only project on his host.
at the time I was trying to download and crunch 2 Pythons: YES - no other projects running at that time.
Meanwhile, until the problem get's solved, I have running 1 CPU and 1 GPU project on this host.
even if you solve the problem, you won't get more tasks until you change the GPUGRID task to use 0.5 GPU for 2x.
____________
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
even if you solve the problem, you won't get more tasks until you change the GPUGRID task to use 0.5 GPU for 2x.
this is what I did anyway |
|
|
jjchSend message
Joined: 10 Nov 13 Posts: 101 Credit: 15,569,300,388 RAC: 3,786,488 Level
Scientific publications
|
Good news since the recent changes to the Windows environment: I have seen a great increase in successful tasks. It seems that others have too, as my ranking has dropped a bit.
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
So good to hear that!
____________
|
|
|
|
When I paused a workunit and restarted BOINC, BOINC copied the pythongpu_windows_x86_64__cuda1131.txz file into the slot directory again.
The file had already been extracted to pythongpu_windows_x86_64__cuda1131.tar and deleted. |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
Ian&Steve C. wrote:
even if you solve the problem, you won't get more tasks until you change the GPUGRID task to use 0.5 GPU for 2x.
as said before, I had done this change in the app_config.xml.
After a few days of running other projects on this host, I tried GPUGRID again.
In the end, I got 2 tasks downloaded (although I would have expected 4, since I had tweaked the coproc_info.xml to show 2 GPUs - so obviously this tweak has no effect, for whatever reason).
Then, the next disappointment:
although 2 Pythons were downloaded, only one started; the other one stayed in "ready to start" status.
A look at the status line of the inactive task revealed why: it says "0.988 CPUs + 1 NVIDIA GPU", although in the app_config.xml I have set "<gpu_usage>0.5</gpu_usage>".
In fact, I am using exactly the same app_config.xml on another host (with less hardware resources), and there it works - 2 Pythons are crunched simultaneously, and the status line of each task says "0.988 CPUs + 0.5 NVIDIA GPUs".
FYI, the complete app_config reads as follows:
<app_config>
<app>
<name>PythonGPU</name>
<max_concurrent>2</max_concurrent>
<gpu_versions>
<gpu_usage>0.5</gpu_usage>
<cpu_usage>1.0</cpu_usage>
</gpu_versions>
</app>
</app_config>
What could be the reason why neither the above-mentioned entry in the coproc_info.xml nor the "0.5 GPU" entry in the app_config.xml has the expected effect?
I have been using these changes to 0.5 GPU (or even 0.33 and 0.25 GPU - when crunching WCG OPNG tasks) in various projects - it has always worked.
Why does it not work with GPUGRID on this particular host?
This is especially annoying since this host has 2 CPUs and hence would be ideal for crunching 2 Pythons in parallel. Actually, I think that even 3 Pythons would work well (the VRAM of the GPU is 16GB, so no problem on that side).
Can anyone give me hints as to what I could do?
|
|
|
|
You can reduce the hard drive requirement by 1.93 GB if you remove these files from E:\programdata\BOINC\slots\1\Lib\site-packages\torch\lib after windows_fix.py has finished disabling ASLR and making the .nv_fatb sections read-only.
05.01.2022 10:28 70 403 584 cudnn_ops_train64_8.dll_bak
05.01.2022 10:23 88 405 504 cudnn_ops_infer64_8.dll_bak
03.08.2022 04:04 1 329 664 torch_cuda_cpp.dll_bak
05.01.2022 11:21 81 487 360 cudnn_cnn_train64_8.dll_bak
05.01.2022 10:36 129 872 896 cudnn_adv_infer64_8.dll_bak
05.01.2022 10:46 97 293 824 cudnn_adv_train64_8.dll_bak
03.08.2022 05:05 871 934 464 torch_cuda_cu.dll_bak
05.01.2022 11:15 736 718 848 cudnn_cnn_infer64_8.dll_bak
Can you distribute these dlls already patched with python environment, or does NVIDIA license agreement forbid it? |
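For anyone wanting to automate that cleanup, a small sketch (the path in the usage note is the example slot path from the listing above - adjust it to your own BOINC data directory, and only run it after windows_fix.py has finished patching):

```python
from pathlib import Path

def remove_patch_backups(lib_dir):
    """Delete the *_bak copies that windows_fix.py leaves beside the
    patched CUDA/cuDNN DLLs; return the number of bytes freed."""
    freed = 0
    for bak in Path(lib_dir).glob("*_bak"):
        freed += bak.stat().st_size  # size of the backup copy
        bak.unlink()                 # the patched .dll itself is untouched
    return freed
```

Called with e.g. r"E:\programdata\BOINC\slots\1\Lib\site-packages\torch\lib", it should report roughly the 1.93 GB from the listing above.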
|
|
|
I just discovered the following problem on the PC which consists of:
2 CPUs Xeon E5 8-core / 16-HT each.
1 GPU Quadro P5000
128 GB Ramdisk
128 GB system memory
until a few days ago, I ran 2 Pythons simultaneously (with a setting in the app_config.xml: 0.5 gpu usage).
Now, while only 1 Python is running and I push the update button on the BOINC manager for fetching another Python, the BOINC event log tells me that no Pythons are available. Which is not the case though, as the server status page shows some 550 tasks for download; besides, I just downloaded one on another PC.
BTW: the Python tasks uses only some 50% of the processor - which seems logical with 2 CPUs inside.
So I tried to download tasks from other projects, and in all cases the event log says:
not requesting tasks: don't need (CPU; NVIDIA GPU: job cache full).
How can that be the case?
In the BOINC computing preferences, I now set the "store at least work" to 10 days, and under "store up to an additional" also 10 days. However, this did not solve the problem.
There is about 94GB free space on the Ramdisk, and some 150GB free system RAM.
What also catches my eye: the one running Python, which right now shows 45% progress after some 10 hours, shows a remaining runtime of 34 days!
Before, like on my other machines, remaining runtime for Pythons was indicated as 1-2 days.
Could this be the reason why nothing else can be downloaded and I get the message "job cache full"?
Can anyone help me to get out of this problem?
Meanwhile, the problem has become even worse:
After downloading 1 Python, it starts, and in the BOINC manager it shows a remaining runtime of about 60 days (!!!). In reality, the task proceeds with normal speed and will be finished within 24 hours, like all other tasks before on this machine.
Hence, nothing else can be downloaded.
When trying to download tasks from other projects, it shows
not requesting tasks: don't need (CPU; NVIDIA GPU: job cache full).
when I try to download a second Python, it says "no tasks are available for Python apps for GPU hosts" which is not correct, there are some 150 available for download at the moment.
Can anyone give me advice on how to get this problem solved?
You can add <fraction_done_exact/> to your app_config.xml
|
|
|
|
Ian&Steve C. wrote:
even if you solve the problem, you won't get more tasks until you change the GPUGRID task to use 0.5 GPU for 2x.
as said before, I had done this change in the app_config.xml.
After a few days of having had run other projects on this host, I tried again GPUGRID.
After all, I got 2 tasks downloaded, although I would have expected 4, since I had tweaked the coproc_info.xml to show 2 GPUs (so obviously this tweak has no effect, for whatever reason).
Then, the next disappointment:
although 2 Pythons were downloaded, only one started, the other one stayed in "ready to start" status.
A look at the status line of the inactive task revealed why: it says "0.988 CPUs + 1 NVIDIA GPU", although in the app_config.xml I have set "<gpu_usage>0.5</gpu_usage>".
In fact, I am using exactly the same app_config.xml on another host (with less hardware resources), and there it works - 2 Pythons are crunched simultaneously, and the status line of each task says "0.988 CPUs + 0.5 NVIDIA GPUs".
FYI, the complete app_config reads as follows:
<app_config>
<app>
<name>PythonGPU</name>
<max_concurrent>2</max_concurrent>
<gpu_versions>
<gpu_usage>0.5</gpu_usage>
<cpu_usage>1.0</cpu_usage>
</gpu_versions>
</app>
</app_config>
What could be the reason why neither the above mentioned entry in the coproc_info.xml nor the "0.5 GPU" entry in the app_config.xml have the expected effect?
I have been using these changes to 0.5 GPU (or even 0.33 and 0.25 GPU - when crunching WCG OPNG tasks) in various projects - it always worked.
Why does it not work with GPUGRID on this particular host?
This is especially annoying since this host has 2 CPUs and hence would be ideal for crunching 2 Pythons in parallel. Actually, I think that even 3 Pythons would work well (the VRAM of the GPU is 16GB, so no problem from this side).
Can anyone give me hints as to what I could do?
several things.
first. after changing your app_config file to gpu_usage to 0.5, did you restart boinc or click "read config files" in the Options toolbar menu? you need to do this for any changes in your app_config to take effect. also even if you did click this, tasks downloaded as 1.0 GPU will not change their label to 0.5, but it will be treated as a 0.5 internally. to see this reflected in the task labeling you need to restart boinc.
next this line:
<max_concurrent>2</max_concurrent>
this will prevent more than 2 tasks from running. even if you download 4, only 2 will run. just letting you know in case this is not what you intended.
____________
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
several things.
first. after changing your app_config file to gpu_usage to 0.5, did you restart boinc or click "read config files" in the Options toolbar menu? you need to do this for any changes in your app_config to take effect. also even if you did click this, tasks downloaded as 1.0 GPU will not change their label to 0.5, but it will be treated as a 0.5 internally. to see this reflected in the task labeling you need to restart boinc.
next this line:
<max_concurrent>2</max_concurrent>
this will prevent more than 2 tasks from running. even if you download 4, only 2 will run. just letting you know in case this is not what you intended.
after changing an app_config file, I always click "read config files" in the Options toolbar menu. As said before, I have worked with app_config.xml files very often for several years, so I am for sure doing it correctly.
I know that tasks downloaded as 1.0 GPU will keep this label.
That is not the question here, though, because I had set the 0.5 GPU even before I started downloading Pythons. Since then, 5 Pythons were downloaded (3 of them finished and uploaded, 1 active, another one waiting to start), and all of them show 1.0 GPU, for unknown reasons.
I know the meaning of
<max_concurrent>2</max_concurrent>
thanks for the hint anyway.
So, as said before: it's totally unclear to me why the app_config does not work in this case. I am seeing this problem for the first time in all these years :-(
What I could still try, after the currently running Python is finished, is to restart BOINC. Maybe this helps; however, I doubt it. |
|
|
|
what does your event log say about your app_config file? maybe you have some whitespace error in it that's causing boinc to not read it properly. when you click read config files, does boinc give any error/warning/complaint about the GPUGRID app_config file?
or check that the file is properly named as 'app_config.xml' and that there's no typo and located in your gpugrid project folder
____________
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
what does your event log say about your app_config file? maybe you have some whitespace error in it that's causing boinc to not read it properly. when you click read config files, does boinc give any error/warning/complaint about the GPUGRID app_config file?
or check that the file is properly named as 'app_config.xml' and that there's no typo and located in your gpugrid project folder
I now double- and triple-checked everything you mentioned above.
Also, no error/warning/complaint after clicking read config files.
So this really is a huge conundrum :-(
What I did now was spoof the GPU count info in the coproc_info.xml, which caused a download of a total of 4 Pythons, but only 2 running (okay, I want to be modest: 2 are better than 1).
However, this cannot be the ultimate solution, since the GPU spoofing will have unwanted effects with other GPU projects.
So, at the bottom line: no idea what I can still do to get this app_config to work the way it's supposed to. |
|
|
|
but what does the event log say? does it claim to find the gpugrid app_config file? what you're describing sounds like BOINC is not reading the file. which can be because there's an error in the file or because you don't have the file in the right location.
please confirm which directory contains your GPUGRID app_config file, and post the Event Log output after clicking "read config files"
____________
|
|
|
|
What I did now was spoof the GPU count info in the coproc_info.xml, which caused a download of a total of 4 Pythons, but only 2 running (okay, I want to be modest: 2 are better than 1).
However, this cannot be the ultimate solution, since the GPU spoofing will have unwanted effects with other GPU projects.
So, at the bottom line: no idea what I can still do to get this app_config to work the way it's supposed to.
this is exactly what I would expect with the config you've described.
2x GPU spoofed = 4 tasks can download. if you have 2 running on a single GPU, then it's properly using 0.5 per GPU. the only way 2x can run on a single GPU is if the value 0.5 is being used. and only 2 running because of your max_concurrent statement (which you need for the spoofed GPU setup, otherwise it will try to run on the nonexistent second GPU and cause errors).
if you want to run 3x on a single GPU now, leave the GPU spoofing in place, change app_config to max_concurrent of 3, and change gpu_usage to 0.33
unless you know how to edit BOINC code and recompile a custom client, you will need to spoof the GPUs to get more tasks to download since the project enforces 2x tasks per GPU. there's no other solution.
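Putting that advice together, the 3x-on-one-GPU setup would look like this in app_config.xml - a sketch based on the file quoted earlier in the thread, with only the two values changed:

```xml
<app_config>
  <app>
    <name>PythonGPU</name>
    <max_concurrent>3</max_concurrent>
    <gpu_versions>
      <gpu_usage>0.33</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
```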
____________
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
but what does the event log say? does it claim to find the gpugrid app_config file? what you're describing sounds like BOINC is not reading the file. which can be because there's an error in the file or because you don't have the file in the right location.
please confirm which directory contains your GPUGRID app_config file, and post the Event Log output after clicking "read config files"
sorry, I had goofed before. The event log does complain, indeed:
10.10.2022 15:49:42 | GPUGRID | Found app_config.xml
10.10.2022 15:49:42 | GPUGRID | Missing </app> in app_config.xml
however, this does not make any sense, because </app> is not missing, is it?
<app_config>
<app>
<name>PythonGPU</name>
<fraction_done_exact>
<max_concurrent>3</max_concurrent>
<gpu_versions>
<gpu_usage>0.5</gpu_usage>
<cpu_usage>1.0</cpu_usage>
</gpu_versions>
</app>
</app_config>
(I had added the <fraction_done_exact> meanwhile)
As already said, this is exactly the same app_config which I use on another host, and there it works. I copied it.
And yes, the file is contained in the GPUGRID project folder.
|
|
|
|
the line <fraction_done_exact> is not right. that's breaking your file.
it needs to be <fraction_done_exact/>. you're missing the '/' before the close of the tag
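For reference, the corrected version of the file posted above, with the self-closing tag:

```xml
<app_config>
  <app>
    <name>PythonGPU</name>
    <fraction_done_exact/>
    <max_concurrent>3</max_concurrent>
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
```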
____________
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
the line <fraction_done_exact> is not right. that's breaking your file.
it needs to be <fraction_done_exact/>. you're missing the '/' before the close of the tag
OMG, shame on me :-(
Many thanks for your valuable help.
What puzzles me is how this error could happen when copying the file from another host (on which everything works fine).
Of course, it would have helped if the entry in the event log had been a little clearer; it seemed to be referring to something else.
But anyway, the mistake was clearly on my side, and thanks again for your patience :-)
BTW, now 3 Pythons are running concurrently. Still, the load on the Quadro P5000 is moderate, the load on the 2 Xeon E5 is 100% each.
I will have to observe whether it wouldn't make more sense to run only 2 Pythons.
|
|
|
|
Good day, abouh
I still see that unpacking is done by 2-step:
".\7za.exe" x pythongpu_windows_x86_64__cuda1131.txz -y
".\7za.exe" x pythongpu_windows_x86_64__cuda1131.tar -y
Is there any problem with implementing pipelined unpacking process? |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
The app_config.xml code you posted is not valid as proclaimed by the XML validator.
An error has been found!
Click on the error marker to jump to the error. In the document, you can point at the marker with your mouse to see the error message.
Errors in the XML document:
10: 3 The element type "fraction_done_exact" must be terminated by the matching end-tag "</fraction_done_exact>".
XML document:
1 <app_config>
2 <app>
3 <name>PythonGPU</name>
4 <fraction_done_exact>
5 <max_concurrent>3</max_concurrent>
6 <gpu_versions>
7 <gpu_usage>0.5</gpu_usage>
8 <cpu_usage>1.0</cpu_usage>
9 </gpu_versions>
10 </
app>
11 </app_config>
You should always check the syntax of your XML files with a validator.
https://www.xmlvalidation.com/index.php
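As a local alternative to the online validator, Python's standard library can do the same well-formedness check; a small sketch (illustrative only - the names here are not part of any BOINC tooling):

```python
# Check a BOINC app_config.xml for well-formedness using only the
# standard library; a mismatched tag produces the same kind of
# complaint the online validator showed.
import xml.etree.ElementTree as ET

def check_xml(text):
    try:
        ET.fromstring(text)
        return "OK"
    except ET.ParseError as err:
        return f"Error: {err}"

broken = """<app_config>
  <app>
    <name>PythonGPU</name>
    <fraction_done_exact>
  </app>
</app_config>"""

print(check_xml(broken))  # reports a mismatched tag at </app>
```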
|
|
|
|
And you shouldn't have a mid-line break, as shown in line 10. |
|
|
KAMasudSend message
Joined: 27 Jul 11 Posts: 137 Credit: 523,901,354 RAC: 0 Level
Scientific publications
|
We "Boincers" are like cows: if there are no WUs, we move on to greener pastures. Forget about running several WUs on one GPU - give my GPUs something to run. |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
You should always check the syntax of your XML files with a validator.
https://www.xmlvalidation.com/index.php
Thanks, Keith, for the link. To be frank, I didn't know that such a validator exists. |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Been around and published since early Seti days when we all had to do a lot of XML writing for custom app_info's and app_config's |
|
|
|
You can run something like this
cd e:\Program Files\BOINC
e:
:loop
TIMEOUT /T 10
boinccmd.exe --project https://www.gpugrid.net update
TIMEOUT /T 120
goto loop
or write something like that for bash. |
|
|
|
hey abouh,
I've noticed some new task names containing 'demos25_2-0-1'; this differs from the majority of the previous tasks, labelled just 'demos25-0-1'.
can you briefly explain what is different about these tasks? Also, in the past few days (and mostly with these _2 tasks) the majority of the tasks have been either "early ending" or pre-coded to run a smaller number of iterations, leading to very short runtimes (on the order of minutes instead of hours).
Thanks :)
____________
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
I notice a big difference in VRAM use between various Python tasks and/or systems, eg:
- GPU with running 3 tasks simultaneously: 5.250 MB
- GPU with running 2 tasks simultaneously: 5.012 MB
- GPU with running 2 tasks simultaneously: 8.055 MB
with the third one cited above I was lucky, VRAM of the GPU is 8.142 MB
(FYI, all values including a few hundred MB for the monitor).
Does anyone else have the same experience? |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Hello Aleksey,
Yes, I struggled a bit with the single-command solution. A BOINC job requires specifying tasks in the following way:
<task>
<application>XXXXXX.exe</application>
<command_line>XXXXXXXXXXXXX"</command_line>
</task>
And this is the command that should work, right?
7za x "X:\BOINC\projects\www.gpugrid.net\pythongpu_windows_x86_64__cuda1131.txz.1a152f102cdad20f16638f0f269a5a17" -so | 7za x -aoa -si -ttar
Isn't it actually using 7za 2 times? After some testing, the conclusion I arrived at is that in principle it actually requires 2 BOINC tasks to do it, because 7za decompresses .txz to .tar, and then .tar to plain files. The only way to do it in one task would be to compress the files into a format that 7za can decompress in a single call (like zip, but we already discussed that zipped files are too big).
Does anyone know if that reasoning is correct? Can BOINC wrappers execute commands like the one Aleksey suggested?
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Hello, of course, let me explain
Task names "demos25" and "demos25_2" belong to 2 different variants of the same experiment. In particular, the selection of the agents sent to GPUGrid is different.
In both experiments the AI agents sent to GPUGrid learn using Reinforcement Learning, a machine learning technique that allows them to learn specific behaviours from interactions with their simulated environment (actually to make it faster they interact with 32 copies of the environment at the same time, the famous 32 threads). Also in both cases, when the agents "discover" something relevant, the job finishes and the info is sent back to be shared with the rest of the population.
The difference between the "demos25" and "demos25_2" experiments is that in "demos25_2" I am experimenting with a more careful selection of the environment regions each agent is targeted to explore. I try to direct each agent to explore a different region of the environment (or one with little overlap with the rest). The result is that agents in "demos25_2" are more likely to find something relevant that the rest of the population has not found yet, and therefore more likely to finish earlier. The "demos25" experiment, by contrast, uses a more "brute force" approach, and as the population grows it becomes more difficult for new agents to discover new things.
I hope the explanation makes sense. Let me know if you have any other doubts; I will try to answer them as well. There is also an experiment "demos25_3" in progress which is similar to "demos25_2".
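The "32 copies of the environment" setup described above is a standard vectorized-environment pattern in reinforcement learning; a toy sketch of the idea (illustrative classes only, not the project's actual pytorchrl code):

```python
import random

# One policy stepping N independent environment copies per update --
# a toy stand-in for the 32 parallel environments the tasks run.
class ToyEnv:
    def __init__(self, seed):
        self.rng = random.Random(seed)

    def step(self, action):
        # Return a toy (observation, reward) pair.
        return self.rng.random(), self.rng.random()

def collect_step(envs, action=0):
    # Gather one transition from every environment copy.
    return [env.step(action) for env in envs]

N = 32
envs = [ToyEnv(seed=i) for i in range(N)]
batch = collect_step(envs)
print(len(batch))  # 32 transitions gathered in one pass
```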
____________
|
|
|
|
Each task patches several DLLs to disable ASLR and make .nv_fatb sections read-only, and leaves 1.93 GB of backup files.
05.01.2022 10:28 70 403 584 cudnn_ops_train64_8.dll_bak
05.01.2022 10:23 88 405 504 cudnn_ops_infer64_8.dll_bak
03.08.2022 04:04 1 329 664 torch_cuda_cpp.dll_bak
05.01.2022 11:21 81 487 360 cudnn_cnn_train64_8.dll_bak
05.01.2022 10:36 129 872 896 cudnn_adv_infer64_8.dll_bak
05.01.2022 10:46 97 293 824 cudnn_adv_train64_8.dll_bak
03.08.2022 05:05 871 934 464 torch_cuda_cu.dll_bak
05.01.2022 11:15 736 718 848 cudnn_cnn_infer64_8.dll_bak
Can the patched DLLs be included in pythongpu_windows_x86_64__cuda1131.txz? |
|
|
|
I notice a big difference in VRAM use between various Python tasks and/or systems, eg:
- GPU with running 3 tasks simultaneously: 5.250 MB
- GPU with running 2 tasks simultaneously: 5.012 MB
- GPU with running 2 tasks simultaneously: 8.055 MB
with the third one cited above I was lucky, VRAM of the GPU is 8.142 MB
(FYI, all values including a few hundred MB for the monitor).
Does anyone else have the same experience?
more powerful GPUs will use more VRAM than less powerful GPUs, it scales roughly with core count of the GPU. so a 3090 would use more VRAM than say a 1050Ti on the same exact task. it's just the way it works when the GPU sets up the task, if the task has to scale to 10,000 cores instead of 2,000, it needs to use more memory.
____________
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
more powerful GPUs will use more VRAM than less powerful GPUs, it scales roughly with core count of the GPU.
okay, I see. Many thanks for explaining :-)
One thing here that's a pity is that the GPU with the largest VRAM (Quadro P5000: 16GB) has the lowest number of cores (2,560) :-(
But, as so many times: one cannot have everything in life :-)
|
|
|
|
Is here anyone with NVIDIA A100 80GB? |
|
|
|
Is here anyone with NVIDIA A100 80GB?
only those with $10,000 to spare to use for free on DC. so likely no one ;) lol
faster GPUs don't provide much benefit for these tasks since they are so CPU bound. sure there's a lot of VRAM on this card, and maybe you could theoretically spin up 10-15 tasks on a single card, but unless you have A LOT of CPU power and bandwidth to feed it, you're gonna hit another bottleneck before you can hope to benefit from running that many tasks.
just 6x tasks maxes out my EPYC 7443P 48 threads @ 3.9GHz.
maybe in the future the project can get these tasks to the point where they lean more on the GPU tensor cores and a more GPU only environment, but for now it's mostly a CPU environment with a small contribution by the GPU.
____________
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
just wanted to download another Python task, but BOINC event log tells me the following:
13.10.2022 07:49:38 | GPUGRID | Nachricht vom Server: Python apps for GPU hosts needs 1296.10MB more disk space. You currently have 32082.50 MB available and it needs 33378.60 MB.
I wonder why a Python needs 33,378 MB of free disk space.
Experience has shown that a Python takes some 8 GB disk space when being processed. So how come it says it needs 33 GB? |
|
|
|
Experience has shown that a Python takes some 8 GB disk space when being processed. So how come it says it needs 33 GB?
Check my previous post about space usage at PythonGPU startup stage.
Previously: tar.gz >> slotX (2.66 GiB) >> tar (5.48 GiB) >> app files (~8.13 GiB) = 16.27 GiB (since the archives (tar.gz & tar) were not deleted).
Now, after implementation of some improvements, peak consumption is about 13.61 GiB, and then (after the startup stage) ~8.13 GiB.
In any case, it seems to require adjustment. |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
In any case, it seems to require adjustment.
I agree |
|
|
|
Isn't it actually using 7za 2 times? After some testing, the conclusion I arrived at is that in principle it actually requires 2 BOINC tasks to do it
Yeah, it seems you are right.
Try use this:
<task>
<application>C:\Windows\System32\cmd.exe</application>
<command_line>/C ".\7za.exe x pythongpu_windows_x86_64__cuda1131.txz -so | .\7za.exe x -aoa -si -ttar"</command_line>
</task>
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Patching seemed to be required to run as many threads with pytorchrl as these jobs do; otherwise Windows used a lot of memory for every new thread. The script that does the patching is relatively fast, so doing it locally would not save a lot of time.
However, are you saying that after the patching some files could be deleted to further optimise memory use? If this is the case, I can look into it. These .dll_bak files? I am not very used to Windows...
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Does anyone know if these requirements are estimated by BOINC and adjusted over time, like completion time, or if manual adjustment is required?
____________
|
|
|
|
my runtime estimates have come down to basically reasonable and real levels now. so i think it will adjust on its own over time.
____________
|
|
|
|
abouh's message 59454 was in response to a question about disk storage requirements. No, they won't adjust themselves over time: the amount of disk space required by the task is set by the server, and the amount available to the client is calculated from readings taken of the current state of the host computer. They will only change if the user adjusts the hardware or BOINC client options, or the project staff adjust the job specifications passed to the workunit generator.
On the subject of runtimes: the (calculated) runtime estimation relies on just three things:
The job speed (sent by the server in the <app_version> specification).
The job size (again set on the server)
and the Duration Correction Factor (dynamically adjusted by the client)
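The three factors above combine multiplicatively; a hedged sketch of the calculation (illustrative names, not BOINC's actual field names):

```python
# Estimated runtime = job size / app speed, scaled by the host's
# duration correction factor (DCF).
def estimated_runtime(size_fpops, speed_flops, dcf):
    return size_fpops / speed_flops * dcf

# If SPEED halves while SIZE is unchanged, the raw estimate doubles;
# a DCF falling below 1 pulls the estimate back down.
print(estimated_runtime(1e15, 1e10, 1.0))    # 100000.0 seconds
print(estimated_runtime(1e15, 0.5e10, 0.9))  # 180000.0 seconds
```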
SPEED seems to have fallen by approaching half over the last month, but I haven't currently got a job I can verify that with.
SIZE has remained the same while I've been monitoring it.
DCF will have fallen dramatically - mine is now below 1 |
|
|
|
What can this output mean?
e00003a00008-ABOU_rnd_ppod_expand_demos25_9-0-1-RND2053
Update 464, num samples collected 118784, FPS 344
Algorithm: loss 0.1224, value_loss 0.0002, ivalue_loss 0.0113, rnd_loss 0.0307, action_loss 0.0846, entropy_loss 0.0043, mean_intrinsic_rewards 0.0421, min_intrinsic_rewards 0.0084, max_intrinsic_rewards 0.1857, mean_embed_dist 0.0000, max_embed_dist 0.0000, min_embed_dist 0.0000, min_external_reward 0.0000
Episodes: TrainReward 0.0000, l 360.6000, t 649.8340, UnclippedReward 0.0000, VisitedRooms 1.0000
REWARD DEMOS 25, INTRINSIC DEMOS 25, RHO 0.05, PHI 0.05, REWARD THRESHOLD 0.0, MAX DEMO REWARD -inf, INTRINSIC THRESHOLD 1000
FRAMES TO AVOID: 0
Update 465, num samples collected 122880, FPS 347
Algorithm: loss 0.1329, value_loss 0.0002, ivalue_loss 0.0098, rnd_loss 0.0317, action_loss 0.0955, entropy_loss 0.0043, mean_intrinsic_rewards 0.0414, min_intrinsic_rewards 0.0082, max_intrinsic_rewards 0.1516, mean_embed_dist 0.0000, max_embed_dist 0.0000, min_embed_dist 0.0000, min_external_reward 0.0000
Episodes: TrainReward 0.0000, l 341.3529, t 658.7952, UnclippedReward 0.0000, VisitedRooms 1.00000 |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Nothing of any meaning or consequence for you. Pertinent only to the researcher. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
These are just the logs of the algorithm, printing out the relevant metrics during agent training.
____________
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
I have now had 5 tasks in a row which failed after some 2,100 secs, one after the other, within about half an hour.
https://www.gpugrid.net/result.php?resultid=33098926
https://www.gpugrid.net/result.php?resultid=33100629
https://www.gpugrid.net/result.php?resultid=33100675
https://www.gpugrid.net/result.php?resultid=33100715
https://www.gpugrid.net/result.php?resultid=33100745
anyone any idea what is the problem?
On the same host, another task has been running for 22 hours now, but I have stopped download of new tasks until it's clear what's going on. |
|
|
|
I have seen continuously failing tasks starting today. According to the stderr_txt file, I reckon there might be at least two, possibly related, errors.
File "C:\ProgramData\BOINC\slots\5\python_dependencies\buffer.py", line 794, in insert_transition
state_embeds = [i["StateEmbeddings"] for i in sample[prl.INFO]]
File "C:\ProgramData\BOINC\slots\5\python_dependencies\buffer.py", line 794, in <listcomp>
state_embeds = [i["StateEmbeddings"] for i in sample[prl.INFO]]
KeyError: 'StateEmbeddings'
Traceback (most recent call last):
File "C:\ProgramData\BOINC\slots\5\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 196, in get_data
self.next_batch = self.batches.__next__()
AttributeError: 'GWorker' object has no attribute 'batches'
- KeyError: 'StateEmbeddings'
- AttributeError: 'GWorker' object has no attribute 'batches'
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
*KeyError: 'StateEmbeddings'
*AttributeError: 'GWorker' object has no attribute 'batches'
exactly same thing I notice on all my failed tasks. |
|
|
GSSend message
Joined: 16 Oct 22 Posts: 12 Credit: 1,382,500 RAC: 0 Level
Scientific publications
|
Same here.
AttributeError: 'GWorker' object has no attribute 'batches' |
|
|
mrchipsSend message
Joined: 9 May 21 Posts: 16 Credit: 1,381,080,500 RAC: 2,667,452 Level
Scientific publications
|
my latest WU end with a computation error
____________
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Your first task link shows 4 attempts at retrieving the necessary python libraries and failing.
But instead of just stopping right there, it looks like it tried to compute anyway with the missing 'batches' attribute, and all the subsequent tasks failed also because of the missing batches element.
Seems that the error flow is not branching out to a proper halt early enough in the task to stop the computation and avoid wasting any more time. |
|
|
KAMasudSend message
Joined: 27 Jul 11 Posts: 137 Credit: 523,901,354 RAC: 0 Level
Scientific publications
|
Six tasks, all in a row. Errored out. Seven now and another in the works. |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
now the same problem on another host :-(
https://www.gpugrid.net/result.php?resultid=33101249
so, as seen by other members too: all tasks which were downloaded within the past several hours seem to be faulty. |
|
|
GSSend message
Joined: 16 Oct 22 Posts: 12 Credit: 1,382,500 RAC: 0 Level
Scientific publications
|
I joined yesterday and have had 13 tasks fail in a row, all with the
AttributeError: 'GWorker' object has no attribute 'batches'.
Is this a failed installation? Should I try to reinstall this BOINC project from scratch?
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
Is this a failed installation? Should I try to reinstall this BOINC project from scratch?
in view of the above, the current tasks are probably faulty.
No need to reinstall, I guess
|
|
|
|
Yes - just received and returned result 33101290, on a machine which regularly returns good results.
That was replication _6 of a WU which everyone else had failed - a sure sign that the problem was with the workunit, not the host processing it. |
|
|
KAMasudSend message
Joined: 27 Jul 11 Posts: 137 Credit: 523,901,354 RAC: 0 Level
Scientific publications
|
Forty-six failed WUs? Please stop sending them until the problem is resolved. |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
Forty-six failed WUs? Please stop sending them until the problem is resolved.
+ 1 |
|
|
KAMasudSend message
Joined: 27 Jul 11 Posts: 137 Credit: 523,901,354 RAC: 0 Level
Scientific publications
|
Forty-six failed WUs? Please stop sending them until the problem is resolved.
+ 1
Sorry. After writing the post I looked at the other computer, and it had downloaded another. It lasted three minutes or so; it was still in the unzipping process. I cannot understand the txt files, so can someone who can check the files see what is going on? |
|
|
GSSend message
Joined: 16 Oct 22 Posts: 12 Credit: 1,382,500 RAC: 0 Level
Scientific publications
|
+1
33 fails in a row. I'll set this project to inactive and wait for a solution. |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
I cannot understand the txt files so can someone who can check the files to see what is going on?
the tasks are wrongly configured. Don't download them for the time being.
I guess we will get some kind of "go ahead" here once the problem is solved on the project-side.
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Hello, thank you for reporting the job errors. Sorry to all, there was an error on my side setting up a batch of experiment agents. The errors are due to the specific Python script of this batch, not related to the application itself. I have just fixed it, and the new jobs should be running correctly. Unfortunately, some already submitted jobs are bound to fail… I apologise for the inconvenience. They will fail shortly after starting, as reported, so not a lot of compute will be wasted.
____________
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
abouh,
could you also please make an adjustment (downwards) to the free disk space requirement of 33GB when downloading a Python task?
see my above Message 59449.
Many thanks :-) |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Hello! I have checked and the disk space used by the jobs is set to 35e9 bytes.
<rsc_disk_bound>35e9</rsc_disk_bound>
I will change it first to 20e9; let me know if it helps. I can further decrease it in the future if necessary.
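For reference, the 35e9-byte bound appears to be exactly where the earlier "needs 33378.60 MB" server message came from, assuming BOINC reports binary megabytes (MiB) while labelling them "MB"; a quick check:

```python
# Convert the server-side disk bound to the units BOINC prints.
rsc_disk_bound = 35e9  # bytes, per <rsc_disk_bound>
print(f"{rsc_disk_bound / 2**20:.2f} MB")  # 33378.60 MB
```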
____________
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
Hello! I have checked and the disk space allowed for the jobs is set to 35e9 bytes:
<rsc_disk_bound>35e9</rsc_disk_bound>
I will change it to 20e9 for now; let me know if it helps. I can decrease it further in the future if necessary.
Thanks, Abouh, for your quick reaction. The change will definitely help - at least in my case with limited disk space due to Ramdisk.
|
|
|
GSSend message
Joined: 16 Oct 22 Posts: 12 Credit: 1,382,500 RAC: 0 Level
Scientific publications
|
Hello, thank you for reporting the job errors. Sorry to all, there was an error ... I have just fixed it, and the new jobs should be running correctly. Unfortunately, some already submitted jobs are bound to fail…
The problem is not fixed, I still get tasks that fail:
AttributeError: 'GWorker' object has no attribute 'batches'
|
|
|
|
I have received my first new working task. |
|
|
KAMasudSend message
Joined: 27 Jul 11 Posts: 137 Credit: 523,901,354 RAC: 0 Level
Scientific publications
|
I wish I could get a sniff also. |
|
|
GSSend message
Joined: 16 Oct 22 Posts: 12 Credit: 1,382,500 RAC: 0 Level
Scientific publications
|
I got another one this morning; still no luck, the task failed like all the others before. Is there something I have to change on my side?
This is the log file:
<core_client_version>7.20.2</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
09:56:38 (11564): wrapper (7.9.26016): starting
09:56:38 (11564): wrapper: running .\7za.exe (x pythongpu_windows_x86_64__cuda1131.txz -y)
7-Zip (a) 22.01 (x86) : Copyright (c) 1999-2022 Igor Pavlov : 2022-07-15
Scanning the drive for archives:
1 file, 1976180228 bytes (1885 MiB)
Extracting archive: pythongpu_windows_x86_64__cuda1131.txz
--
Path = pythongpu_windows_x86_64__cuda1131.txz
Type = xz
Physical Size = 1976180228
Method = LZMA2:22 CRC64
Streams = 1523
Blocks = 1523
Cluster Size = 4210688
Everything is Ok
Size: 6410311680
Compressed: 1976180228
09:58:33 (11564): .\7za.exe exited; CPU time 111.125000
09:58:33 (11564): wrapper: running C:\Windows\system32\cmd.exe (/C "del pythongpu_windows_x86_64__cuda1131.txz")
09:58:34 (11564): C:\Windows\system32\cmd.exe exited; CPU time 0.000000
09:58:34 (11564): wrapper: running .\7za.exe (x pythongpu_windows_x86_64__cuda1131.tar -y)
7-Zip (a) 22.01 (x86) : Copyright (c) 1999-2022 Igor Pavlov : 2022-07-15
Scanning the drive for archives:
1 file, 6410311680 bytes (6114 MiB)
Extracting archive: pythongpu_windows_x86_64__cuda1131.tar
--
Path = pythongpu_windows_x86_64__cuda1131.tar
Type = tar
Physical Size = 6410311680
Headers Size = 19965952
Code Page = UTF-8
Characteristics = GNU LongName ASCII
Everything is Ok
Files: 38141
Size: 6380353601
Compressed: 6410311680
10:01:10 (11564): .\7za.exe exited; CPU time 41.140625
10:01:10 (11564): wrapper: running C:\Windows\system32\cmd.exe (/C "del pythongpu_windows_x86_64__cuda1131.tar")
10:01:11 (11564): C:\Windows\system32\cmd.exe exited; CPU time 0.000000
10:01:11 (11564): wrapper: running python.exe (run.py)
Starting!!
Windows fix!!
Define rollouts storage
Define scheme
Created CWorker with worker_index 0
Created GWorker with worker_index 0
Created UWorker with worker_index 0
Created training scheme.
Define learner
Created Learner.
Look for a progress_last_chk file - if exists, adjust target_env_steps
Define train loop
Traceback (most recent call last):
File "C:\ProgramData\BOINC\slots\4\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 196, in get_data
self.next_batch = self.batches.__next__()
AttributeError: 'GWorker' object has no attribute 'batches'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "run.py", line 475, in <module>
main()
File "run.py", line 131, in main
learner.step()
File "C:\ProgramData\BOINC\slots\4\lib\site-packages\pytorchrl\learner.py", line 46, in step
info = self.update_worker.step()
File "C:\ProgramData\BOINC\slots\4\lib\site-packages\pytorchrl\scheme\updates\u_worker.py", line 118, in step
self.updater.step()
File "C:\ProgramData\BOINC\slots\4\lib\site-packages\pytorchrl\scheme\updates\u_worker.py", line 259, in step
grads = self.local_worker.step(self.decentralized_update_execution)
File "C:\ProgramData\BOINC\slots\4\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 178, in step
self.get_data()
File "C:\ProgramData\BOINC\slots\4\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 211, in get_data
self.collector.step()
File "C:\ProgramData\BOINC\slots\4\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 490, in step
rollouts = self.local_worker.collect_data(listen_to=["sync"], data_to_cpu=False)
File "C:\ProgramData\BOINC\slots\4\lib\site-packages\pytorchrl\scheme\collection\c_worker.py", line 168, in collect_data
train_info = self.collect_train_data(listen_to=listen_to)
File "C:\ProgramData\BOINC\slots\4\lib\site-packages\pytorchrl\scheme\collection\c_worker.py", line 251, in collect_train_data
self.storage.insert_transition(transition)
File "C:\ProgramData\BOINC\slots\4\python_dependencies\buffer.py", line 794, in insert_transition
state_embeds = [i["StateEmbeddings"] for i in sample[prl.INFO]]
File "C:\ProgramData\BOINC\slots\4\python_dependencies\buffer.py", line 794, in <listcomp>
state_embeds = [i["StateEmbeddings"] for i in sample[prl.INFO]]
KeyError: 'StateEmbeddings'
Traceback (most recent call last):
File "C:\ProgramData\BOINC\slots\4\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 196, in get_data
self.next_batch = self.batches.__next__()
AttributeError: 'GWorker' object has no attribute 'batches'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "run.py", line 475, in <module>
main()
File "run.py", line 131, in main
learner.step()
File "C:\ProgramData\BOINC\slots\4\lib\site-packages\pytorchrl\learner.py", line 46, in step
info = self.update_worker.step()
File "C:\ProgramData\BOINC\slots\4\lib\site-packages\pytorchrl\scheme\updates\u_worker.py", line 118, in step
self.updater.step()
File "C:\ProgramData\BOINC\slots\4\lib\site-packages\pytorchrl\scheme\updates\u_worker.py", line 259, in step
grads = self.local_worker.step(self.decentralized_update_execution)
File "C:\ProgramData\BOINC\slots\4\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 178, in step
self.get_data()
File "C:\ProgramData\BOINC\slots\4\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 211, in get_data
self.collector.step()
File "C:\ProgramData\BOINC\slots\4\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 490, in step
rollouts = self.local_worker.collect_data(listen_to=["sync"], data_to_cpu=False)
File "C:\ProgramData\BOINC\slots\4\lib\site-packages\pytorchrl\scheme\collection\c_worker.py", line 168, in collect_data
train_info = self.collect_train_data(listen_to=listen_to)
File "C:\ProgramData\BOINC\slots\4\lib\site-packages\pytorchrl\scheme\collection\c_worker.py", line 251, in collect_train_data
self.storage.insert_transition(transition)
File "C:\ProgramData\BOINC\slots\4\python_dependencies\buffer.py", line 794, in insert_transition
state_embeds = [i["StateEmbeddings"] for i in sample[prl.INFO]]
File "C:\ProgramData\BOINC\slots\4\python_dependencies\buffer.py", line 794, in <listcomp>
state_embeds = [i["StateEmbeddings"] for i in sample[prl.INFO]]
KeyError: 'StateEmbeddings'
10:05:44 (11564): python.exe exited; CPU time 2660.984375
10:05:44 (11564): app exit status: 0x1
10:05:44 (11564): called boinc_finish(195)
0 bytes in 0 Free Blocks.
442 bytes in 9 Normal Blocks.
1144 bytes in 1 CRT Blocks.
0 bytes in 0 Ignore Blocks.
0 bytes in 0 Client Blocks.
Largest number used: 0 bytes.
Total allocations: 6550134 bytes.
Dumping objects ->
{10837} normal block at 0x0000024DEACAF4D0, 48 bytes long.
Data: <PSI_SCRATCH=C:\P> 50 53 49 5F 53 43 52 41 54 43 48 3D 43 3A 5C 50
{10796} normal block at 0x0000024DEACAF310, 48 bytes long.
Data: <HOMEPATH=C:\Prog> 48 4F 4D 45 50 41 54 48 3D 43 3A 5C 50 72 6F 67
{10785} normal block at 0x0000024DEACAEA50, 48 bytes long.
Data: <HOME=C:\ProgramD> 48 4F 4D 45 3D 43 3A 5C 50 72 6F 67 72 61 6D 44
{10774} normal block at 0x0000024DEACAEF20, 48 bytes long.
Data: <TMP=C:\ProgramDa> 54 4D 50 3D 43 3A 5C 50 72 6F 67 72 61 6D 44 61
{10763} normal block at 0x0000024DEACAEB30, 48 bytes long.
Data: <TEMP=C:\ProgramD> 54 45 4D 50 3D 43 3A 5C 50 72 6F 67 72 61 6D 44
{10752} normal block at 0x0000024DEACAF3F0, 48 bytes long.
Data: <TMPDIR=C:\Progra> 54 4D 50 44 49 52 3D 43 3A 5C 50 72 6F 67 72 61
{10671} normal block at 0x0000024DEAC990A0, 85 bytes long.
Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65
..\api\boinc_api.cpp(309) : {10668} normal block at 0x0000024DEACB0A60, 8 bytes long.
Data: < {ìM > 00 00 7B EC 4D 02 00 00
{10030} normal block at 0x0000024DEAC9B890, 85 bytes long.
Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65
{9426} normal block at 0x0000024DEACB0600, 8 bytes long.
Data: < ÇÊêM > 80 C7 CA EA 4D 02 00 00
..\zip\boinc_zip.cpp(122) : {545} normal block at 0x0000024DEACB12E0, 260 bytes long.
Data: < > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
{532} normal block at 0x0000024DEACA99C0, 32 bytes long.
Data: <ÐáÊêM ÀåÊêM > D0 E1 CA EA 4D 02 00 00 C0 E5 CA EA 4D 02 00 00
{531} normal block at 0x0000024DEACAE4E0, 52 bytes long.
Data: < r ÍÍ > 01 00 00 00 72 00 CD CD 00 00 00 00 00 00 00 00
{526} normal block at 0x0000024DEACAE080, 43 bytes long.
Data: < p ÍÍ > 01 00 00 00 70 00 CD CD 00 00 00 00 00 00 00 00
{521} normal block at 0x0000024DEACAE5C0, 44 bytes long.
Data: < ÍÍáåÊêM > 01 00 00 00 00 00 CD CD E1 E5 CA EA 4D 02 00 00
{516} normal block at 0x0000024DEACAE1D0, 44 bytes long.
Data: < ÍÍñáÊêM > 01 00 00 00 00 00 CD CD F1 E1 CA EA 4D 02 00 00
{506} normal block at 0x0000024DEACB39A0, 16 bytes long.
Data: < ãÊêM > 20 E3 CA EA 4D 02 00 00 00 00 00 00 00 00 00 00
{505} normal block at 0x0000024DEACAE320, 40 bytes long.
Data: < 9ËêM input.zi> A0 39 CB EA 4D 02 00 00 69 6E 70 75 74 2E 7A 69
{498} normal block at 0x0000024DEACB3950, 16 bytes long.
Data: <è)ËêM > E8 29 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{497} normal block at 0x0000024DEACB3450, 16 bytes long.
Data: <À)ËêM > C0 29 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{496} normal block at 0x0000024DEACB3770, 16 bytes long.
Data: < )ËêM > 98 29 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{495} normal block at 0x0000024DEACB37C0, 16 bytes long.
Data: <p)ËêM > 70 29 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{494} normal block at 0x0000024DEACB3900, 16 bytes long.
Data: <H)ËêM > 48 29 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{493} normal block at 0x0000024DEACB3A40, 16 bytes long.
Data: < )ËêM > 20 29 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{491} normal block at 0x0000024DEACB35E0, 16 bytes long.
Data: <8úÊêM > 38 FA CA EA 4D 02 00 00 00 00 00 00 00 00 00 00
{490} normal block at 0x0000024DEACAA6E0, 32 bytes long.
Data: <username=Compsci> 75 73 65 72 6E 61 6D 65 3D 43 6F 6D 70 73 63 69
{489} normal block at 0x0000024DEACB2E60, 16 bytes long.
Data: < úÊêM > 10 FA CA EA 4D 02 00 00 00 00 00 00 00 00 00 00
{488} normal block at 0x0000024DEAC9A460, 64 bytes long.
Data: <PYTHONPATH=.\lib> 50 59 54 48 4F 4E 50 41 54 48 3D 2E 5C 6C 69 62
{487} normal block at 0x0000024DEACB31D0, 16 bytes long.
Data: <èùÊêM > E8 F9 CA EA 4D 02 00 00 00 00 00 00 00 00 00 00
{486} normal block at 0x0000024DEACAA3E0, 32 bytes long.
Data: <PATH=.\Library\b> 50 41 54 48 3D 2E 5C 4C 69 62 72 61 72 79 5C 62
{485} normal block at 0x0000024DEACB3180, 16 bytes long.
Data: <ÀùÊêM > C0 F9 CA EA 4D 02 00 00 00 00 00 00 00 00 00 00
{484} normal block at 0x0000024DEACB3A90, 16 bytes long.
Data: < ùÊêM > 98 F9 CA EA 4D 02 00 00 00 00 00 00 00 00 00 00
{483} normal block at 0x0000024DEACB2DC0, 16 bytes long.
Data: <pùÊêM > 70 F9 CA EA 4D 02 00 00 00 00 00 00 00 00 00 00
{482} normal block at 0x0000024DEACB3720, 16 bytes long.
Data: <HùÊêM > 48 F9 CA EA 4D 02 00 00 00 00 00 00 00 00 00 00
{481} normal block at 0x0000024DEACB3040, 16 bytes long.
Data: < ùÊêM > 20 F9 CA EA 4D 02 00 00 00 00 00 00 00 00 00 00
{480} normal block at 0x0000024DEACB36D0, 16 bytes long.
Data: <øøÊêM > F8 F8 CA EA 4D 02 00 00 00 00 00 00 00 00 00 00
{479} normal block at 0x0000024DEACA9DE0, 32 bytes long.
Data: <SystemRoot=C:\Wi> 53 79 73 74 65 6D 52 6F 6F 74 3D 43 3A 5C 57 69
{478} normal block at 0x0000024DEACB3C70, 16 bytes long.
Data: <ÐøÊêM > D0 F8 CA EA 4D 02 00 00 00 00 00 00 00 00 00 00
{477} normal block at 0x0000024DEACA9F00, 32 bytes long.
Data: <GPU_DEVICE_NUM=0> 47 50 55 5F 44 45 56 49 43 45 5F 4E 55 4D 3D 30
{476} normal block at 0x0000024DEACB39F0, 16 bytes long.
Data: <¨øÊêM > A8 F8 CA EA 4D 02 00 00 00 00 00 00 00 00 00 00
{475} normal block at 0x0000024DEACAA2C0, 32 bytes long.
Data: <NTHREADS=1 THREA> 4E 54 48 52 45 41 44 53 3D 31 00 54 48 52 45 41
{474} normal block at 0x0000024DEACB3B80, 16 bytes long.
Data: < øÊêM > 80 F8 CA EA 4D 02 00 00 00 00 00 00 00 00 00 00
{473} normal block at 0x0000024DEACAF880, 480 bytes long.
Data: < ;ËêM À¢ÊêM > 80 3B CB EA 4D 02 00 00 C0 A2 CA EA 4D 02 00 00
{472} normal block at 0x0000024DEACB3AE0, 16 bytes long.
Data: < )ËêM > 00 29 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{471} normal block at 0x0000024DEACB3310, 16 bytes long.
Data: <Ø(ËêM > D8 28 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{470} normal block at 0x0000024DEACB3590, 16 bytes long.
Data: <°(ËêM > B0 28 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{469} normal block at 0x0000024DEACAE160, 48 bytes long.
Data: </C "del pythongp> 2F 43 20 22 64 65 6C 20 70 79 74 68 6F 6E 67 70
{468} normal block at 0x0000024DEACB3630, 16 bytes long.
Data: <ø'ËêM > F8 27 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{467} normal block at 0x0000024DEACB2FF0, 16 bytes long.
Data: <Ð'ËêM > D0 27 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{466} normal block at 0x0000024DEACB3B30, 16 bytes long.
Data: <¨'ËêM > A8 27 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{465} normal block at 0x0000024DEACB3400, 16 bytes long.
Data: < 'ËêM > 80 27 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{464} normal block at 0x0000024DEACB34F0, 16 bytes long.
Data: <X'ËêM > 58 27 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{463} normal block at 0x0000024DEACB38B0, 16 bytes long.
Data: <0'ËêM > 30 27 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{462} normal block at 0x0000024DEACB3220, 16 bytes long.
Data: < 'ËêM > 10 27 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{461} normal block at 0x0000024DEACB32C0, 16 bytes long.
Data: <è&ËêM > E8 26 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{460} normal block at 0x0000024DEACAA500, 32 bytes long.
Data: <C:\Windows\syste> 43 3A 5C 57 69 6E 64 6F 77 73 5C 73 79 73 74 65
{459} normal block at 0x0000024DEACB3130, 16 bytes long.
Data: <À&ËêM > C0 26 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{458} normal block at 0x0000024DEACAE0F0, 48 bytes long.
Data: <x pythongpu_wind> 78 20 70 79 74 68 6F 6E 67 70 75 5F 77 69 6E 64
{457} normal block at 0x0000024DEACB3270, 16 bytes long.
Data: < &ËêM > 08 26 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{456} normal block at 0x0000024DEACB3BD0, 16 bytes long.
Data: <à%ËêM > E0 25 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{455} normal block at 0x0000024DEACB3860, 16 bytes long.
Data: <¸%ËêM > B8 25 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{454} normal block at 0x0000024DEACB3540, 16 bytes long.
Data: < %ËêM > 90 25 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{453} normal block at 0x0000024DEACB2D20, 16 bytes long.
Data: <h%ËêM > 68 25 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{452} normal block at 0x0000024DEACB2F50, 16 bytes long.
Data: <@%ËêM > 40 25 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{451} normal block at 0x0000024DEACB2FA0, 16 bytes long.
Data: < %ËêM > 20 25 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{450} normal block at 0x0000024DEACB3680, 16 bytes long.
Data: <ø$ËêM > F8 24 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{449} normal block at 0x0000024DEACB3810, 16 bytes long.
Data: <Ð$ËêM > D0 24 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{448} normal block at 0x0000024DEACAE780, 48 bytes long.
Data: </C "del pythongp> 2F 43 20 22 64 65 6C 20 70 79 74 68 6F 6E 67 70
{447} normal block at 0x0000024DEACB2E10, 16 bytes long.
Data: < $ËêM > 18 24 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{446} normal block at 0x0000024DEACB2F00, 16 bytes long.
Data: <ð#ËêM > F0 23 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{445} normal block at 0x0000024DEACB2D70, 16 bytes long.
Data: <È#ËêM > C8 23 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{444} normal block at 0x0000024DEACB33B0, 16 bytes long.
Data: < #ËêM > A0 23 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{443} normal block at 0x0000024DEACB3360, 16 bytes long.
Data: <x#ËêM > 78 23 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{442} normal block at 0x0000024DEACB34A0, 16 bytes long.
Data: <P#ËêM > 50 23 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{441} normal block at 0x0000024DEACB04C0, 16 bytes long.
Data: <0#ËêM > 30 23 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{440} normal block at 0x0000024DEACB08D0, 16 bytes long.
Data: < #ËêM > 08 23 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{439} normal block at 0x0000024DEACAA380, 32 bytes long.
Data: <C:\Windows\syste> 43 3A 5C 57 69 6E 64 6F 77 73 5C 73 79 73 74 65
{438} normal block at 0x0000024DEACB02E0, 16 bytes long.
Data: <à"ËêM > E0 22 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{437} normal block at 0x0000024DEACAE710, 48 bytes long.
Data: <x pythongpu_wind> 78 20 70 79 74 68 6F 6E 67 70 75 5F 77 69 6E 64
{436} normal block at 0x0000024DEACB0010, 16 bytes long.
Data: <("ËêM > 28 22 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{435} normal block at 0x0000024DEACAFF20, 16 bytes long.
Data: < "ËêM > 00 22 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{434} normal block at 0x0000024DEACB0880, 16 bytes long.
Data: <Ø!ËêM > D8 21 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{433} normal block at 0x0000024DEACB01A0, 16 bytes long.
Data: <°!ËêM > B0 21 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{432} normal block at 0x0000024DEACB0970, 16 bytes long.
Data: < !ËêM > 88 21 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{431} normal block at 0x0000024DEACB0150, 16 bytes long.
Data: <`!ËêM > 60 21 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{430} normal block at 0x0000024DEACB0E70, 16 bytes long.
Data: <@!ËêM > 40 21 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{429} normal block at 0x0000024DEACB06A0, 16 bytes long.
Data: < !ËêM > 18 21 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{428} normal block at 0x0000024DEACB0E20, 16 bytes long.
Data: <ð ËêM > F0 20 CB EA 4D 02 00 00 00 00 00 00 00 00 00 00
{427} normal block at 0x0000024DEACB20F0, 2976 bytes long.
Data: < ËêM .\7za.ex> 20 0E CB EA 4D 02 00 00 2E 5C 37 7A 61 2E 65 78
{66} normal block at 0x0000024DEACA3AB0, 16 bytes long.
Data: < 껤ö > 80 EA BB A4 F6 7F 00 00 00 00 00 00 00 00 00 00
{65} normal block at 0x0000024DEACA42D0, 16 bytes long.
Data: <@黤ö > 40 E9 BB A4 F6 7F 00 00 00 00 00 00 00 00 00 00
{64} normal block at 0x0000024DEACA3B50, 16 bytes long.
Data: <øW¸¤ö > F8 57 B8 A4 F6 7F 00 00 00 00 00 00 00 00 00 00
{63} normal block at 0x0000024DEACA4460, 16 bytes long.
Data: <ØW¸¤ö > D8 57 B8 A4 F6 7F 00 00 00 00 00 00 00 00 00 00
{62} normal block at 0x0000024DEACA46E0, 16 bytes long.
Data: <P ¸¤ö > 50 04 B8 A4 F6 7F 00 00 00 00 00 00 00 00 00 00
{61} normal block at 0x0000024DEACA4280, 16 bytes long.
Data: <0 ¸¤ö > 30 04 B8 A4 F6 7F 00 00 00 00 00 00 00 00 00 00
{60} normal block at 0x0000024DEACA3A60, 16 bytes long.
Data: <à ¸¤ö > E0 02 B8 A4 F6 7F 00 00 00 00 00 00 00 00 00 00
{59} normal block at 0x0000024DEACA4140, 16 bytes long.
Data: < ¸¤ö > 10 04 B8 A4 F6 7F 00 00 00 00 00 00 00 00 00 00
{58} normal block at 0x0000024DEACA3CE0, 16 bytes long.
Data: <p ¸¤ö > 70 04 B8 A4 F6 7F 00 00 00 00 00 00 00 00 00 00
{57} normal block at 0x0000024DEACA4690, 16 bytes long.
Data: < À¶¤ö > 18 C0 B6 A4 F6 7F 00 00 00 00 00 00 00 00 00 00
Object dump complete.
</stderr_txt>
]]> |
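The two stack traces in that log boil down to the same failure in buffer.py: the batch script assumed every environment info dict carries a "StateEmbeddings" entry, and the tasks from the old batch don't provide it, so the list comprehension raises KeyError and the wrapper exits with code 195. A minimal sketch of the failure mode — the `sample` dict here is purely illustrative (the real object comes from pytorchrl's rollout storage), and the defensive variant at the end is a hypothetical fix, not the project's actual patch:

```python
# Stand-in for the rollout sample from pytorchrl's storage; deliberately
# missing the "StateEmbeddings" key, as in the old batch's environments.
sample = {"Info": [{"Reward": 1.0}, {"Reward": 0.5}]}

# This mirrors buffer.py line 794 and raises KeyError: 'StateEmbeddings'.
try:
    state_embeds = [i["StateEmbeddings"] for i in sample["Info"]]
except KeyError as exc:
    failed_key = str(exc)  # the missing key, quoted: "'StateEmbeddings'"

# Hypothetical defensive variant: tolerate missing keys instead of crashing.
state_embeds = [i.get("StateEmbeddings") for i in sample["Info"]]
```

This matches the observed behaviour: the task gets through unpacking and setup, then dies on the first transition that lacks the key.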
|
|
|
An example of that: workunit 27329338 has failed for everyone, mine after about 10%. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
I am sorry, old batch jobs are still being mixed with new ones that do run successfully (I have been monitoring them). BOINC will eventually run out of bad jobs; the problem is that it attempts to run each of them 8 times...
____________
|
|
|
|
the problem is that it attempts to run them 8 times...
Look at that last workunit link. Above the list, it says:
max # of error/total/success tasks 7, 10, 6
That's configurable by the project, I think at the application level. You might be able to reduce it a bit? |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Yesterday I was unable to find the specific parameter that defines the number of job attempts. I will ask the main admin. Maybe it is set for all applications.
____________
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
From looking at the server code in the create_work.cpp module, the parameter is pulled from the workunit template file.
You need to change the input file (infile1, infile2, ...) that feeds into the WU template file, or directly change the WU template file.
Refer to these documents.
https://boinc.berkeley.edu/trac/wiki/JobSubmission
https://boinc.berkeley.edu/trac/wiki/JobTemplates#Inputtemplates |
|
|
|
Found some documentation: in https://boinc.berkeley.edu/trac/wiki/JobSubmission
The following job parameters may be passed in the input template, or as command-line arguments to create_work; the input template has precedence. If not specified, the given defaults will be used.
--target_nresults x
default 2
--max_error_results x
default 3
--max_total_results x
default 10
--max_success_results x
default 6
I can't find any similar detail for Local web-based job submission or Remote job submission, but it must be buried somewhere in there. You're not using the stated default values, so somebody at GPUGrid must have found it at least once! |
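Putting the two posts together: per the linked docs, these limits can be given either as `create_work` flags or as elements of the workunit input template (the template takes precedence). A sketch of the template route — element names follow the BOINC job-template documentation, and the values here are only illustrative, not a recommendation:

```xml
<input_template>
    <workunit>
        <!-- stop resending after 3 errored results -->
        <max_error_results>3</max_error_results>
        <!-- hard cap on results created for this workunit -->
        <max_total_results>10</max_total_results>
        <!-- stop once this many results succeed -->
        <max_success_results>6</max_success_results>
    </workunit>
</input_template>
```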
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
Abouh wrote: Hello, thanks you for reporting the job errors. Sorry to all, there was an error on my side setting up a batch of experiment agents. ... They will fail briefly after starting as reported, so not a lot of compute will be wasted.
well, whatever "they will fail briefly after starting" means :-)
Mine are failing after 3.780 - 8.597 seconds :-(
Is there no way to call them back or delete them from the server?
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
I see these values can be set in the app workunit template as mentioned
--max_error_results x
default 3
--max_total_results x
default 10
--max_success_results x
default 6
I have checked, and for the PythonGPU apps the parameters are not specified, so the default values should apply (which is also consistent with the info previously posted).
However, the number of times the server attempts to solve a task by sending it to a GPUGrid machine before giving up is 8. So it does not seem to be controlled by these parameters (shouldn't it be 3, according to the default value?).
I have asked the server admin for help; maybe the parameters are overridden somewhere else. Even if it does not help this time, it will be useful to know how to solve future issues like this one. Sorry again for the problems.
____________
|
|
|
KAMasudSend message
Joined: 27 Jul 11 Posts: 137 Credit: 523,901,354 RAC: 0 Level
Scientific publications
|
Abouh wrote: Hello, thanks you for reporting the job errors. Sorry to all, there was an error on my side setting up a batch of experiment agents. ... They will fail briefly after starting as reported, so not a lot of compute will be wasted.
well, whatever "they will fail briefly after starting" means :-)
Mine are failing after 3.780 - 8.597 seconds :-(
Is there no way to call them back or delete them from the server?
Not anymore. Anyway, after 9:45 UTC something seems to have changed. I have two WUs (fingers crossed and touch wood) that have reached 35% in six hours. |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
can someone give me advice with regard to the following dilemma:
Until last week, on my host with 2 RTX 3070s inside I could process 2 Pythons concurrently on each GPU, i.e. 4 Pythons at a time.
On device_0, VRAM became rather tight - it comes with 8.192MB, about 300-400MB were used for the monitor, and with the two Pythons the total VRAM usage was around 8.112MB (as said: tight, but it worked fine).
On device_1 it was not that tight, since no VRAM is used there for the monitor.
Since yesterday I notice that device_0 uses about 1.400MB for the monitor - no idea why. So there is no way to process 2 Pythons concurrently on it.
And no way for device_1 to run 2 Pythons either, because any additional Python beyond the one running on device_0 and the one running on device_1 would automatically start on device_0.
Hence my question to the experts here: is there a way to tell the third Python to run on device_1 instead of device_0?
Or, any idea how I could lower the VRAM usage for the monitor on device_0? As said, it was much less before, then all of a sudden it jumped up (I was naive enough to connect the monitor cable to device_1 - which did, of course, not work).
Or any other ideas? |
|
|
GSSend message
Joined: 16 Oct 22 Posts: 12 Credit: 1,382,500 RAC: 0 Level
Scientific publications
|
Finally, WU #38 worked and was completed within two hours. Thanks, earned my first points here. |
|
|
|
reboot the system and free up the VRAM maybe.
____________
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Browser tabs are notorious RAM eaters. Both in the cpu and gpu if you have hardware acceleration enabled in the browser.
You can use the exclude_gpu statement in the cc_config.xml file to keep tasks off specific GPUs. I do that to keep the tasks off my fastest GPUs, which run other projects.
But that is permanent for the BOINC session that is booted. You would have to edit cc_config files for different sessions and boot whichever you need to get around this issue. Doable, but cumbersome.
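For completeness, a sketch of what such an exclusion looks like in cc_config.xml, assuming the project URL as it appears in the BOINC Manager; the `<app>` line is optional, and the app name would need to match what the project actually calls it:

```xml
<cc_config>
    <options>
        <exclude_gpu>
            <url>https://www.gpugrid.net/</url>
            <!-- device_num as reported by BOINC at startup -->
            <device_num>0</device_num>
            <!-- optional: restrict the exclusion to one app -->
            <app>PythonGPU</app>
        </exclude_gpu>
    </options>
</cc_config>
```

BOINC reads cc_config.xml at startup (or on "Read config files"), which is why the exclusion is fixed for the running session.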
|
|
|
|
Browser tabs are notorious RAM eaters. Both in the cpu and gpu if you have hardware acceleration enabled in the browser.
good call. forgot the browser can use some GPU resources. that's a good thing to check.
____________
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
many thanks, folks, for your replies regarding my VRAM problem.
I have rebooted, and the VRAM usage of device_0 was almost 2GB. No browser open, no other apps either (except GPU-Z, the MSI Afterburner, MemInfo, DUMeter, and the Windows Task Manager - these apps had been present before, too).
Now, with processing 1 Python on each GPU, the VRAM situation is as follows:
device_0: 6.034MB
device_1: 3.932MB
hence, a second Python could be run on device_1.
I know about the "exclude_gpu" option in cc_config.xml, but for sure this is a very cumbersome method; and I am not even sure whether in Windows a running Python survives a BOINC restart (I think I did that once before, for a different reason, and the Python was gone).
The only thing I could try again is to open the second instance of BOINC which I had configured some time ago, with the "exclude_gpu" provision for device_0.
However, when I tried this out, everything crashed after a short while (1 or 2 hours). I did not find out why. Perhaps it was simply a coincidence and would not happen again?
It's really a pity that with all these various configuration possibilities via cc_config.xml (and also app_config.xml) there is no way to set up a configuration which would solve my problem :-(
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
I think you may have to accept the tasks are what they are. Variable because of the different parameter sets. Some may use little RAM and some may use a lot.
So you may not always be able to run doubles on your 8GB cards. |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
I think you may have to accept the tasks are what they are. Variable because of the different parameter sets. Some may use little RAM and some may use a lot.
So you may not always be able to run doubles on your 8GB cards.
yes, meanwhile I noticed on the other two hosts which are running Pythons ATM that the amount of VRAM used varies.
No problem of course on the host with the Quadro P5000, which comes with 16GB - of which only some 7.5GB are being used even with 4 tasks in parallel, due to the lower number of CUDA cores of this GPU. |
|
|
|
are newer tasks using more VRAM? or is there something on your system using more VRAM?
what is the breakdown of VRAM used by the different processes? that will tell you what process is actually using the vram
____________
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
what is the breakdown of VRAM used by the different processes? that will tell you what process is actually using the vram
hm, I will have to find a tool that tells me :-)
Any recommendation?
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
what is the breakdown of VRAM used by the different processes? that will tell you what process is actually using the vram
hm, I will have to find a tool that tells me :-)
Any recommendation?
nvidia-smi in the Terminal does nicely. |
|
|
|
check here for nvidia-smi use on Windows. it's easy on Linux, but less intuitive on Windows
https://stackoverflow.com/questions/57100015/how-do-i-run-nvidia-smi-on-windows
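Once nvidia-smi is reachable (on the PATH, or called with its full path as in the linked answer), a couple of invocations cover what was asked above. These are standard nvidia-smi options, shown here only as a sketch:

```shell
# refresh the overall summary (per-GPU memory plus the process table) every 5 s
nvidia-smi -l 5

# per-process VRAM usage only, as CSV - one row per compute process
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
```

The second form makes it easy to see which process (a Python task, the desktop compositor, a browser) is actually holding the VRAM on each card.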
____________
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
my hosts still keep receiving faulty tasks which are totally "fresh", no re-submitted ones.
So there must be tons of those still in the bucket :-(
Just noticed that a task failed after >19 hours. This is not nice :-( |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
my hosts still keep receiving faulty tasks which are totally "fresh", no re-submitted ones.
So there must be tons of those still in the bucket :-(
Just noticed that a task failed after >19 hours. This is not nice :-(
I was out for a few hours, and when I came back, I noticed 2 more failed tasks (both ran for almost 3 hours before they crashed).
Whereas at the beginning of the problem, the tasks failed - as also Abouh noted - within short time so that there was not too much of waste, now these tasks fail only after several hours.
Within the past 24 hours, my hosts' total computation time of all the failing tasks was 104.526 seconds = 29 hours!
I am very much willing to support the science with my time, my equipment and my permanently increasing electricity bill as long as it makes sense (and as long as I can afford it).
FYI, my electricity costs have more than tripled since the beginning of the year, for known reasons. That's significant!
I simply cannot believe that all these faulty tasks in the big download bucket cannot be stopped, retrieved, cancelled or whatever else. It makes absolutely no sense to leave them in there and keep sending them out to us for the next several weeks.
If the GPUGRID people cannot confirm that they will quickly find a way to stop these faulty tasks, I have no choice, as sorry as I am, but to switch to other projects :-(
|
|
|
|
For the time being, I have suspended receiving new tasks and reverted to E@H & F@H until this situation with the faulty tasks has been sorted out. |
|
|
KAMasudSend message
Joined: 27 Jul 11 Posts: 137 Credit: 523,901,354 RAC: 0 Level
Scientific publications
|
Most peculiar: I have had no failed tasks, seven successful ones so far.
I wish we could also get a standby task for when there are internet problems.
Maybe they are sending these tasks to those multiple-WU crunching machines who can quickly clear up the backlog :) |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
I have reviewed your recent tasks and there is a mix of faulty and successful tasks. The successful ones are newer and are the only ones being submitted now.
I could not figure out how to cancel the faulty tasks earlier. However, they should almost all, if not all, have been crunched by now.
Maybe other hosts can confirm whether they are still getting tasks that crash, but I expect the problem to be solved now. For the last 2-3 days only good tasks have been sent.
____________
|
|
|
|
@ Erich56: you have to look into the history and the reason for the crashes. I got one of the last replications from workunit 27327972 last night, but that's one that was created on 16 October, almost a week ago. It's just that the first owner hung on to it for five days and did nothing. That's not the project's fault, even if the initial error was. |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
For the last 2-3 days only good tasks have been sent.
thanks, Abouh, for your reply.
When you say what I quoted above - you are talking about "fresh" tasks, right?
However, repetitions (up to 8) of the former, faulty tasks are still going out.
Just an example of a task which one of my hosts received this morning, and which failed after about 2 1/2 hours:
https://www.gpugrid.net/result.php?resultid=33112434 |
|
|
|
Likewise. Since I posted, I've received another one which is likely to go the same way, from workunit 27328975. Another 5-day no-show by a Science United user. |
|
|
GSSend message
Joined: 16 Oct 22 Posts: 12 Credit: 1,382,500 RAC: 0 Level
Scientific publications
|
Maybe that person misjudged how long these tasks can take. If that person asked for a stock of tasks for 7 or 10 days, some will probably never start on that machine before the deadline. |
|
|
KAMasudSend message
Joined: 27 Jul 11 Posts: 137 Credit: 523,901,354 RAC: 0 Level
Scientific publications
|
Maybe that person misjudged how long these tasks can take. If that person asked for a stock of tasks for 7 or 10 days, some will probably never start on that machine before the deadline.
Mine are set to ten plus ten days but I still get one. This is not the reason. |
|
|
|
For the last 2-3 days only good tasks have been sent.
thanks, Abouh, for your reply.
When you say what I quoted above - you are talking about "fresh" tasks, right?
However, repetitions (up to 8) of the former, faulty tasks are still going out.
Just an example of a task which one of my hosts received this morning, and which failed after about 2 1/2 hours:
https://www.gpugrid.net/result.php?resultid=33112434
When you get a resend, especially a high number resend like that, check the reason that it was resent so much. If there’s tons of errors, probably safe to just abort it and not waste your time on it. Especially when you know a bunch of bad tasks had gone out recently.
____________
|
|
|
|
Maybe that person misjudged how long these tasks can take. If that person asked for a stock of tasks for 7 or 10 days, some will probably never start on that machine before the deadline.
Ideally, when a task approaches its deadline it should switch into high-priority mode and move to the front of the line. But the process doesn't always work ideally with BOINC.
But there are also many people who blindly download tasks then shut off their computer for extended periods of time.
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Yes, I meant fresh tasks, which are sent out for the first time out of 8 possible attempts.
Yes, repetitions are an issue. I understand why the limit was set to a relatively high value: many machines in the network with limited GPU memory (e.g. 2 GB) or configuration problems inevitably fail these tasks, so the retries gave the experiments some error tolerance.
However, ideally I would like to be able to modify it momentarily, just for the python apps, for cases like this one. I could set it to 1 for a few hours so all bad tasks are processed quickly, and then go back to 8.
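For reference, the retry limit discussed here corresponds to BOINC's max_error_results workunit parameter. A rough sketch of how it could be set at work-creation time on the server, assuming the standard create_work tool (the app name and flags shown are illustrative and depend on the server version):

```
# Hypothetical batch submission that gives up after a single failed attempt
bin/create_work --appname pythongpu \
    --max_error_results 1 \
    --target_nresults 1 \
    input_file_1 input_file_2
```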
____________
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
Ian&Steve C. wrote:
When you get a resend, especially a high number resend like that, check the reason that it was resent so much. If there’s tons of errors, probably safe to just abort it and not waste your time on it. Especially when you know a bunch of bad tasks had gone out recently.
Well, not a bad idea, if I had the time to babysit my hosts 24/7 :-)
However, this would lead to a problem rather quickly: isn't it still the case that once a certain number of downloaded tasks has been aborted, no further ones will be sent within the following 24 hours?
In fact, I remember that this was even true for failing tasks in the past, based on the assumption that something is wrong with the host. So, in view of the many failed tasks now, I am surprised that I still get new ones within the mentioned 24-hour ban.
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Depends on how they have set up the server software.
There are BOINC configs so that "bad actors" are put into timeout mode when they return a large number of bad results in a short time period. The 24 hour timeout you mentioned.
Once a host starts returning valid results, they are given increasing amounts of work on each scheduler request. |
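The mechanism described here is typically driven by the server's daily result quota, which is scaled down per host on errors and scaled back up on valid results. A sketch of the relevant server-side fragment, assuming a standard BOINC project config.xml (the value shown is illustrative):

```xml
<!-- In the project's config.xml: per-host, per-day cap on results.
     A host's effective quota shrinks each time it returns an error
     and grows back as it returns valid results. -->
<config>
  <daily_result_quota>80</daily_result_quota>
</config>
```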
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
Crazy, I had another task which failed after more than 20 hours :-(
I could live with the situation when a task fails after say 20 minutes or half an hour, once in a while.
There was another task yesterday which failed after almost 20 hours.
And there were numerous tasks in addition which failed after less than one hour but also after much more than one hour.
My assumption is that these misconfigured tasks with 8 repetitions each will be around for many more weeks.
I am sorry but I no longer can live with this waste, particularly with what electricity here costs by now (and getting even more expensive soon).
So I put GPUGRID on NNT and will crunch other projects. As sorry as I am for this step :-(
What I hope is that one day BOINC will develop a mechanism for calling back faulty batches. And I don't understand why this is not possible so far. |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Must be a Windows thing. None of my "bad" formatted tasks run longer than ~40 minutes or so before failing out.
Yes, there are many flaws with BOINC, but unless you can develop a better solution, you will have to use what we have.
Sorry to have you leave the project. |
|
|
KAMasudSend message
Joined: 27 Jul 11 Posts: 137 Credit: 523,901,354 RAC: 0 Level
Scientific publications
|
Crazy, I had another task which failed after more than 20 hours :-(
I could live with the situation when a task fails after say 20 minutes or half an hour, once in a while.
There was another task yesterday which failed after almost 20 hours.
And there were numerous tasks in addition which failed after less than one hour but also after much more than one hour.
My assumption is that these misconfigured tasks with 8 repetitions each will be around for many more weeks.
I am sorry but I no longer can live with this waste, particularly with what electricity here costs by now (and getting even more expensive soon).
So I put GPUGRID on NNT and will crunch other projects. As sorry as I am for this step :-(
What I hope is that one day BOINC will develop a mechanism for calling back faulty batches. And I don't understand why this is not possible so far.
The tasks that were failing were taking around three minutes not twenty hours. |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
The tasks that were failing were taking around three minutes not twenty hours.
for sure NOT 3 minutes. Example here:
20 Oct 2022 | 1:19:26 UTC 20 Oct 2022 | 2:57:36 UTC Error while computing 3,780.66 3,780.66 --- Python apps for GPU hosts v4.04 (cuda1131)
so, in above example, the task failed after 1 Hr 38 mins.
20 Oct 2022 | 1:44:50 UTC 20 Oct 2022 | 3:08:40 UTC Error while computing 5,195.80 5,195.80 --- Python apps for GPU hosts v4.04 (cuda1131)
here, the task failed after 1 hr 23 mins.
but, interestingly enough, here the relation is quite different:
22 Oct 2022 | 6:41:59 UTC 22 Oct 2022 | 7:07:44 UTC Error while computing 70,694.64 70,694.64 --- Python apps for GPU hosts v4.04 (cuda1131)
the task obviously failed after 25 minutes, although runtime and CPU time as indicated would suggest >19 hrs.
These indications are somewhat unclear (to me).
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
You MUST absolutely ignore any reported times for cpu_time and run_time for the Python tasks.
The numbers are meaningless. BOINC is unable to correctly calculate the times because of the dual cpu-gpu nature of the tasks.
If you want to inflate both values, all that is needed is to allocate more cores to the task in a cpu_usage parameter in an app_config.xml.
The task runs in whatever time it needs on your hardware. If one core is used to compute the task the time for cpu_time and run_time = 1X. If two cores are used then the time is 2X, 5 cores = 5X etc.
The only time that is meaningful is the elapsed time between time task sent and time task result is reported. That is the closest we can get to figuring out the true elapsed time. But if you carry a large cache, then dead time sitting in your cache awaiting the chance to run inflates the true time.
Since I only carry a single task at any time, I report one task and receive its replacement on the same scheduler connection so I know my elapsed time is pretty close to the actual difference between sent time and reported time. |
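For anyone experimenting with the cpu_usage parameter mentioned above, a minimal app_config.xml sketch follows. The app name "PythonGPU" is an assumption here; check client_state.xml or the project's apps list for the exact name on your host:

```xml
<!-- app_config.xml placed in the GPUGRID project directory -->
<app_config>
  <app>
    <name>PythonGPU</name>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <!-- CPU cores budgeted per task; raising this also inflates the
           reported cpu_time/run_time figures discussed above -->
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
```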
|
|
KAMasudSend message
Joined: 27 Jul 11 Posts: 137 Credit: 523,901,354 RAC: 0 Level
Scientific publications
|
I get one task at a time also.
Anyway, I got one failure today, task 33115748. It has failed seven times already, with one timed out. It is waiting to go out to someone once more.
Stderr output
<core_client_version>7.20.2</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
06:36:47 (12932): wrapper (7.9.26016): starting
06:36:47 (12932): wrapper: running .\7za.exe (x pythongpu_windows_x86_64__cuda1131.txz -y)
7-Zip (a) 22.01 (x86) : Copyright (c) 1999-2022 Igor Pavlov : 2022-07-15
Scanning the drive for archives:
1 file, 1976180228 bytes (1885 MiB)
Extracting archive: pythongpu_windows_x86_64__cuda1131.txz
--
Path = pythongpu_windows_x86_64__cuda1131.txz
Type = xz
Physical Size = 1976180228
Method = LZMA2:22 CRC64
Streams = 1523
Blocks = 1523
Cluster Size = 4210688
Everything is Ok
Size: 6410311680
Compressed: 1976180228
06:38:33 (12932): .\7za.exe exited; CPU time 100.578125
06:38:33 (12932): wrapper: running C:\Windows\system32\cmd.exe (/C "del pythongpu_windows_x86_64__cuda1131.txz")
06:38:34 (12932): C:\Windows\system32\cmd.exe exited; CPU time 0.000000
06:38:34 (12932): wrapper: running .\7za.exe (x pythongpu_windows_x86_64__cuda1131.tar -y)
7-Zip (a) 22.01 (x86) : Copyright (c) 1999-2022 Igor Pavlov : 2022-07-15
Scanning the drive for archives:
1 file, 6410311680 bytes (6114 MiB)
Extracting archive: pythongpu_windows_x86_64__cuda1131.tar
--
Path = pythongpu_windows_x86_64__cuda1131.tar
Type = tar
Physical Size = 6410311680
Headers Size = 19965952
Code Page = UTF-8
Characteristics = GNU LongName ASCII
Everything is Ok
Files: 38141
Size: 6380353601
Compressed: 6410311680
06:39:39 (12932): .\7za.exe exited; CPU time 21.781250
06:39:39 (12932): wrapper: running C:\Windows\system32\cmd.exe (/C "del pythongpu_windows_x86_64__cuda1131.tar")
06:39:40 (12932): C:\Windows\system32\cmd.exe exited; CPU time 0.000000
06:39:40 (12932): wrapper: running python.exe (run.py)
Starting!!
Windows fix!!
Define rollouts storage
Define scheme
Created CWorker with worker_index 0
Created GWorker with worker_index 0
Created UWorker with worker_index 0
Created training scheme.
Define learner
Created Learner.
Look for a progress_last_chk file - if exists, adjust target_env_steps
Define train loop
Traceback (most recent call last):
File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 196, in get_data
self.next_batch = self.batches.__next__()
AttributeError: 'GWorker' object has no attribute 'batches'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "run.py", line 475, in <module>
main()
File "run.py", line 131, in main
learner.step()
File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\learner.py", line 46, in step
info = self.update_worker.step()
File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\updates\u_worker.py", line 118, in step
self.updater.step()
File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\updates\u_worker.py", line 259, in step
grads = self.local_worker.step(self.decentralized_update_execution)
File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 178, in step
self.get_data()
File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 211, in get_data
self.collector.step()
File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 490, in step
rollouts = self.local_worker.collect_data(listen_to=["sync"], data_to_cpu=False)
File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\collection\c_worker.py", line 168, in collect_data
train_info = self.collect_train_data(listen_to=listen_to)
File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\collection\c_worker.py", line 251, in collect_train_data
self.storage.insert_transition(transition)
File "C:\ProgramData\BOINC\slots\0\python_dependencies\buffer.py", line 794, in insert_transition
state_embeds = [i["StateEmbeddings"] for i in sample[prl.INFO]]
File "C:\ProgramData\BOINC\slots\0\python_dependencies\buffer.py", line 794, in <listcomp>
state_embeds = [i["StateEmbeddings"] for i in sample[prl.INFO]]
KeyError: 'StateEmbeddings'
Traceback (most recent call last):
File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 196, in get_data
self.next_batch = self.batches.__next__()
AttributeError: 'GWorker' object has no attribute 'batches'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "run.py", line 475, in <module>
main()
File "run.py", line 131, in main
learner.step()
File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\learner.py", line 46, in step
info = self.update_worker.step()
File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\updates\u_worker.py", line 118, in step
self.updater.step()
File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\updates\u_worker.py", line 259, in step
grads = self.local_worker.step(self.decentralized_update_execution)
File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 178, in step
self.get_data()
File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 211, in get_data
self.collector.step()
File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 490, in step
rollouts = self.local_worker.collect_data(listen_to=["sync"], data_to_cpu=False)
File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\collection\c_worker.py", line 168, in collect_data
train_info = self.collect_train_data(listen_to=listen_to)
File "C:\ProgramData\BOINC\slots\0\lib\site-packages\pytorchrl\scheme\collection\c_worker.py", line 251, in collect_train_data
self.storage.insert_transition(transition)
File "C:\ProgramData\BOINC\slots\0\python_dependencies\buffer.py", line 794, in insert_transition
state_embeds = [i["StateEmbeddings"] for i in sample[prl.INFO]]
File "C:\ProgramData\BOINC\slots\0\python_dependencies\buffer.py", line 794, in <listcomp>
state_embeds = [i["StateEmbeddings"] for i in sample[prl.INFO]]
KeyError: 'StateEmbeddings'
06:44:10 (12932): python.exe exited; CPU time 1673.984375
06:44:10 (12932): app exit status: 0x1
06:44:10 (12932): called boinc_finish(195)
0 bytes in 0 Free Blocks.
554 bytes in 9 Normal Blocks.
1144 bytes in 1 CRT Blocks.
0 bytes in 0 Ignore Blocks.
0 bytes in 0 Client Blocks.
Largest number used: 0 bytes.
Total allocations: 4443701 bytes.
Dumping objects ->
{11071} normal block at 0x000002340B7911E0, 48 bytes long.
Data: <PSI_SCRATCH=C:\P> 50 53 49 5F 53 43 52 41 54 43 48 3D 43 3A 5C 50
{11030} normal block at 0x000002340B791090, 48 bytes long.
Data: <HOMEPATH=C:\Prog> 48 4F 4D 45 50 41 54 48 3D 43 3A 5C 50 72 6F 67
{11019} normal block at 0x000002340B791170, 48 bytes long.
Data: <HOME=C:\ProgramD> 48 4F 4D 45 3D 43 3A 5C 50 72 6F 67 72 61 6D 44
{11008} normal block at 0x000002340B790FB0, 48 bytes long.
Data: <TMP=C:\ProgramDa> 54 4D 50 3D 43 3A 5C 50 72 6F 67 72 61 6D 44 61
{10997} normal block at 0x000002340B790D80, 48 bytes long.
Data: <TEMP=C:\ProgramD> 54 45 4D 50 3D 43 3A 5C 50 72 6F 67 72 61 6D 44
{10986} normal block at 0x000002340B791020, 48 bytes long.
Data: <TMPDIR=C:\Progra> 54 4D 50 44 49 52 3D 43 3A 5C 50 72 6F 67 72 61
{10905} normal block at 0x0000023409C90AB0, 141 bytes long.
Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65
..\api\boinc_api.cpp(309) : {10902} normal block at 0x0000023409C8E2D0, 8 bytes long.
Data: < _ 4 > 00 00 5F 0B 34 02 00 00
{10127} normal block at 0x0000023409C909E0, 141 bytes long.
Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65
{9380} normal block at 0x0000023409C8E550, 8 bytes long.
Data: < ÊË 4 > 80 CA CB 09 34 02 00 00
..\zip\boinc_zip.cpp(122) : {544} normal block at 0x0000023409C90B80, 260 bytes long.
Data: < > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
{531} normal block at 0x0000023409C8A430, 32 bytes long.
Data: <0‹È 4 Ð†È 4 > 30 8B C8 09 34 02 00 00 D0 86 C8 09 34 02 00 00
{530} normal block at 0x0000023409C88A50, 52 bytes long.
Data: < r ÍÍ > 01 00 00 00 72 00 CD CD 00 00 00 00 00 00 00 00
{525} normal block at 0x0000023409C88580, 43 bytes long.
Data: < p ÍÍ > 01 00 00 00 70 00 CD CD 00 00 00 00 00 00 00 00
{520} normal block at 0x0000023409C886D0, 44 bytes long.
Data: < ÍÍñ†È 4 > 01 00 00 00 00 00 CD CD F1 86 C8 09 34 02 00 00
{515} normal block at 0x0000023409C88B30, 44 bytes long.
Data: < ÍÍQ‹È 4 > 01 00 00 00 00 00 CD CD 51 8B C8 09 34 02 00 00
{505} normal block at 0x0000023409C910C0, 16 bytes long.
Data: < …È 4 > 10 85 C8 09 34 02 00 00 00 00 00 00 00 00 00 00
{504} normal block at 0x0000023409C88510, 40 bytes long.
Data: <À É 4 input.zi> C0 10 C9 09 34 02 00 00 69 6E 70 75 74 2E 7A 69
{497} normal block at 0x0000023409C90EE0, 16 bytes long.
Data: < &É 4 > 08 26 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{496} normal block at 0x0000023409C91610, 16 bytes long.
Data: <à%É 4 > E0 25 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{495} normal block at 0x0000023409C91C00, 16 bytes long.
Data: <¸%É 4 > B8 25 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{494} normal block at 0x0000023409C90DA0, 16 bytes long.
Data: < %É 4 > 90 25 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{493} normal block at 0x0000023409C918E0, 16 bytes long.
Data: <h%É 4 > 68 25 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{492} normal block at 0x0000023409C90D50, 16 bytes long.
Data: <@%É 4 > 40 25 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{490} normal block at 0x0000023409C912F0, 16 bytes long.
Data: < É 4 > 88 00 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{489} normal block at 0x0000023409C89BF0, 32 bytes long.
Data: <username=Compsci> 75 73 65 72 6E 61 6D 65 3D 43 6F 6D 70 73 63 69
{488} normal block at 0x0000023409C90E40, 16 bytes long.
Data: <` É 4 > 60 00 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{487} normal block at 0x0000023409C75300, 64 bytes long.
Data: <PYTHONPATH=.\lib> 50 59 54 48 4F 4E 50 41 54 48 3D 2E 5C 6C 69 62
{486} normal block at 0x0000023409C912A0, 16 bytes long.
Data: <8 É 4 > 38 00 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{485} normal block at 0x0000023409C8A3D0, 32 bytes long.
Data: <PATH=.\Library\b> 50 41 54 48 3D 2E 5C 4C 69 62 72 61 72 79 5C 62
{484} normal block at 0x0000023409C91CA0, 16 bytes long.
Data: < É 4 > 10 00 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{483} normal block at 0x0000023409C91200, 16 bytes long.
Data: <èÿÈ 4 > E8 FF C8 09 34 02 00 00 00 00 00 00 00 00 00 00
{482} normal block at 0x0000023409C91C50, 16 bytes long.
Data: <ÀÿÈ 4 > C0 FF C8 09 34 02 00 00 00 00 00 00 00 00 00 00
{481} normal block at 0x0000023409C91110, 16 bytes long.
Data: < ÿÈ 4 > 98 FF C8 09 34 02 00 00 00 00 00 00 00 00 00 00
{480} normal block at 0x0000023409C91BB0, 16 bytes long.
Data: <pÿÈ 4 > 70 FF C8 09 34 02 00 00 00 00 00 00 00 00 00 00
{479} normal block at 0x0000023409C91520, 16 bytes long.
Data: <HÿÈ 4 > 48 FF C8 09 34 02 00 00 00 00 00 00 00 00 00 00
{478} normal block at 0x0000023409C8A790, 32 bytes long.
Data: <SystemRoot=C:\Wi> 53 79 73 74 65 6D 52 6F 6F 74 3D 43 3A 5C 57 69
{477} normal block at 0x0000023409C90FD0, 16 bytes long.
Data: < ÿÈ 4 > 20 FF C8 09 34 02 00 00 00 00 00 00 00 00 00 00
{476} normal block at 0x0000023409C8A310, 32 bytes long.
Data: <GPU_DEVICE_NUM=0> 47 50 55 5F 44 45 56 49 43 45 5F 4E 55 4D 3D 30
{475} normal block at 0x0000023409C913E0, 16 bytes long.
Data: <øþÈ 4 > F8 FE C8 09 34 02 00 00 00 00 00 00 00 00 00 00
{474} normal block at 0x0000023409C89FB0, 32 bytes long.
Data: <NTHREADS=1 THREA> 4E 54 48 52 45 41 44 53 3D 31 00 54 48 52 45 41
{473} normal block at 0x0000023409C91070, 16 bytes long.
Data: <ÐþÈ 4 > D0 FE C8 09 34 02 00 00 00 00 00 00 00 00 00 00
{472} normal block at 0x0000023409C8FED0, 480 bytes long.
Data: <p É 4 °ŸÈ 4 > 70 10 C9 09 34 02 00 00 B0 9F C8 09 34 02 00 00
{471} normal block at 0x0000023409C91B10, 16 bytes long.
Data: < %É 4 > 20 25 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{470} normal block at 0x0000023409C90F80, 16 bytes long.
Data: <ø$É 4 > F8 24 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{469} normal block at 0x0000023409C91AC0, 16 bytes long.
Data: <Ð$É 4 > D0 24 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{468} normal block at 0x0000023409C88820, 48 bytes long.
Data: </C "del pythongp> 2F 43 20 22 64 65 6C 20 70 79 74 68 6F 6E 67 70
{467} normal block at 0x0000023409C91660, 16 bytes long.
Data: < $É 4 > 18 24 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{466} normal block at 0x0000023409C914D0, 16 bytes long.
Data: <ð#É 4 > F0 23 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{465} normal block at 0x0000023409C91890, 16 bytes long.
Data: <È#É 4 > C8 23 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{464} normal block at 0x0000023409C91A70, 16 bytes long.
Data: < #É 4 > A0 23 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{463} normal block at 0x0000023409C90E90, 16 bytes long.
Data: <x#É 4 > 78 23 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{462} normal block at 0x0000023409C91570, 16 bytes long.
Data: <P#É 4 > 50 23 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{461} normal block at 0x0000023409C8E960, 16 bytes long.
Data: <0#É 4 > 30 23 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{460} normal block at 0x0000023409C8E910, 16 bytes long.
Data: < #É 4 > 08 23 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{459} normal block at 0x0000023409C89A10, 32 bytes long.
Data: <C:\Windows\syste> 43 3A 5C 57 69 6E 64 6F 77 73 5C 73 79 73 74 65
{458} normal block at 0x0000023409C8E8C0, 16 bytes long.
Data: <à"É 4 > E0 22 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{457} normal block at 0x0000023409C889E0, 48 bytes long.
Data: <x pythongpu_wind> 78 20 70 79 74 68 6F 6E 67 70 75 5F 77 69 6E 64
{456} normal block at 0x0000023409C8E7D0, 16 bytes long.
Data: <("É 4 > 28 22 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{455} normal block at 0x0000023409C8E4B0, 16 bytes long.
Data: < "É 4 > 00 22 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{454} normal block at 0x0000023409C8E820, 16 bytes long.
Data: <Ø!É 4 > D8 21 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{453} normal block at 0x0000023409C8E780, 16 bytes long.
Data: <°!É 4 > B0 21 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{452} normal block at 0x0000023409C8E460, 16 bytes long.
Data: < !É 4 > 88 21 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{451} normal block at 0x0000023409C8E500, 16 bytes long.
Data: <`!É 4 > 60 21 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{450} normal block at 0x0000023409C8EA00, 16 bytes long.
Data: <@!É 4 > 40 21 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{449} normal block at 0x0000023409C8E5F0, 16 bytes long.
Data: < !É 4 > 18 21 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{448} normal block at 0x0000023409C8E730, 16 bytes long.
Data: <ð É 4 > F0 20 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{447} normal block at 0x0000023409C884A0, 48 bytes long.
Data: </C "del pythongp> 2F 43 20 22 64 65 6C 20 70 79 74 68 6F 6E 67 70
{446} normal block at 0x0000023409C8E9B0, 16 bytes long.
Data: <8 É 4 > 38 20 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{445} normal block at 0x0000023409C863C0, 16 bytes long.
Data: < É 4 > 10 20 C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{444} normal block at 0x0000023409C85BF0, 16 bytes long.
Data: <è É 4 > E8 1F C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{443} normal block at 0x0000023409C85A60, 16 bytes long.
Data: <À É 4 > C0 1F C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{442} normal block at 0x0000023409C86370, 16 bytes long.
Data: < É 4 > 98 1F C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{441} normal block at 0x0000023409C86460, 16 bytes long.
Data: <p É 4 > 70 1F C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{440} normal block at 0x0000023409C862D0, 16 bytes long.
Data: <P É 4 > 50 1F C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{439} normal block at 0x0000023409C859C0, 16 bytes long.
Data: <( É 4 > 28 1F C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{438} normal block at 0x0000023409C8A370, 32 bytes long.
Data: <C:\Windows\syste> 43 3A 5C 57 69 6E 64 6F 77 73 5C 73 79 73 74 65
{437} normal block at 0x0000023409C86320, 16 bytes long.
Data: < É 4 > 00 1F C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{436} normal block at 0x0000023409C885F0, 48 bytes long.
Data: <x pythongpu_wind> 78 20 70 79 74 68 6F 6E 67 70 75 5F 77 69 6E 64
{435} normal block at 0x0000023409C86410, 16 bytes long.
Data: <H É 4 > 48 1E C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{434} normal block at 0x0000023409C85FB0, 16 bytes long.
Data: < É 4 > 20 1E C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{433} normal block at 0x0000023409C85970, 16 bytes long.
Data: <ø É 4 > F8 1D C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{432} normal block at 0x0000023409C85880, 16 bytes long.
Data: <Ð É 4 > D0 1D C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{431} normal block at 0x0000023409C866E0, 16 bytes long.
Data: <¨ É 4 > A8 1D C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{430} normal block at 0x0000023409C86690, 16 bytes long.
Data: < É 4 > 80 1D C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{429} normal block at 0x0000023409C85F60, 16 bytes long.
Data: <` É 4 > 60 1D C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{428} normal block at 0x0000023409C858D0, 16 bytes long.
Data: <8 É 4 > 38 1D C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{427} normal block at 0x0000023409C85830, 16 bytes long.
Data: < É 4 > 10 1D C9 09 34 02 00 00 00 00 00 00 00 00 00 00
{426} normal block at 0x0000023409C91D10, 2976 bytes long.
Data: <0XÈ 4 .\7za.ex> 30 58 C8 09 34 02 00 00 2E 5C 37 7A 61 2E 65 78
{65} normal block at 0x0000023409C86550, 16 bytes long.
Data: < ê×W÷ > 80 EA D7 57 F7 7F 00 00 00 00 00 00 00 00 00 00
{64} normal block at 0x0000023409C85920, 16 bytes long.
Data: <@é×W÷ > 40 E9 D7 57 F7 7F 00 00 00 00 00 00 00 00 00 00
{63} normal block at 0x0000023409C860F0, 16 bytes long.
Data: <øWÔW÷ > F8 57 D4 57 F7 7F 00 00 00 00 00 00 00 00 00 00
{62} normal block at 0x0000023409C85C90, 16 bytes long.
Data: <ØWÔW÷ > D8 57 D4 57 F7 7F 00 00 00 00 00 00 00 00 00 00
{61} normal block at 0x0000023409C85B50, 16 bytes long.
Data: <P ÔW÷ > 50 04 D4 57 F7 7F 00 00 00 00 00 00 00 00 00 00
{60} normal block at 0x0000023409C85DD0, 16 bytes long.
Data: <0 ÔW÷ > 30 04 D4 57 F7 7F 00 00 00 00 00 00 00 00 00 00
{59} normal block at 0x0000023409C86230, 16 bytes long.
Data: <à ÔW÷ > E0 02 D4 57 F7 7F 00 00 00 00 00 00 00 00 00 00
{58} normal block at 0x0000023409C85B00, 16 bytes long.
Data: < ÔW÷ > 10 04 D4 57 F7 7F 00 00 00 00 00 00 00 00 00 00
{57} normal block at 0x0000023409C860A0, 16 bytes long.
Data: <p ÔW÷ > 70 04 D4 57 F7 7F 00 00 00 00 00 00 00 00 00 00
{56} normal block at 0x0000023409C85C40, 16 bytes long.
Data: < ÀÒW÷ > 18 C0 D2 57 F7 7F 00 00 00 00 00 00 00 00 00 00
Object dump complete.
</stderr_txt>
]]> |
|
|
|
@ Erich56, @ KAMasud
Please teach yourselves how to make hyperlinks to the original record for tasks or workunits you wish to draw to our attention.
It makes this thread far more readable, and gives us access to the full picture - we might be interested in some detail that didn't catch your eye. |
|
|
|
Hello,
all my tasks behave in the same way: they advance to 4% and then have no activity. I have to cancel them after several hours of idle time.
Example: https://www.gpugrid.net/result.php?resultid=33109419
The machine is equipped with a GTX1080, 32GB of RAM and 16GB of swap.
Thank you for your help
____________
|
|
|
|
Example: https://www.gpugrid.net/result.php?resultid=33109419
OSError: [WinError 1455] Le fichier de pagination est insuffisant pour terminer cette opération. [English: "The paging file is too small for this operation to complete."] Error loading "D:\BOINC\slots\3\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll" or one of its dependencies.
Your page file still isn't large enough. |
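To see what Windows currently allocates before resizing, the page file can be inspected from PowerShell (standard CIM classes; sizes are reported in MB):

```
# Current page file allocation and peak usage, in MB
Get-CimInstance Win32_PageFileUsage |
    Select-Object Name, AllocatedBaseSize, CurrentUsage, PeakUsage
```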
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
@ Erich56, @ KAMasud
Please teach yourselves how to make hyperlinks to the original record for tasks or workunits you wish to draw to our attention.
It makes this thread far more readable, and gives us access to the full picture - we might be interested in some detail that didn't catch your eye.
Hi Richard,
I do know how to put a hyperlink into my texts. In my previous posting, my main intention was to show the time the task was received and later on sent back after failure, so I didn't deem it necessary to hyperlink the task itself.
But you are right: there may be more details which could be of interest to you guys, no doubt. So in the future, whenever referring to a given task, I'll hyperlink it.
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
Keith wrote: The only time that is meaningful is the elapsed time between time task sent and time task result is reported. That is the closest we can get to figuring out the true elapsed time. But if you carry a large cache, then dead time sitting in your cache awaiting the chance to run inflates the true time.
Since I only carry a single task at any time, I report one task and receive its replacement on the same scheduler connection so I know my elapsed time is pretty close to the actual difference between sent time and reported time.
What you say in the last paragraph is also true for my hosts.
I agree with what you wrote in the paragraph before. That's why in my posting I cited the times when the tasks were received and then reported back, after failure. These were the actual runtimes, with no "sitting" time included.
|
|
|
KAMasudSend message
Joined: 27 Jul 11 Posts: 137 Credit: 523,901,354 RAC: 0 Level
Scientific publications
|
@ Erich56, @ KAMasud
Please teach yourselves how to make hyperlinks to the original record for tasks or workunits you wish to draw to our attention.
It makes this thread far more readable, and gives us access to the full picture - we might be interested in some detail that didn't catch your eye.
Richard, could you please make a different thread and teach us all the tricks? We would be very grateful.
Looked it up in Wikipedia and ended with not much. There should be some page on Boinc itself, can you give the link? |
|
|
|
There should be some page on Boinc itself, can you give the link?
There is. To the top left of the text entry box where you type a message (just below the word 'Author' on the grey divider line), there's a link:
Use BBCode tags to format your text
That opens in a separate browser window (or tab), so you can refer to it while composing your message. Use the 'quote' button below this message to see how I've made the link work here. |
|
|
|
Erich, you still misunderstand. With these Python tasks you can't just rely on the times when you reported the task, since it looks like your system sat on these tasks for some time before reporting them. You also can't rely on the runtime counters, since it's been known for a long time that they are incorrect due to the multithreaded nature of the tasks (more cores = more reported runtime), and the amount by which they are incorrect will vary from system to system. The ONLY accurate way to check is to look at the timestamps in the stderr output.
for sure NOT 3 minutes. Example here:
20 Oct 2022 | 1:19:26 UTC 20 Oct 2022 | 2:57:36 UTC Error while computing 3,780.66 3,780.66 --- Python apps for GPU hosts v4.04 (cuda1131)
so, in above example, the task failed after 1 Hr 38 mins.
link to this one: http://www.gpugrid.net/result.php?resultid=33105596
from the stderr:
04:45:25 (5200): wrapper (7.9.26016): starting
04:45:25 (5200): wrapper: running .\7za.exe (x pythongpu_windows_x86_64__cuda1131.txz -y)
04:48:28 (5200): .\7za.exe exited; CPU time 179.609375
04:48:28 (5200): wrapper: running C:\Windows\system32\cmd.exe (/C "del pythongpu_windows_x86_64__cuda1131.txz")
04:48:29 (5200): C:\Windows\system32\cmd.exe exited; CPU time 0.000000
04:48:29 (5200): wrapper: running .\7za.exe (x pythongpu_windows_x86_64__cuda1131.tar -y)
04:49:00 (5200): .\7za.exe exited; CPU time 30.109375
04:49:00 (5200): wrapper: running C:\Windows\system32\cmd.exe (/C "del pythongpu_windows_x86_64__cuda1131.tar")
04:49:02 (5200): C:\Windows\system32\cmd.exe exited; CPU time 0.000000
04:49:02 (5200): wrapper: running python.exe (run.py)
Starting!!
...
[lots of traceback errors here]
[then..]
04:55:55 (5200): python.exe exited; CPU time 3570.937500
04:55:55 (5200): app exit status: 0x1
04:55:55 (5200): called boinc_finish(195)
Just look at the timestamps. You started processing the task at 4:45 and BOINC finished it at 4:55, so it only actually ran for 10 mins. You either waited ~1 hr before starting this task, or waited ~1 hr before reporting it. It is very common behavior for the BOINC client to extend your project communication time when it detects a computation error.
20 Oct 2022 | 1:44:50 UTC 20 Oct 2022 | 3:08:40 UTC Error while computing 5,195.80 5,195.80 --- Python apps for GPU hosts v4.04 (cuda1131)
here, the task failed after 1 hr 23 mins.
this task here: http://www.gpugrid.net/result.php?resultid=33105606
04:56:11 (9280): wrapper (7.9.26016): starting
...
05:06:33 (9280): called boinc_finish(195)
Same story here: it only ran for 10 minutes.
But, interestingly enough, here the relation is quite different:
22 Oct 2022 | 6:41:59 UTC 22 Oct 2022 | 7:07:44 UTC Error while computing 70,694.64 70,694.64 --- Python apps for GPU hosts v4.04 (cuda1131)
the task obviously failed after 25 minutes, although runtime and CPU time as indicated would suggest >19 hrs.
These indications are somewhat unclear (to me).
this task here: http://www.gpugrid.net/result.php?resultid=33111849
08:42:24 (6280): wrapper (7.9.26016): starting
...
09:05:40 (6280): called boinc_finish(195)
This one ran for about 23 mins; there was less of a delay in starting or reporting this one.
I hope this clarifies what you should be looking at to make accurate determinations about run time.
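To make that concrete, here is a small sketch (a hypothetical helper, not part of BOINC) that pulls the first and last wrapper timestamps out of a pasted stderr output and computes the true run time; the timestamp format matches the logs quoted above:

```python
# Hypothetical helper: compute the true run time of a task from the
# "HH:MM:SS (pid):" timestamps in its stderr output.
import re
from datetime import datetime, timedelta

def true_runtime(stderr_text):
    """Return the wall-clock time between the first and last timestamp."""
    stamps = re.findall(r"^(\d{2}:\d{2}:\d{2}) \(\d+\):", stderr_text, re.M)
    if len(stamps) < 2:
        return None
    fmt = "%H:%M:%S"
    start = datetime.strptime(stamps[0], fmt)
    end = datetime.strptime(stamps[-1], fmt)
    if end < start:          # the run crossed midnight
        end += timedelta(days=1)
    return end - start

# The first task quoted above:
log = """04:45:25 (5200): wrapper (7.9.26016): starting
04:55:55 (5200): called boinc_finish(195)"""
print(true_runtime(log))  # 0:10:30
```

Applied to the first example above, it confirms roughly 10 minutes of actual processing.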
____________
|
|
|
|
BOINC itself makes it even easier to check the numbers. In the root of the BOINC data folder, you'll find a plain text file called
job_log_www.gpugrid.net.txt
It contains one line for each successful task, newest at the bottom.
Here's one of my recent shorties - task 33104232
1666088826 ue 1354514.775804 ct 1290.400000 fe 1000000000000000000 nm e00001a00003-ABOU_rnd_ppod_expand_demos25_17-0-1-RND1967_0 et 541.083257 es 0
That's very dense, but we're only interested in two numbers:
ct 1290.400000
et 541.083257
That's "CPU time" and "elapsed time", respectively. You'll see that both of those have been converted to 1,290.40 in the online report. |
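For anyone who wants to script this, a minimal parser for those job-log lines (field meanings assumed from the layout quoted above: a Unix timestamp followed by key/value pairs) could look like:

```python
# Sketch: pull "ct" (CPU time) and "et" (elapsed time) out of one line
# of job_log_www.gpugrid.net.txt, using the field layout shown above.
def parse_job_log_line(line):
    fields = line.split()
    rec = {"timestamp": int(fields[0])}
    # the remaining fields come in "key value" pairs: ue, ct, fe, nm, et, es
    return {**rec, **dict(zip(fields[1::2], fields[2::2]))}

line = ("1666088826 ue 1354514.775804 ct 1290.400000 fe 1000000000000000000 "
        "nm e00001a00003-ABOU_rnd_ppod_expand_demos25_17-0-1-RND1967_0 "
        "et 541.083257 es 0")
rec = parse_job_log_line(line)
print(rec["ct"], rec["et"])  # 1290.400000 541.083257
```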
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
ok guys, many thanks for clarification :-) I now got it :-)
So, as it seems, none of my tasks were running for 23 hours or so before they failed; which is very good news! |
|
|
KAMasudSend message
Joined: 27 Jul 11 Posts: 137 Credit: 523,901,354 RAC: 0 Level
Scientific publications
|
There should be some page on Boinc itself, can you give the link?
There is. To the top left of the text entry box where you type a message (just below the word 'Author' on the grey divider line), there's a link:
Use BBCode tags to format your text
That opens in a separate browser window (or tab), so you can refer to it while composing your message. Use the 'quote' button below this message to see how I've made the link work here.
Thank you, Richard. I will give it a try, at my age. Difficult but where do you get the matter to put in the middle? For example the WU?
[quote]27329068[quote]
I do not think it will work though.
Forget that I even asked.
[list]27329068[list]
Yuck. How do I get that WU number to pop up? |
|
|
|
Thank you, Richard. I will give it a try, at my age. Difficult but where do you get the matter to put in the middle? For example the WU?
OK, let's go through it step-by-step. This is how my seventy-year-old brain breaks it down. We'll use the most recent one I linked.
I've got it open in another tab. The address bar in that tab is showing the full url:
https://www.gpugrid.net/result.php?resultid=33104232
First, I type the word task into the message.
task
Then, I swipe across that word (all four letters) to highlight it, and click the URL button above the message:
{url}task{/url}
Then, I put an equals sign in the first bracket, and add that address from the other tab:
{url=https://www.gpugrid.net/result.php?resultid=33104232}task{/url}
Finally, I double-click on the number, copy it, and paste it in the central section:
{url=https://www.gpugrid.net/result.php?resultid=33104232}task 33104232{/url}
I've been changing the square brackets into braces, so they can be seen. Changing them back, the finished result is:
task 33104232
In summary:
The first bracket contains the page on the website you want to take people to.
Between the brackets, you can put anything you like - a simple description.
The final bracket simply tidies things up neatly.
|
|
|
KAMasudSend message
Joined: 27 Jul 11 Posts: 137 Credit: 523,901,354 RAC: 0 Level
Scientific publications
|
Thank you, Richard. I will give it a try, at my age. Difficult but where do you get the matter to put in the middle? For example the WU?
OK, let's go through it step-by-step. This is how my seventy-year-old brain breaks it down. We'll use the most recent one I linked.
I've got it open in another tab. The address bar in that tab is showing the full url:
https://www.gpugrid.net/result.php?resultid=33104232
First, I type the word task into the message.
task
Then, I swipe across that word (all four letters) to highlight it, and click the URL button above the message:
{url}task{/url}
Then, I put an equals sign in the first bracket, and add that address from the other tab:
{url=https://www.gpugrid.net/result.php?resultid=33104232}task{/url}
Finally, I double-click on the number, copy it, and paste it in the central section:
{url=https://www.gpugrid.net/result.php?resultid=33104232}task 33104232{/url}
I've been changing the square brackets into braces, so they can be seen. Changing them back, the finished result is:
task 33104232
In summary:
The first bracket contains the page on the website you want to take people to.
Between the brackets, you can put anything you like - a simple description.
The final bracket simply tidies things up neatly.
At least our brains are at par. Maybe the steamships I worked on.
task 27329068
Let us give it a try.
I re-edited. :) |
|
|
KAMasudSend message
Joined: 27 Jul 11 Posts: 137 Credit: 523,901,354 RAC: 0 Level
Scientific publications
|
Thank you, Richard. I will give it a try, at my age. Difficult but where do you get the matter to put in the middle? For example the WU?
OK, let's go through it step-by-step. This is how my seventy-year-old brain breaks it down. We'll use the most recent one I linked.
I've got it open in another tab. The address bar in that tab is showing the full url:
https://www.gpugrid.net/result.php?resultid=33104232
First, I type the word task into the message.
task
Then, I swipe across that word (all four letters) to highlight it, and click the URL button above the message:
{url}task{/url}
Then, I put an equals sign in the first bracket, and add that address from the other tab:
{url=https://www.gpugrid.net/result.php?resultid=33104232}task{/url}
Finally, I double-click on the number, copy it, and paste it in the central section:
{url=https://www.gpugrid.net/result.php?resultid=33104232}task 33104232{/url}
I've been changing the square brackets into braces, so they can be seen. Changing them back, the finished result is:
task 33104232
In summary:
The first bracket contains the page on the website you want to take people to.
Between the brackets, you can put anything you like - a simple description.
The final bracket simply tidies things up neatly.
At least our brains are at par. Maybe the steamships I worked on.
task 27329068
Let us give it a try.
I re-edited. :)
Anyway, as you can all read, the txt files being generated get confused about the completion time. I watch the Task Manager: as soon as the sawtooth stops, I know.
It took three minutes. |
|
|
|
This has been reported and explained many times in this thread. These tasks report CPU time as elapsed time. That’s why it’s so far off. Since these tasks are multithreaded, CPU time gets greatly inflated.
A normal GPU task might use 100% of a single core, in that case CPU time matches pretty closely to elapsed time. That’s what we are used to seeing.
However, these tasks are multithreaded, using 32 threads or more for processing (constrained by your physical hardware if you have fewer than that). When a task is multithreaded, CPU time is equal to the SUM of the CPU time from all the threads that processed that WU. As a simplistic example, say you have a 4-thread CPU and the task used all threads at 75% utilization for 5 minutes. CPU time (in seconds) would be 4 * 0.75 * 300 = 900 seconds. Now you can see how adding more cores can greatly increase this number.
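The worked example above, written out as a trivial calculation:

```python
# Summed CPU time for a multithreaded task, as in the example above.
def summed_cpu_time(threads, utilization, elapsed_seconds):
    return threads * utilization * elapsed_seconds

# 4 threads at 75% utilization for 5 minutes:
print(summed_cpu_time(4, 0.75, 300))  # 900.0
```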
Looking at the start and stop timestamps of your task, it ran for about 5 mins.
____________
|
|
|
|
These tasks report CPU time as elapsed time.
Actually, that's not quite right.
The report (made in sched_request_www.gpugrid.net.xml) is accurate - it's after it lands in the server that it's filed in the wrong pocket.
I've got a couple of tasks finishing in the next hour / 90 minutes - I'll try to catch the report for one of them. |
|
|
|
It’s correct. You just misinterpreted my perspective.
I was talking about what the website reports to us. Not what we report to the server.
____________
|
|
|
|
Anyway, I caught one just to clarify my perspective.
<result>
<name>e00021a01361-ABOU_rnd_ppod_expand_demos25_20-0-1-RND2109_0</name>
<final_cpu_time>151352.900000</final_cpu_time>
<final_elapsed_time>54305.405065</final_elapsed_time>
<exit_status>0</exit_status>
<state>5</state>
<platform>x86_64-pc-linux-gnu</platform>
<version_num>403</version_num>
<plan_class>cuda1131</plan_class>
<final_peak_working_set_size>4950069248</final_peak_working_set_size>
<final_peak_swap_size>17198002176</final_peak_swap_size>
<final_peak_disk_usage>10656485468</final_peak_disk_usage>
<app_version_num>403</app_version_num>
That matches what it says in the job log:
ct 151352.900000 et 54305.405065
But not what is says on the website:
task 33116901
I'm going on about it because, if it were a problem in the client, we could patch the code and fix it. But because it happens on the server, it's not even worth trying. Precision in language matters. |
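Incidentally, those `<result>` fields are easy to read back with a few lines of Python, for example to compute the CPU-time-to-elapsed-time ratio for a task (a rough sketch using the fragment quoted above):

```python
# Sketch: read the <result> block reported in
# sched_request_www.gpugrid.net.xml with the standard library,
# and compare CPU time against elapsed time.
import xml.etree.ElementTree as ET

xml_fragment = """<result>
<name>e00021a01361-ABOU_rnd_ppod_expand_demos25_20-0-1-RND2109_0</name>
<final_cpu_time>151352.900000</final_cpu_time>
<final_elapsed_time>54305.405065</final_elapsed_time>
</result>"""

result = ET.fromstring(xml_fragment)
cpu = float(result.findtext("final_cpu_time"))
elapsed = float(result.findtext("final_elapsed_time"))
print(f"CPU time is {cpu / elapsed:.1f}x elapsed time")  # CPU time is 2.8x elapsed time
```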
|
|
GSSend message
Joined: 16 Oct 22 Posts: 12 Credit: 1,382,500 RAC: 0 Level
Scientific publications
|
If you want to inflate both values, all that is needed is to allocate more cores to the task in a cpu_usage parameter in an app_config.xml.
The task runs in whatever time it needs on your hardware. If one core is used to compute the task the time for cpu_time and run_time = 1X. If two cores are used then the time is 2X, 5 cores = 5X etc.
I have a question: Currently, I'm running a Python task with 1 core and one GPU.
Would the crunching time decrease if I allocate more cores to this task? 2 cores equals 50%, 4 cores equals 25%?
I know how to tweak the app_config.xml, but I want to ask before I waste time with tinkering. |
|
|
|
If you want to inflate both values, all that is needed is to allocate more cores to the task in a cpu_usage parameter in an app_config.xml.
The task runs in whatever time it needs on your hardware. If one core is used to compute the task the time for cpu_time and run_time = 1X. If two cores are used then the time is 2X, 5 cores = 5X etc.
I have a question: Currently, I'm running a Python task with 1 core and one GPU.
Would the crunching time decrease if I allocate more cores to this task? 2 cores equals 50%, 4 cores equals 25%?
I know how to tweak the app_config.xml, but I want to ask before I waste time with tinkering.
I assume you're talking about the app_config settings when you say "allocate". As a reminder, these settings do not change how much CPU is used by the app; the app uses whatever it needs no matter what settings you choose (up to physical constraints). The only way you can constrain CPU use is to do something like run a virtual machine with fewer cores allocated to it than the host has. Otherwise the app still has full access to all your cores, and if you monitor CPU use by the various processes you'll observe this.
If you're not running any other tasks (other CPU projects) at the same time, then changing the CPU allocation likely won't have any impact on your completion times, since the task is already using all of your cores.
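For reference, the kind of app_config.xml fragment under discussion typically looks like this. The app name and the numbers here are illustrative (check client_state.xml for the exact short app name GPUGrid uses), and remember this only changes BOINC's scheduling bookkeeping, not what the app actually consumes:

```xml
<app_config>
    <app>
        <name>PythonGPU</name>
        <gpu_versions>
            <gpu_usage>1.0</gpu_usage>
            <cpu_usage>8.0</cpu_usage>
        </gpu_versions>
    </app>
</app_config>
```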
____________
|
|
|
GSSend message
Joined: 16 Oct 22 Posts: 12 Credit: 1,382,500 RAC: 0 Level
Scientific publications
|
Thanks for the fast reply. I'm running MCM from WCG on my machine in parallel. I will do a short test and suspend all other tasks. The question is: Will Python add more cores to this task if the other cores become available?
My system: Ryzen 9 5950X, NVidia RTX 3060 Ti, 64 GB RAM, WIN 10 |
|
|
|
Don't think of it in that sense.
These tasks will spawn 32+ processes no matter how many cores you have or how much you allocate in BOINC. These processes need to be serviced by the CPU. If you have many processes and not enough threads to service them all, they will need to wait in the priority queue against all other processes.
Increasing the BOINC CPU allocation for the Python tasks will stop processing by other competing BOINC CPU tasks, leaving more free resources for the Python processes. So they will get the opportunity to use more CPU in a shorter amount of time, but probably not much different total CPU time; in other words, the tasks should run faster since they aren't competing with the other CPU work.
____________
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
...the only way you can constrain CPU use is to do something like run a virtual machine with fewer cores allocated to it than the host has. Otherwise the app still has full access to all your cores, and if you monitor CPU use by the various processes you'll observe this.
If you're not running any other tasks (other CPU projects) at the same time, then changing the CPU allocation likely won't have any impact on your completion times, since the task is already using all of your cores.
However, you guys recently stated that the best way is not to run any other projects while processing Python tasks.
I can confirm. A week ago, I ran one LHC-ATLAS task (2-core, virtual machine) together with 2 Pythons (1 per GPU), and after a while the system crashed.
Since then, only Pythons have been processed - no crashes so far.
|
|
|
GSSend message
Joined: 16 Oct 22 Posts: 12 Credit: 1,382,500 RAC: 0 Level
Scientific publications
|
Well,
CPU load was 100 % before with 30 MCM tasks running in parallel. Now, only the Python task is running and the CPU load is between 40 and 75 %. GPU load has not changed and is between 18 and 22 % like before.
Looks like it is progressing faster than before ;-) |
|
|
GSSend message
Joined: 16 Oct 22 Posts: 12 Credit: 1,382,500 RAC: 0 Level
Scientific publications
|
Found a nice balance between MCM and Python tasks. Now I run 7 MCM tasks and 1 Python task, and the CPU load is about 99 %. |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
There was a task which ran for about 20 hours and yielded a credit of 45,000:
https://www.gpugrid.net/result.php?resultid=33117861
How come? |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Currently, credits are not defined by execution time, but by the maximum possible compute effort. In particular, these AI experiments consist of training AI agents, and a maximum number of learning steps for the AI agents is defined as a target. That means that the agent interacts with its simulated environment and then learns from these interactions for a certain amount of time.
However, if some condition is met earlier, the task ends. There is a certain amount of randomness in the learning process, but the amount of credit is defined by the upper bound of training steps, independently of whether the task finished earlier or not. That is the number of learning steps the agent would take if the early stopping condition were never met.
In general, the condition is met more often by earlier RL agents in the populations than by later ones. It can also vary from experiment to experiment. Locally, the tasks last on average 10-14 h.
____________
|
|
|
KAMasudSend message
Joined: 27 Jul 11 Posts: 137 Credit: 523,901,354 RAC: 0 Level
Scientific publications
|
Don't think of it in that sense.
These tasks will spawn 32+ processes no matter how many cores you have or how much you allocate in BOINC. These processes need to be serviced by the CPU. If you have many processes and not enough threads to service them all, they will need to wait in the priority queue against all other processes.
Increasing the BOINC CPU allocation for the Python tasks will stop processing by other competing BOINC CPU tasks, leaving more free resources for the Python processes. So they will get the opportunity to use more CPU in a shorter amount of time, but probably not much different total CPU time; in other words, the tasks should run faster since they aren't competing with the other CPU work.
I have a question also. Maybe Richard might understand it better. I also run CPDN tasks, which are very few and far between. So I gave zero resources to Moo Wrapper and ran it in parallel: whenever there was no CPDN task, Moo would send me WUs.
Now with GPUgrid tasks, this is not the case. These tasks do not register in BOINC as a task for some reason. If I am crunching a GPUgrid task, then I SHOULD not get a Moo task. That is the correct procedure, but when I shifted from CPDN to here, I was running one GPUgrid task (on all cores) as well as twelve Moo tasks. That is thirteen tasks. I am not worried about whether it can be done, but why is this happening? |
|
|
|
Without having full details of how your copy of BOINC is configured, and how the tasks from each project are configured to run (in particular, the resource assignment for each task type) it's impossible to say.
This may help:
That machine has six CPU cores, but it's only running five tasks. That's because BOINC has committed 3+1+0.5+0.5+1 = 6 cores, and there are none left. If one of the GPU applications had been configured to require 2.99 CPUs, or 0.49 CPUs, the total core allocation would have fallen "below six", and BOINC's rules say that another task can be started. |
|
|
|
Example: https://www.gpugrid.net/result.php?resultid=33109419
OSError: [WinError 1455] Le fichier de pagination est insuffisant pour terminer cette opération. [The paging file is too small for this operation to complete.] Error loading "D:\BOINC\slots\3\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll" or one of its dependencies.
Your page file still isn't large enough.
I needed to push the swap file size up to 32 GB, but now it's OK.
Even if the GPU activity rate is low and the Python task does not respect the number of threads allocated to it... no problem, go ahead science! |
|
|
KAMasudSend message
Joined: 27 Jul 11 Posts: 137 Credit: 523,901,354 RAC: 0 Level
Scientific publications
|
Without having full details of how your copy of BOINC is configured, and how the tasks from each project are configured to run (in particular, the resource assignment for each task type) it's impossible to say.
This may help:
That machine has six CPU cores, but it's only running five tasks. That's because BOINC has committed 3+1+0.5+0.5+1 = 6 cores, and there are none left. If one of the GPU applications had been configured to require 2.99 CPUs or 0.49 CPUs, the total core allocation would have fallen "below six", and BOINC's rules say that another task can be started.
Boinc version 7.20.2. Stock, out of the box. If there is a thread where I can learn mischief let me know.
It is stock Boinc and I have allocated 100% of resources to GPUGrid plus 0% resources to Moo Wrapper. In case of no task from GPUGrid, I can get Moo tasks.
I am in a hot, arid part of South Asia so I have to keep an eye on Temperatures. I don't want a puddle of plastic. Having too many cores is not an advantage in my case. |
|
|
|
According to my work-in-progress listings, I received this WU listed as in progress: https://www.gpugrid.net/result.php?resultid=33134063 but it is non-existent on the computer. Since it doesn't exist, I can't abort it or anything, so the project will have to remove it from my queue and reassign it. |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
On one of my hosts, a Python task has now been running for almost 3 times as long as all the "long" ones before.
There is CPU activity, also GPU activity + VRAM usage in the usual range. Also RAM.
The slot in the project folder is also filled with some 8.25 GB.
Still, I am not sure whether this task has somehow hung itself up.
Could this still be a valid task, or should I better terminate it? |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
On one of my hosts, a Python task has now been running for almost 3 times as long as all the "long" ones before.
There is CPU activity, also GPU activity + VRAM usage in the usual range. Also RAM.
The slot in the project folder is also filled with some 8.25 GB.
Still, I am not sure whether this task has somehow hung itself up.
Could this still be a valid task, or should I better terminate it?
I now looked up the task history - it failed on 7 other hosts.
So I'd better cancel it :-)
|
|
|
|
Can you check whether wrapper_run.out changes, and the number of samples collected?
There should be a config file in the slot directory that contains the start sample number and the end sample number. You can use subtraction to determine the target number of samples. |
|
|
|
File name is conf.yaml
parameters are
start_env_steps and target_env_steps. |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
File name is conf.yaml
parameters are
start_env_steps and target_env_steps.
I had already abortet the task mentioned above when I now read your posting.
But I looked up the figures in a task which is in process right now. It says:
start_env_steps: 25000000
sticky_actions: true
target_env_steps: 50000000
So what exactly do the figures mean - in this case, has about half of the task been processed? |
|
|
|
I think it means that previous crunchers have already crunched up to 25000000 steps and your workunit will continue to 50000000.
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Yes, this is exactly what it means. Most parameters in the config file define the specifics of the agent training process.
In this case, these parameters specify that the initial AI agent will be loaded from a previous agent that has already taken 25,000,000 steps in its simulated environment, so it is not taking completely random actions. The agent will continue the process, interacting 25,000,000 more times with the environment and learning from its successes and failures.
Other parameters specify the type of algorithm used for learning, the number of copies of the environment used to speed up the interactions (32), and many other things.
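Putting that into a few lines of Python (a naive parse of just those two conf.yaml keys, deliberately avoiding a PyYAML dependency), the number of environment steps left for this workunit is simply the difference:

```python
# Rough sketch: read start_env_steps and target_env_steps from a
# conf.yaml in the slot directory and subtract to get this WU's share.
def remaining_env_steps(conf_text):
    vals = {}
    for line in conf_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() in ("start_env_steps", "target_env_steps"):
            vals[key.strip()] = int(value)
    return vals["target_env_steps"] - vals["start_env_steps"]

conf = """start_env_steps: 25000000
sticky_actions: true
target_env_steps: 50000000"""
print(remaining_env_steps(conf))  # 25000000
```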
____________
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
What I have noticed within the past few days is that the runtime of the Pythons has increased.
Whereas until a short time ago some tasks on all of my hosts finished in under 24 hrs, now every task lasts > 24 hrs. |
|
|
|
Try to reduce the number of simultaneously running workunits. |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
I've rarely had a short runner in weeks; now almost all tasks take more than 24 hours.
They usually miss by just a few minutes, which is disheartening.
But I won't be reducing the compute load since I only run a single Python task on each host along with multiple other projects work.
I just accept the lesser credit while still maintaining a full load of my other projects which aren't impacted too much by the single Python task. |
|
|
KAMasudSend message
Joined: 27 Jul 11 Posts: 137 Credit: 523,901,354 RAC: 0 Level
Scientific publications
|
What I am noticing is that my two machines running no other project are completing the tasks which others have errored out on. I think Python loves to run free, without companions to keep it company. |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
What I am noticing is, my two machines running no other project are completing the tasks which others have errored out on. I think Python loves to run free without companions to keep it company.
this is exactly my observation, too. |
|
|
AsghanSend message
Joined: 30 Oct 19 Posts: 6 Credit: 405,900 RAC: 0 Level
Scientific publications
|
The only thing I noticed:
The biggest lie for the new python tasks is "0.9 CPU".
My current task, and the one before it, were using 20 out of my 24 cores on my 5900X...
Please support the Tensor Cores as soon as possible, my 4090 is getting bored :/ |
|
|
|
Some errored tasks crash because someone was trying to run them on a GTX 680 with 2 GB of VRAM. |
|
|
KAMasudSend message
Joined: 27 Jul 11 Posts: 137 Credit: 523,901,354 RAC: 0 Level
Scientific publications
|
task 33145039
Example. Seven computers have crashed this work unit. Richard or someone else who can read the files can find out why. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Hello! I just checked the failed submissions of this job, and in each case it failed for a different reason.
1. ERROR: Cannot set length for output file : There is not enough space on the disk
2. DefaultCPUAllocator: not enough memory (GPU memory?)
3. RuntimeError: Unable to find a valid cuDNN algorithm to run convolution (GPU not supported by cuda?)
4. Failed to establish a new connection (connection failed to install the only pipeable dependency)
5. AssertionError. assert ports_found (some port configuration missing?)
6. BrokenPipeError: [WinError 232] The pipe is being closed (for some reason multiprocessing broke, I am guessing not enough memory since windows uses much more memory than linux when running multiprocessing)
7. lbzip2: Cannot exec: No such file or directory
It is quite unlikely for a job to fail 7 times, but since each machine has a different configuration, it is very difficult to cover all cases. That is the reason why jobs are resubmitted multiple times after failure: to be fault tolerant.
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
These tasks alternate between GPU usage and CPU usage; would it make such a big difference to use Tensor Cores for mixed precision? You would be trading precision for speed, but only speeding up the GPU phases.
I was looking at the PyTorch documentation (the Python package we use to train the AI agents, which supports using Tensor Cores for automatic mixed precision), and it says:
(if) Your network may fail to saturate the GPU(s) with work, and is therefore CPU bound. Amp’s effect on GPU performance won’t matter.
____________
|
|
|
|
You'd need to find a way to get the task loaded fully onto the GPU. The environment training that you're doing on CPU - can you do that same processing on the GPU? Probably.
____________
|
|
|
KAMasudSend message
Joined: 27 Jul 11 Posts: 137 Credit: 523,901,354 RAC: 0 Level
Scientific publications
|
These tasks alternate between GPU usage and CPU usage, would it make such a big difference to use Tensor Cores for mixed precision? You would be trading speed for precision but only speeding up the GPU phases.
I was looking at pytorch documentation (the python package we use to train the AI agents, which supports using Tensor Cores for mixed precision) for automatic-mixed-precision and it says:
(if) Your network may fail to saturate the GPU(s) with work, and is therefore CPU bound. Amp’s effect on GPU performance won’t matter.
-----------------
Thank you. |
|
|
|
I'm being curious here...
These Python apps don't seem to report their virtual memory usage accurately on my hosts. They show 7.4GB while my commit charge shows 52GB+ (with 16GB RAM).
They report more CPU time than the amount of time it actually took my hosts to finish them.
They're also causing the CPU usage to max out around 50% when there are no other CPU tasks running, no matter what my boinc manager CPU usage limit is.
Could anyone please explain this to a confused codger?
____________
"Together we crunch
To check out a hunch
And wish all our credit
Could just buy us lunch"
Piasa Tribe - Illini Nation |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
These tasks are a bit particular, because they use multiprocessing and also interleave stages of CPU utilisation with stages of GPU utilisation.
The multiprocessing nature of the tasks is responsible for the wrong CPU time (BOINC takes into account the time of all threads). That, together with the fact that the tasks use a python library for machine learning called PyTorch, accounts for the large virtual memory (every thread commits virtual memory when the package is imported, even though it is not later used).
The switching between CPU and GPU phases could be causing the CPUs to sit at 50%.
Other users have found configurations that improve resource utilisation by running more than one task; some of these configurations are shared in this forum.
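The effect on reported CPU time is easy to reproduce with a small stdlib-only sketch (illustrative, not the task's actual code): the CPU times of parallel worker processes add up, while the wall clock does not, so a summed "CPU time" can far exceed the elapsed runtime.

```python
import multiprocessing as mp
import time

def burn(cpu_seconds: float) -> None:
    """Busy-loop so this process accumulates the given amount of CPU time."""
    end = time.process_time() + cpu_seconds
    while time.process_time() < end:
        pass

def summed_vs_elapsed(n_workers: int = 4, cpu_each: float = 0.2):
    """Run n_workers CPU-burning processes; return (summed CPU, wall time)."""
    start = time.monotonic()
    procs = [mp.Process(target=burn, args=(cpu_each,)) for _ in range(n_workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # Each worker burned ~cpu_each seconds of CPU, so the summed CPU time
    # is roughly n_workers * cpu_each, even though they ran concurrently.
    return n_workers * cpu_each, time.monotonic() - start

if __name__ == "__main__":
    summed, elapsed = summed_vs_elapsed()
    # On a multi-core host the workers run in parallel, so `elapsed` can be
    # much smaller than `summed` - which is why a summed CPU-time figure
    # looks inflated next to the wall-clock runtime BOINC also shows.
    print(f"summed CPU ~{summed:.1f}s, elapsed {elapsed:.2f}s")
```

With 32 spawned agent processes, the same arithmetic explains a "3 days" CPU time on a task that only ran for 9 hours of wall clock.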
____________
|
|
|
gemini8 Send message
Joined: 3 Jul 16 Posts: 31 Credit: 2,212,787,676 RAC: 4,959,470 Level
Scientific publications
|
The multiprocessing nature of the tasks is responsible for the wrong CPU time (BOINC takes into account the time of all threads).
I don't think so.
The CPU-time should be correct, it's just that the overall runtime is faulty.
You can easily see that if you compare the runtime to the send and receive times.
____________
- - - - - - - - - -
Greetings, Jens |
|
|
|
Feliz navidad, amigos!
Odd thing about the Pythons using the GPU: they seem inconsistent in their time reporting.
I see them finish in around 10-12 hours, but the CPU time is much greater, often around 80 hours.
Looking at the properties of a running task I see:
Application
Python apps for GPU hosts 4.04 (cuda1131)
Name
e00007a01485-ABOU_rnd_ppod_expand_demos30_2_test2-0-1-RND0975
State
Running
Received
12/25/2022 1:49:40 AM
Report deadline
12/30/2022 1:49:40 AM
Resources
0.988 CPUs + 1 NVIDIA GPU
Estimated computation size
1,000,000,000 GFLOPs
CPU time
3d 00:26:17
CPU time since checkpoint
00:04:01
Elapsed time
09:17:00
Estimated time remaining
07:04:18
Fraction done
96.160%
Virtual memory size
6.91 GB
Working set size
1.66 GB
Directory
slots/0
Process ID
16952
Progress rate
10.440% per hour
Executable
wrapper_6.1_windows_x86_64.exe
________________
Notice the cpu time vs the elapsed time.
I also see that the estimate of time remaining runs ridiculously high.
Though the wrapper claims 0.988 CPUs, it actually uses up to 70% of the CPU on machines with fewer cores when nothing else is running. The more CPU time slices it can get from the available threads, the faster it seems to run, up to the maximum the wrapper can use. It also eats up as much as 50 GB of commit charge (total memory), and more.
It seems to be immune to the BOINC manager's limits on CPU usage, so it can easily peg your processor with other projects running. Setting max processor usage at 25-33% should ensure that the WUs finish within the 105,000-point bonus time frame if you are running other projects simultaneously.
Another observation I made was that dual graphics cards don't seem to work with this wrapper on my hosts. GPU1 always seemed to stay at 4% or so while GPU0's WU ran at a reduced speed.
This limitation is coincidentally identical to my experience on MLC@Home. In addition, I am seeing that very modest GPUs (1050s and such) are just as effective as the latest models at producing points when the cpu runs unconstrained. That was also noted at MLC from my experience.
____________
"Together we crunch
To check out a hunch
And wish all our credit
Could just buy us lunch"
Piasa Tribe - Illini Nation |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
This has been commented on extensively in this thread if you had read it.
The cpu_time is not calculated correctly because BOINC has no mechanism to deal with these one-of-a-kind tasks, which use machine learning and are of a dual CPU-GPU nature.
The tasks spawn 32 processes on your cpu and will use a significant amount of cpu resources and main and virtual memory.
They sporadically use your gpu in brief spurts of computation before passing computation back to the cpu.
As long as the gpu has 4 GB of VRAM, the tasks can be run on very moderate gpu hardware. |
|
|
|
You can create app_config.xml with
<app_config>
<app>
<name>PythonGPU</name>
<fraction_done_exact/>
</app>
</app_config>
It should make it display more accurate time estimation. |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
You can create app_config.xml with
<app_config>
<app>
<name>PythonGPU</name>
<fraction_done_exact/>
</app>
</app_config>
It should make it display more accurate time estimation.
for me, this worked well with all the ACEMD tasks. It does NOT work with the Pythons.
I am talking about Windows; maybe it works with Linux, no idea.
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
As I mentioned BOINC has no idea how to display these tasks because they do not fit in ANY category that BOINC is coded for.
So no existing BOINC mechanism can properly display the cpu usage or get even close with time estimations.
Does not matter whether the host is Windows, Mac or Linux based. The OS has nothing to do with the issue.
The issue is BOINC. |
|
|
|
Thanks for the tips guys. Sorry about being captain obvious there, I just rejoined this project and should have caught up on the thread before reporting my observations.
____________
"Together we crunch
To check out a hunch
And wish all our credit
Could just buy us lunch"
Piasa Tribe - Illini Nation |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
Right now: ~14,200 "unsent" Python tasks in the queue.
I guess it will take a while until they all are processed. |
|
|
|
Looks good but getting some bad WUs.
Had 3 errors in a row on the same host and thought it was something about the machine until I checked to see who else ran them. They were on their last chance runs.
____________
"Together we crunch
To check out a hunch
And wish all our credit
Could just buy us lunch"
Piasa Tribe - Illini Nation |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Could you provide the name of the task? I will take a look at the errors.
____________
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
Could you provide the name of the task? I will take a look at the errors.
In case you don't want to wait until he reads your posting for replying - look here:
http://www.gpugrid.net/results.php?hostid=602606 |
|
|
|
It seems some of them need more pagefile than usual |
|
|
|
I see that the tasks my host and some others crashed were eventually finished successfully. Sorry for jumping to conclusions.
I suspect my host had errors because it was running Mapping Cancer Markers concurrent with Python. Once I suspended WCG tasks it has run error free.
Thanks to Eric for providing the host link.
Sorry for the misinformation.
____________
"Together we crunch
To check out a hunch
And wish all our credit
Could just buy us lunch"
Piasa Tribe - Illini Nation |
|
|
|
abouh,
can you confirm the section of code that the task spends the most time on?
is it here?
while not learner.done():
learner.step()
I'm still trying to track down why AMD systems use so much more CPU than Intel systems. I even went so far as to rebuild the numpy module against MKL (yours is using the default BLAS, not MKL or OpenBLAS) and inject it into the environment package, but it made no difference, probably because numpy is barely used in the code anyway, and not in the main loop.
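For what it's worth, here's roughly how that loop can be timed with the stdlib profiler (a sketch: `DummyLearner` is a stand-in for illustration, not the real PyTorchRL learner object):

```python
import cProfile
import io
import pstats
import time

class DummyLearner:
    """Minimal stand-in mimicking the learner.done()/learner.step() API."""
    def __init__(self, steps: int):
        self.steps = steps
    def done(self) -> bool:
        return self.steps <= 0
    def step(self) -> None:
        self.steps -= 1
        time.sleep(0.001)  # simulate CPU/GPU work per training step

learner = DummyLearner(10)
prof = cProfile.Profile()
prof.enable()
while not learner.done():   # the same loop shape the task script uses
    learner.step()
prof.disable()

# Print the 5 most expensive calls by cumulative time; on a real run this
# shows whether step() time goes to data collection or the learning update.
s = io.StringIO()
pstats.Stats(prof, stream=s).sort_stats("cumulative").print_stats(5)
report = s.getvalue()
print(report)
```

On the actual task, wrapping the real `learner.step()` loop like this would show which callees inside `step()` dominate on AMD vs Intel hosts.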
____________
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
Pop Piasa wrote:
I suspect my host had errors because it was running Mapping Cancer Markers concurrent with Python. Once I suspended WCG tasks it has run error free.
I had made the same experience when I began crunching Pythons.
Best is not to run anything else.
|
|
|
theBigAlSend message
Joined: 4 Oct 22 Posts: 4 Credit: 2,242,428,633 RAC: 1,219,236 Level
Scientific publications
|
I've been running WCG (CPU only tasks though) and GPUGrid concurrently past few days and its working out fine so far. |
|
|
|
My Intel hosts seem to have no problems, only my Ryzen 5 5600X, with the same memory in all of them. That is indeed odd, because theBigAl is using the exact same processor without errors. One difference is that theBigAl is running Windows 11, whereas I have Win 10 on my host.
Erich56 is spot-on that Python likes to have the machine (or virtual machine) to itself for these integrated GPU tasks. I have seen completion times drop from around 14 hours to under 12 hours by stopping concurrent projects.
How does one get two or more of these to run with multiple gpus in a host?
I took a second card back out of one of my hosts because it just slowed it down running these.
____________
"Together we crunch
To check out a hunch
And wish all our credit
Could just buy us lunch"
Piasa Tribe - Illini Nation |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Because of the unique issue with virtual memory on Windows compared to Linux, I don't know if running more than a single task is doable, let alone running multiple gpus.
And yes, it is possible to run more than a single gpu on these tasks in Linux.
My teammate Ian has been running 3X concurrently on his 2X 3060's and now 2X RTX A4000 gpus. |
|
|
theBigAlSend message
Joined: 4 Oct 22 Posts: 4 Credit: 2,242,428,633 RAC: 1,219,236 Level
Scientific publications
|
My Intel hosts seem to have no problems, only my Ryzen5-5600X. Same memory in all of them. That is indeed odd because theBigAl is using the exact same processor without errors. one difference is that theBigAl is running Windows 11 where I have Win 10 on my host.
I don't know if it'll help, but I have allocated 100 GB of virtual memory swap for the computer, which is probably overkill, but it doesn't hurt to try if you have the space.
I'll up that to 140 GB when I receive my 3060 Ti tomorrow, and test whether it can run multiple GPU tasks on Win11 (probably not, and even if it does, it'll run a lot slower, since it'll be CPU bound then) |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
Keith Myers wrote:
Because of the unique issue with virtual memory on Windows compared to Linux, I don't know if running more than a single task is doable, let alone running multiple gpus.
On my host with 1 GTX980ti and 1 Intel i7-4930K I run 2 Pythons concurrently.
On my host with 2 RTX3070 and 1 i9-10900KF I run 4 Pythons concurrently.
On my host with 1 Quadro P5000 and 2 Xeon E5-2667 v4 I run 4 Pythons concurrently.
All Windows 10.
No problems at all (except that I don't make it below 24hours with any task)
|
|
|
|
abouh,
as a follow-up to my previous post, I think I've narrowed down the issue in your script/setup that causes unnecessarily high CPU use on newer and high-core-count hosts. I was able to reduce the CPU use from 100% to 40% and speed up task execution at the same time (due to far fewer scheduling conflicts among so many running processes). I connected with someone who understands these tools, and they helped me figure out what's wrong; I'll paraphrase their comments and findings below.
The basic answer is that the thread pool isn't configured correctly for wandb. (It's only configured for the parser, so it's likely not limiting the number of threads correctly, and there's likely a soft error somewhere.)
Lines 447 & 448
spawn threads, but don't configure them anywhere.
Line 373
defines how many thread processes will be used, but it doesn't seem to work correctly. It's defined as 16, but changing this value does nothing: on my 64-core system, 64 processes are spun up for each running task, in addition to the 32 agents spawned. A 16-core CPU will spin up 16+32 processes, and so on. Trying to run 10 concurrent tasks on my 64-core system results in a staggering 960 processes, which seems to cripple the system and slow things down.
https://docs.wandb.ai/guides/track/advanced/distributed-training
(by end of the page, shows how they are configured correctly)
Do you get the error log in the npz output? Is this sent back with the tasks? I tried to read this file but could not; it's compressed or encrypted. It may contain more information about what is set up wrong with the wandb mp pool.
I was able to work around this issue by setting environment variables to put hard limits on the number of processes used. i edited run.py at line 445 with:
NUM_THREADS = "8"
os.environ["OMP_NUM_THREADS"] = NUM_THREADS
os.environ["OPENBLAS_NUM_THREADS"] = NUM_THREADS
os.environ["MKL_NUM_THREADS"] = NUM_THREADS
os.environ["CNN_NUM_THREADS"] = NUM_THREADS
os.environ["VECLIB_MAXIMUM_THREADS"] = NUM_THREADS
os.environ["NUMEXPR_NUM_THREADS"] = NUM_THREADS
but it's not a proper fix. I added further workarounds to make this a little more persistent for myself, but it will need to be fixed by the project to fix it for everyone. The proper fix would be investigating the soft error in the error log file, with full access to the job (which we don't have, and we cannot implement proper mp without it).
You could band-aid fix it with the same edits I made to run.py, but that might cause issues on hosts with fewer than 8 threads, I guess? Or maybe it's fine, since the script launches so many processes anyway. I'm still testing to see if there's a point where fewer threads in run.py actually slows the task down; on these fast CPUs I might be able to run as few as 4.
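For anyone wanting to apply the same band-aid, a slightly more defensive version of the edit might look like this (a sketch only: `cap_numeric_threads` is my own illustrative helper, not part of the task script, and the caps only take effect if they are set before numpy/pytorch are first imported):

```python
import os
import sys

def cap_numeric_threads(n: int = 4) -> None:
    """Cap the thread pools of the common numeric backends via env vars.

    BLAS/OpenMP libraries read these variables once, at import time, so
    this must run before numpy or torch are imported or it is silently
    ignored - hence the warning below.
    """
    if "numpy" in sys.modules or "torch" in sys.modules:
        print("warning: numpy/torch already imported; thread caps may be ignored",
              file=sys.stderr)
    for var in (
        "OMP_NUM_THREADS",
        "OPENBLAS_NUM_THREADS",
        "MKL_NUM_THREADS",
        "VECLIB_MAXIMUM_THREADS",
        "NUMEXPR_NUM_THREADS",
    ):
        os.environ[var] = str(n)

# Apply the cap at the very top of the script, before heavy imports.
cap_numeric_threads(4)
```

Functionally this is the same set of variables as the run.py edit above, just wrapped so the import-order pitfall is visible.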
____________
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Thanks for keep digging into this high cpu usage bug Ian. I missed the last convos on your other thread at STH I guess.
Hope that abouh can implement a proper fix. That should increase the return rate dramatically I think. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Hello Ian,
learner.step()
is the line of code the task spends most time on. This function first handles the collection of data (CPU intensive), then takes one learning step (updating the weights of the agent's neural networks, GPU intensive).
Regarding your findings with respect to wandb, I could remove the wandb dependency. I can simply make a run.py script that does not use wandb. It is nice to have a way to log extra training information, but not at the cost of reducing task efficiency. And I get a part of that information anyway when the task comes back. I understand that simply getting rid of wandb would be the best solution right?
Thanks a lot for your help!
If that is the best solution, I will work on a run.py without wandb. I can start using it as soon as the current batch (~10,736 now) is processed
____________
|
|
|
|
Hello Ian,
learner.step()
is the line of code the task spends most time on. This function first handles the collection of data (CPU intensive), then takes one learning step (updating the weights of the agent's neural networks, GPU intensive).
Regarding your findings with respect to wandb, I could remove the wandb dependency. I can simply make a run.py script that does not use wandb. It is nice to have a way to log extra training information, but not at the cost of reducing task efficiency. And I get a part of that information anyway when the task comes back. I understand that simply getting rid of wandb would be the best solution right?
Thanks a lot for your help!
If that is the best solution, I will work on a run.py without wandb. I can start using it as soon as the current batch (~10,736 now) is processed
removing wandb could be a start, but it's also possible that it's not the sole cause of the problem.
are you able to see any soft errors in the logs from reported tasks?
do you have any higher core count (32+ cores) systems in your lab or available to test on?
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
OK, in that case I will start by removing wandb in the next batch of tasks; let's see if that improves performance. I will make a post about the submission once it is done. It will probably still take a few days, since the latest batch is still being processed.
I have access to machines with up to 32 cores for testing. I will also try setting the same environment flags, to see what happens.
NUM_THREADS = "8"
os.environ["OMP_NUM_THREADS"] = NUM_THREADS
os.environ["OPENBLAS_NUM_THREADS"] = NUM_THREADS
os.environ["MKL_NUM_THREADS"] = NUM_THREADS
os.environ["CNN_NUM_THREADS"] = NUM_THREADS
os.environ["VECLIB_MAXIMUM_THREADS"] = NUM_THREADS
os.environ["NUMEXPR_NUM_THREADS"] = NUM_THREADS
Unfortunately the error logs I get do not say much… at least I don't see any soft errors. Is there any information that could be printed from the run.py script that would help?
Regarding full access to the job, the Python package we use to train the AI agents is public and mostly based on PyTorch, in case anyone is interested (https://github.com/PyTorchRL/pytorchrl).
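For instance, run.py could periodically print a one-line summary of process and thread counts to its log (a sketch; `log_runtime_info` is a hypothetical helper, not existing code), which would show at a glance whether the thread caps are taking effect on a given host:

```python
import multiprocessing
import os
import threading

def log_runtime_info() -> str:
    """Return a one-line summary run.py could print to its stderr log.

    Printed periodically during training, this would show how many child
    processes and threads each task is really using, and whether the
    thread-cap environment variables were set in time.
    """
    info = {
        "pid": os.getpid(),
        "cpu_count": os.cpu_count(),
        "threads_in_process": threading.active_count(),
        "child_processes": len(multiprocessing.active_children()),
        "OMP_NUM_THREADS": os.environ.get("OMP_NUM_THREADS", "<unset>"),
    }
    return " ".join(f"{k}={v}" for k, v in info.items())

print(log_runtime_info())
```

Comparing that line across hosts (e.g. a 16-core vs a 64-core machine) would make the runaway process counts described above visible directly in the returned task logs.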
____________
|
|
|
|
I think I may have encountered the Linux version of the Windows virtual memory problem.
I have been concentrating on another project, where a new application is generating vast amounts of uploadable result data. They deployed a new upload server to handle this data, but it crashed almost immediately - on Christmas Eve. Another new upload server may come online tonight, but in the meantime, my hard disk has been filling up something rotten.
It's now down to below 30 GB free for BOINC, so I thought it was wise to stop that project, and do something else until the disk starts to empty. So I tried a couple of python tasks on host 132158: both failed with "OSError: [Errno 28] No space left on device", and BOINC crashed at the same time.
I'm doing some less data-intensive work at the moment, and handling the machine with kid gloves. Timeshift is implicated in a third crash, so I've been able to move that to a different drive - let's see how that goes. I'll re-test GPUGrid when things have settled down a bit, to try and confirm that virtual memory theory. |
|
|
|
There are programs that can display what files use most space on disk. For example K4DirStat |
|
|
|
ok, in that case I will start by removing wandb in the next batch of tasks. Let’s see if that improves performance. I will make a post to inform about the submission once it is done, will probably still take a few days since the latest batch is still being processed.
I have access to machines with up to 32 cores for testing. I will also try setting the same environment flags. To see what happens.
NUM_THREADS = "8"
os.environ["OMP_NUM_THREADS"] = NUM_THREADS
os.environ["OPENBLAS_NUM_THREADS"] = NUM_THREADS
os.environ["MKL_NUM_THREADS"] = NUM_THREADS
os.environ["CNN_NUM_THREADS"] = NUM_THREADS
os.environ["VECLIB_MAXIMUM_THREADS"] = NUM_THREADS
os.environ["NUMEXPR_NUM_THREADS"] = NUM_THREADS
Unfortunately the error logs I get do not say much… at least I don’t see any soft errors. Is there any information which can be printed from the run.py script that would help?
Regarding full access to the job, the python package we use to train the AI agents is public and mostly based in pytorch, in case anyone is interested (https://github.com/PyTorchRL/pytorchrl).
i'm sure if you set those same env flags, you'll get the same result I have. less CPU use and threads used for python per task based on the NUM_THREADS you set. I'm testing "4" now and it doesn't seem slower either. will need to run it a while longer to be sure.
let me get back to you if you could print some errors from within the run.py script.
and yeah, no worries about waiting for the batch to finish up. still over 9000 tasks to go.
____________
|
|
|
|
I think I may have encountered the Linux version of the Windows virtual memory problem.
I have been concentrating on another project, where a new application is generating vast amounts of uploadable result data. They deployed a new upload server to handle this data, but it crashed almost immediately - on Christmas Eve. Another new upload server may come online tonight, but in the meantime, my hard disk has been filling up something rotten.
It's now down to below 30 GB free for BOINC, so I thought it was wise to stop that project, and do something else until the disk starts to empty. So I tried a couple of python tasks on host 132158: both failed with "OSError: [Errno 28] No space left on device", and BOINC crashed at the same time.
I'm doing some less data-intensive work at the moment, and handling the machine with kid gloves. Timeshift is implicated in a third crash, so I've been able to move that to a different drive - let's see how that goes. I'll re-test GPUGrid when things have settled down a bit, to try and confirm that virtual memory theory.
probably need some more context about the system.
how much disk drive space does it have?
how much of that space have you allowed BOINC to use?
how many Python tasks are you running?
Do you have any other projects running that cause high disk use?
each expanded and running GPUGRID Python slot looks to take up about 9 GB (the 2.7 GB archive gets copied there and expanded to ~6.x GB, and the archive remains in place). So that's 9 GB per running task, plus ~5 GB for the GPUGRID project folder, depending on whether you've cleaned up old apps/archives. If your project folder is carrying lots of old apps, a project reset might be in order to clean it out.
____________
|
|
|
|
how much disk drive space does it have?
how much of that space have you allowed BOINC to use?
how many Python tasks are you running?
Do you have any other projects running that cause high disk use?
This is what BOINC sees:
It's running on a single 512 GB M.2 SSD. Much of that 200 GB is used by the errant project, and is dormant until they get their new upload server fettled.
One Python task - the other GPU is excluded by cc_config.
Some Einstein GPU tasks are just finishing. Apart from that, just NumberFields (lightweight integer maths).
Within the next half hour, the Einstein tasks will vacate the machine. I'll try one Python, solo, as an experiment, and report back. |
|
|
|
So it looks like you've set BOINC to be allowed to use the whole drive, or only 50%?
The 234 GB "used by other programs" seems odd. Are you using this system to store a large amount of personal files too? Do you know what is taking up nearly half of the drive that's not BOINC related?
If you're not aware of what's taking up that space, check /var/log/. I've had large numbers of errors fill up the syslog and kern.log files and fill the disk.
____________
|
|
|
|
The machine is primarily a BOINC cruncher, so yes - BOINC is allowed to use what it wants. I'm suspicious about those 'other programs', too - especially as my other Linux machine shows a much lower figure. The main difference between then is that I did an in-situ upgrade from Mint 20.3 to 21 not long ago, and the other machine is still at 20.3 - I suspect there may be a lot of rollback files kept 'just in case'.
And yes, I'm suspicious of the logs too - especially BOINC writing to the systemd journal, and that upgrade. Next venue for an exploration.
I've been watching the disk tab in my peripheral vision, as the test task started. 'Free space for BOINC' fell in steps through 26, 24, 22, 21, 20 as it started, and has stayed there. Now at around 10% progress / 1 hour elapsed.
Should have mentioned - machine has 64 GB of physical RAM, in anticipation of some humongous multi-threaded tasks to come.
Edit - new upload server won't be certified as 'fit for use' until tomorrow, so I've started Einstein again. |
|
|
|
What other project? |
|
|
|
What other project?
Name redacted to save the blushes of the guilty!
|
|
|
|
Looks like this was a false alarm - the probe task finished successfully, and I've started another. Must have been timeshift all along.
The nameless project is still hors de combat. The new server is alive and ready, but can't be accessed by BOINC. |
|
|
|
You mean ithena? |
|
|
|
ok, in that case I will start by removing wandb in the next batch of tasks. Let’s see if that improves performance. I will make a post to inform about the submission once it is done, will probably still take a few days since the latest batch is still being processed.
I have access to machines with up to 32 cores for testing. I will also try setting the same environment flags. To see what happens.
NUM_THREADS = "8"
os.environ["OMP_NUM_THREADS"] = NUM_THREADS
os.environ["OPENBLAS_NUM_THREADS"] = NUM_THREADS
os.environ["MKL_NUM_THREADS"] = NUM_THREADS
os.environ["CNN_NUM_THREADS"] = NUM_THREADS
os.environ["VECLIB_MAXIMUM_THREADS"] = NUM_THREADS
os.environ["NUMEXPR_NUM_THREADS"] = NUM_THREADS
Unfortunately the error logs I get do not say much… at least I don’t see any soft errors. Is there any information which can be printed from the run.py script that would help?
Regarding full access to the job, the python package we use to train the AI agents is public and mostly based in pytorch, in case anyone is interested (https://github.com/PyTorchRL/pytorchrl).
i'm sure if you set those same env flags, you'll get the same result I have. less CPU use and threads used for python per task based on the NUM_THREADS you set. I'm testing "4" now and it doesn't seem slower either. will need to run it a while longer to be sure.
let me get back to you if you could print some errors from within the run.py script.
and yeah, no worries about waiting for the batch to finish up. still over 9000 tasks to go.
4 seems to be working fine.
abouh, if removing wandb doesn't fix the problem, then adding the env variables listed above with num_threads = 4 will probably be a suitable workaround for everyone; there probably aren't many hosts with fewer than 4 threads these days.
____________
|
|
|
Ryan MunroSend message
Joined: 6 Mar 18 Posts: 33 Credit: 1,055,322,577 RAC: 8,390,045 Level
Scientific publications
|
Excuse the dumb question, but would that mean the app would only spin up 4 threads?
On Windows, I have manually capped the app at 24 threads and it uses all of them; my Linux box, capped at 6 threads, has half of its threads idling.
Both seem to take about the same time, though. What is the Windows app doing with all the threads that the Linux app does not need? |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
I have been testing the new script without wandb and with the proposed environ configuration, and it works fine. On my machine performance is similar, but I am looking forward to receiving feedback from other users.
I also need to update the PyTorchRL library (our main dependency), so my idea is to follow these steps:
1. Wait for the current batch to finish (currently 3,726 tasks).
2. Then update the PyTorchRL library.
3. Then send a small batch (20-50 tasks) with the new code to PythonGPUBeta to make sure everything works fine (I have tested locally, but in my opinion it is always worth sending a test batch to Beta).
4. Send another big batch with the new code to PythonGPU.
The app will be short of tasks for a brief period of time, but even though the new version of PyTorchRL does not have huge changes, I don't want to risk updating it now while 3000+ tasks are still in the queue.
I will make a post once I submit the Beta tasks.
____________
|
|
|
|
Can http://www.gpugrid.net/apps.php link be put next to Server status link? |
|
|
|
I have been testing the new script without wandb and the proposed environ configuration and works fine. In my machine performance is similar but looking forward to receiving feedback from other users.
I also need to update the PyTorchRL library (our main dependency), so my idea is to follow these steps:
1. Wait for the current batch to finish (currently 3,726 tasks)
2. Then I will update PyTorchRL library.
3. Following I will send a small batch (20-50) to PythonGPUBeta with the new code to make sure everything works fine (I have tested locally, but it is always worth sending a test batch to Beta in my opinion)
4. Send again a big batch with the new code to PythonGPU.
The app will be short of tasks for a brief period of time but even though the new version of PyTorchRL does not have huge changes I don't want to risk updating it now while 3000+ tasks are still on the queue.
I will make a post once I submit the Beta tasks.
thanks abouh! looking forward to testing out the new batch.
____________
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Can http://www.gpugrid.net/apps.php link be put next to Server status link?
I'd like to see this change in the website design also.
Would be much easier for access than having to manually edit the URL or find the one apps link in the main project JoinUs page. |
|
|
|
Can http://www.gpugrid.net/apps.php link be put next to Server status link?
You might want to repost that on the wish list thread so it's there when the webmaster gets around to updating the site.
I fear they may be too busy at this time. I went ahead and put a link in my browser until then.
Thanks for posting that page link. |
|
|
|
Right now: ~14,200 "unsent" Python tasks in the queue.
I guess it will take a while until they all are processed.
now down to fewer than 500. These went much quicker than I anticipated: only about 3 weeks.
____________
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
So what again is going to be the status of the expected new application?
Beta to start with?
Removal of wandb?
New nthreads value?
New job_xxx.xml file?
New compilation for Ada devices? |
|
|
Ryan MunroSend message
Joined: 6 Mar 18 Posts: 33 Credit: 1,055,322,577 RAC: 8,390,045 Level
Scientific publications
|
Will the new app be fine on 1 CPU core or will it still require many? on my Windows box atm I have to manually allocate 24 cores to the WU so it does not get starved with other projects running at the same time. |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Pretty sure you are confusing cores with processes. The app will still spin out 32 python processes. Processes are not cores.
But from testing of the modified job.xml file, the new app will probably need as few as 4 cores/threads to run. |
|
|
|
There are two separate mechanisms with this app spinning up multiple processes/threads. The fix will only reduce one of them. Since each task is training 32x agents at once, those 32 processes still spin up. The fix I helped uncover only addresses the unnecessary extra CPU usage from the n-cores extra processes spinning up. I’ve been running with those capped at 4. And it seems fine.
About Ada support, since this app is not really an “app” as it’s not a compiled binary, but just a script, it works fine with Ada already according to some other users running it on their 40-series cards. It’s the Acemd3 app that needs to be recompiled for Ada.
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
The job_xxx.xml will also remain the same, since the instructions are as simple as:
- 1. unpack the conda python environment with all required dependencies.
- 2. run the provided python script.
- 3. return result files.
So I am only changing the provided python script.
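Those steps would correspond to a BOINC wrapper job.xml roughly along these lines (just a sketch; the archive and script names are illustrative, not the project's actual files):

```xml
<job_desc>
    <task>
        <!-- 1. unpack the shipped conda environment (illustrative archive name) -->
        <application>/bin/tar</application>
        <command_line>xjf python_env.tar.bz2</command_line>
    </task>
    <task>
        <!-- 2. run the provided training script with the unpacked interpreter -->
        <application>python/bin/python</application>
        <command_line>run.py</command_line>
    </task>
    <!-- 3. result files named in the workunit template are uploaded automatically -->
</job_desc>
```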
As Ian mentioned, it is not a compiled app. The only difference is that the packed conda environment contains cuda10 (10.2.89) or cuda11 (11.3.1) depending on the host GPU.
Is that enough to support ADA GPUs?
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Only 75 jobs in the queue! Thank you all for your support :)
I imagine they will all be processed today. So as I mentioned in an earlier post, the next steps will be the following:
1- I will release a new version of our Reinforcement Learning library (https://github.com/PyTorchRL/pytorchrl), used in the python scripts to instantiate and train the AI agents.
2- I will send a small batch of PythonGPUBeta jobs with the new python script and also using the new version of the library.
3- If everything goes well, start sending PythonGPU tasks again.
I am interested in your feedback on whether or not the new script configuration is helpful in terms of efficiency. On my machine it seems to work fine.
____________
|
|
|
Ryan MunroSend message
Joined: 6 Mar 18 Posts: 33 Credit: 1,055,322,577 RAC: 8,390,045 Level
Scientific publications
|
Yea it spins up that many processes but if I leave the app at default it will get choked because Boinc will only allocate 1 thread to it and the other projects running will take up the other 31 threads.
I manually allocate it 24 threads as this is about what I observed it running when I only ran that task and nothing else, this stops it from getting choked when running multiple projects.
What I would like to see is the app download and allocate however many threads it needs to complete the task automatically without needing a custom app_config file. |
|
|
KAMasudSend message
Joined: 27 Jul 11 Posts: 137 Credit: 523,901,354 RAC: 0 Level
Scientific publications
|
Yea it spins up that many processes but if I leave the app at default it will get choked because Boinc will only allocate 1 thread to it and the other projects running will take up the other 31 threads.
I manually allocate it 24 threads as this is about what I observed it running when I only ran that task and nothing else, this stops it from getting choked when running multiple projects.
What I would like to see is the app download and allocate however many threads it needs to complete the task automatically without needing a custom app_config file.
I, second that. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
I just released the new version of the python library and sent the beta tasks.
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Is there any BOINC-specifiable WU parameter for that? I could not find one, but I would also like to avoid hosts having to manually change their configuration if possible
____________
|
|
|
|
Use this
<app_config>
  <app>
    <name>PythonGPU</name>
    <plan_class>cuda1131</plan_class>
    <gpu_versions>
      <cpu_usage>8</cpu_usage>
      <gpu_usage>1</gpu_usage>
    </gpu_versions>
    <max_concurrent>1</max_concurrent>
    <fraction_done_exact/>
  </app>
</app_config> |
|
|
Ryan MunroSend message
Joined: 6 Mar 18 Posts: 33 Credit: 1,055,322,577 RAC: 8,390,045 Level
Scientific publications
|
Just grabbed one of the beta units and it still says Running (0.999 CPUs and 1 GPU) but it seems to be fluctuating between 50% and 100% load on my 32-thread CPU.
If the app is spinning up a ton of processes that need their own threads, can the app reflect that and allocate however many threads are needed, please? For example, it should say "Running (32 CPUs and 1 GPU)" or however many it needs.
Would simplify things and I assume cut down on failed units from users who do not know the app spins up more than one process and run it on a single thread with other apps taking up the remainder.
Thanks
Edit: after an initial 100% utilisation spike, it's now settled down at around 30-40% CPU utilisation. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
But this is on the client side.
On the server side I see I can adjust these parameters for a given app: https://boinc.berkeley.edu/trac/wiki/JobIn
I am open to implement both solutions:
1- Force from the server side that hosts have more than 1 CPU, 4-8 for example (the tasks spawn 32 Python threads, but 32 CPUs are not required to run them successfully). That is assuming it is possible; so far I could not find any option on the server to specify it.
2- Specify that 32 processes are being created. I can add it to the logs, but where else can I mention it so users are aware?
____________
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
I don't see any parameter in the jobin page that allocates the number of cpus the task will tie up.
I don't know how the cpu resource is calculated. Must be internal in BOINC.
Richard Haselgrove probably knows the answer.
It varies among projects I've noticed. I think it is calculated internally in BOINC based on client benchmarks rating and the rsc_fpops_est value the work generator assigns tasks.
The user has been able to override the project default values with their own values via the app_config mechanism.
But these values don't actually control how an app runs. Only the science app determines how much resources the task takes.
The cpu_usage value is only a way to help the client determine how many tasks can be run for scheduling purposes and how much work should be downloaded.
I'm currently running one of the beta tasks and it either runs faster or the workunit is smaller than normal. Probably the latter being beta.
I notice 3 processes running run.py on the task along with the 32 spawned processes. I don't remember the previous app spinning up more than the one run.py process.
I wonder if the 3 run.py processes are tied into my <cpu_usage>3.0</cpu_usage> setting in my app_config.xml file. |
|
|
|
I notice 3 processes running run.py on the task along with the 32 spawned processes. I don't remember the previous app spinning up more than the one run.py process.
I wonder if the 3 run.py processes are tied into my <cpu_usage>3.0</cpu_usage> setting in my app_config.xml file.
as you said earlier in your comment, the cpu_use only tells BOINC how much is being used. it does not exert any kind of "control" over the application directly.
the previous tasks spun up a run.py child process for every core. these would be linked to the parent process. you can see them in htop.
I have not been able to get any of these beta tasks myself (i got some very early morning before I got up, but they errored because of my custom edits) to see what might be going on. but there might be a problem with them still, some other users that got them seem to have errored.
____________
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
I reset the project on all hosts prior to the release of the beta tasks to start with a clean slate.
I have one of the beta tasks running well so far. 6.5 hrs in so far at 75% completion.
GPUGRID 1.12 Python apps for GPU hosts beta (cuda1131) e00001a00027-ABOU_rnd_ppod_expand_demos29_betatest-0-1-RND7327_1 06:22:55 (15:21:33) 240.67 79.210 78d,21:06:03 1/30/2023 3:14:52 AM 0.998C + 1NV (d0) Running High P. Darksider
I looked at this task in htop and it is different than before. I am not talking about the 32 spawned python processes. I was referring to 3 separate run.py process PIDs that are using about 20% CPU each besides the main one.
I hadn't configured my app_config.xml for the PythonGPUbeta before I picked up the task so I ended up with the default 0.998C core usage value rather than my normal 3.0 cpu value I have for the regular Python on GPU tasks. |
|
|
|
what you're showing in your screenshot is exactly what I saw before. the "green" processes are representative of the child processes. before, you would have a number of child threads equal to the number of cores: on my 16-core system there would be 16 children, on the 24-core system there were 24 children, on the 64-core system there were 64 children. and so on, for each running task.
if you move the selected line by pushing the down arrow or select one of the child processes with the cursor, you should see the top line as white text, which is the parent main process. this is all normal.
check my screenshots from this message: https://www.gpugrid.net/forum_thread.php?id=5233&nowrap=true#59239
____________
|
|
|
|
I don't see any parameter in the jobin page that allocates the number of cpus the task will tie up.
I don't know how the cpu resource is calculated. Must be internal in BOINC.
Richard Haselgrove probably knows the answer.
You're right - it doesn't belong there. It will be set in the <app_version>, through the plan class - see https://boinc.berkeley.edu/trac/wiki/AppPlanSpec.
And to amplify Ian's point - not only does BOINC not control the application, it merely allows the specified amount of free resource in which to run. |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Since these apps aren't proper BOINC MT or multi-threaded apps using a MT plan class, you wouldn't be using the <max_threads>N [M]</max_threads> parameter.
Seems like the proper parameter to use would be the <cpu_frac>x</cpu_frac> one.
Do you concur? |
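For reference, such a plan-class entry in plan_class_spec.xml might look roughly like this (a sketch based on the AppPlanSpec wiki page; the driver version and cpu_frac value are illustrative guesses, not the project's actual settings):

```xml
<plan_class>
    <name>cuda1131</name>
    <gpu_type>nvidia</gpu_type>
    <cuda/>
    <min_driver_version>45080</min_driver_version>
    <!-- rough fraction of the work done on the CPU; the scheduler
         uses it to derive the CPU usage advertised to clients -->
    <cpu_frac>0.1</cpu_frac>
</plan_class>
```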
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
A bunch of the standard Python 4.03 versioned tasks have been going out and erroring out. I've had five so far today.
Problems in the main learner step with the pytorchrl packages.
https://www.gpugrid.net/result.php?resultid=33268830 |
|
|
|
Maybe this might help Abou with the scripting, I'm too green at Python to know.
https://stackoverflow.com/questions/58666537/error-while-feeding-tf-dataset-to-fit-keyerror-embedding-input |
|
|
KAMasudSend message
Joined: 27 Jul 11 Posts: 137 Credit: 523,901,354 RAC: 0 Level
Scientific publications
|
I was allocated two tasks of "ATM: Free energy calculations of protein-ligand binding v1.11 (cuda1121)" and both of them were cancelled by the server in transmission. What are these tasks about and why were they cancelled? |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
The researcher cancelled them because they recognized a problem with how the package was put together and the tasks would fail.
So better to cancel them in the pipeline rather than waste download bandwidth and the cruncher's resources.
You can thank them for being responsible and diligent. |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
I successfully ran one of the beta Python tasks after the first cruncher errored out the task.
https://www.gpugrid.net/result.php?resultid=33268305 |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
The beta tasks were of the same size as the normal ones. So if they run faster hopefully the future PythonGPU tasks will too.
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Thank you very much for pointing it out. Will look at the error this morning!
____________
|
|
|
|
finally got some more beta tasks and they seem to be running fine. and now limited to only 4 threads on the main run.py process.
but i did notice that VRAM use has increased by about 30%. tasks are now using more than 4GB on the GPU, before it was about 3.2GB. was that intentional?
are these beta tasks going to be the same as the new batch? beta is running fine but the small amount of new ones going out non-beta seem to be failing.
____________
|
|
|
KAMasudSend message
Joined: 27 Jul 11 Posts: 137 Credit: 523,901,354 RAC: 0 Level
Scientific publications
|
I never blamed anyone. Just asked a question for my own knowledge. Anyway, Thank you. Now I wish I could get a task. |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
...
but i did notice that VRAM use has increased by about 30%. tasks are now using more than 4GB on the GPU, before it was about 3.2GB. was that intentional?
...
this is definitely bad news for GPUs with 8GB VRAM, like the two RTX3070 in my case. Before, I could run 2 tasks each GPU. It became quite tight, but it worked (with some 70-100MB left on the GPU the monitor is connected to).
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Yes, these latest beta tasks use a little bit more GPU memory. The AI agent has a bigger neural network. Hope it is not too big and most machines can still handle it.
What about number of threads? Is it any better?
I also fixed the problems with the non-beta (queue was empty but I guess some failed jobs were added again to it after the new software version was released). Let me know if more errors occur please.
____________
|
|
|
|
i have 4 of the beta tasks running. the number of threads looks good. using 4 threads per task as specified in the run.py script.
i just got an older non-beta task resend, and it's working fine so far (after I manually edited the threads again).
but the setup with the beta tasks seems viable to push out to non-beta now.
about VRAM use: so far, it seems they use about 3800MB when they first get going, but it rises over time. at about 50% run completion the tasks are up to ~4600MB each. not sure how high they will go. the old tasks did not show this behavior of VRAM increasing as the task progressed. is it necessary? or is it leaking and not cleaning up?
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Great! very helpful feedback Ian thanks.
Since the scripts seem to run correctly I will start sending tasks to PythonGPU app with the current python script version.
In parallel, I will look into the VRAM increase by running a few more tests in PythonGPUbeta. I don't think it is a memory leak, but maybe there is a way to use memory more efficiently in the code. I will dig a bit into it. Will post updates on that.
____________
|
|
|
Ryan MunroSend message
Joined: 6 Mar 18 Posts: 33 Credit: 1,055,322,577 RAC: 8,390,045 Level
Scientific publications
|
Got one of the new betas; it's using about 28% on average of my 16-core 5950X in Windows 11, so roughly 9 threads? |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
The scripts still spawn 32 Python threads. But I think before, with wandb and maybe without some environment variables fixed, even more were spawned.
However, note that 32 cores are not necessary to run the scripts. I am not sure what the optimal number is, but it is much lower than 32.
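The environment-variable fix mentioned here is presumably along these lines: capping the thread pools that numerical libraries create in each worker, set before those libraries are imported (a sketch; the variable names are the common OpenMP/MKL/OpenBLAS ones, not necessarily exactly what run.py sets):

```python
import os

def cap_math_threads(n=4):
    # Must run before numpy/torch are first imported in each worker,
    # otherwise every one of the 32 workers may start a thread per core.
    for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
        os.environ[var] = str(n)

cap_math_threads(4)
```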
____________
|
|
|
Ryan MunroSend message
Joined: 6 Mar 18 Posts: 33 Credit: 1,055,322,577 RAC: 8,390,045 Level
Scientific publications
|
Yea it definitely uses less overall CPU time than before; I've capped the apps at 10 cores now, which seems like the sweet spot to allow me to also run other apps. |
|
|
KAMasudSend message
Joined: 27 Jul 11 Posts: 137 Credit: 523,901,354 RAC: 0 Level
Scientific publications
|
task 33269102
Eight times and it was a success. Does someone want to Post Mortem it? |
|
|
|
Great! very helpful feedback Ian thanks.
Since the scripts seem to run correctly I will start sending tasks to PythonGPU app with the current python script version.
In parallel, I will look into the VRAM increase by running a few more tests in PythonGPUbeta. I don't think it is a memory leak, but maybe there is a way to use memory more efficiently in the code. I will dig a bit into it. Will post updates on that.
seeing up to 5.6GB VRAM use per task. but it doesn't seem consistent. some tasks will go up to ~4.8GB, others 4.5GB, etc. there doesn't seem to be a clear pattern to it.
the previous tasks were very consistent and always used exactly the same amount of VRAM.
____________
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
yesterday I downloaded and started 2 Pythons on my box with the Intel Xeon E5 2667v4 (2 CPUs) and the Quadro P5000 inside.
What I realized after some time was that the progress bars in the BOINC manager became more and more different.
Finally, one task got finished after 24 hrs + a few minutes (how nice, thus missing the <24 hrs bonus); the other task is now at 29.92%.
What I notice now, with only this one task running, is: no GPU utilization, just CPU.
Any idea how come?
I guess this task is invalid and I should abort it, right?
BTW: with the other task which worked fine I could not see any increasing VRAM usage. It stayed at about 3.5GB all time long. |
|
|
|
Old low(er) VRAM use tasks are still going out.
The old tasks have “test” in the WU name, and have the same VRAM use, and high CPU use as before.
The new tasks have “exp” in the name, have less CPU used, but more VRAM.
And the new windows app could be acting differently than the Linux version.
____________
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
thanks for the hint regarding "old" and "new" tasks.
The 2 which I downloaded yesterday were "new" ones with "exp" in the name.
Right now, I have 4 new ones running in parallel (I was surprised that they were downloaded while the server status page has been showing "0 unsent" for quite a while).
According to the Windows task manager, they seem to run well, although I cannot tell for sure at this early point whether they all use the GPU. I will be able to tell better from the progress bar after some more time (at least one looks suspicious at this time).
VRAM use at this point is 9,180 MB (including whatever the monitor uses). |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
According to the Windows task manager, they seem to run well, although I cannot tell for sure at this early point whether they all use the GPU. I will be able to tell better from the progress bar after some more time (while at least one looks suspicious at this time).
is there any other way to find out whether a task is using the GPU at all, except for watching the BOINC Manager progress bar for a while and comparing to each other the progress of the individual running tasks?
|
|
|
|
as a reference, this is what it's looking like running 3 tasks on 4x A4000s. a good amount of variance in VRAM use. not consistent and I'm not sure if it increases over time, or some tasks just require more than others. but definitely more than before and different behavior than before.
____________
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
is there any other way to find out whether a task is using the GPU at all, except for watching the BOINC Manager progress bar for a while and comparing to each other the progress of the individual running tasks?
Yes, use nvidia-smi which is installed by the Nvidia drivers.
It is located here in Windows.
C:\Program Files\NVIDIA Corporation\NVSMI
Just open a command window and navigate there and execute:
nvidia-smi |
|
|
|
is there any other way to find out whether a task is using the GPU at all, except for watching the BOINC Manager progress bar for a while and comparing to each other the progress of the individual running tasks?
Yes, use nvidia-smi which is installed by the Nvidia drivers.
It is located here in Windows.
C:\Program Files\NVIDIA Corporation\NVSMI
Just open a command window and navigate there and execute:
nvidia-smi
he might look here too; the location you gave is reported to exist only on older installs.
C:\Windows\System32\DriverStore\FileRepository\nvdm*\nvidia-smi.exe
i think he needs to include the extension. but yes.
nvidia-smi.exe
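if watching progress bars gets tedious, the same nvidia-smi query can be scripted. a small Python sketch (it returns None instead of failing when nvidia-smi is not on the PATH):

```python
import shutil
import subprocess

def gpu_busy(device_index=0):
    """True if any compute process is using the given GPU,
    False if none, None when nvidia-smi cannot be found."""
    exe = shutil.which("nvidia-smi") or shutil.which("nvidia-smi.exe")
    if exe is None:
        return None
    out = subprocess.run(
        [exe, "--query-compute-apps=pid", "--format=csv,noheader",
         "--id=" + str(device_index)],
        capture_output=True, text=True,
    ).stdout.strip()
    # any listed PID means a compute task is running on that GPU
    return bool(out)
```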
____________
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
thank you very much, folks, for your help with the Nvidia-SMI.
BTW, on my host it is located here: C:\Program Files\NVIDIA Corporation\NVSMI
However, what I get is "access denied", even when opening the command window as administrator. No idea what the problem is.
But anyway, having been able to watch the progress bar in the BOINC Manager, by now I can clearly tell the following:
just to explain how I started out with the Pythons last year when they were introduced:
I spoofed the GPU which gave me the ability to run 4 Pythons simultaneously.
With the hardware:
Intel Xeon E5 2667v4 (2 CPUs) and the Quadro P5000 (16GB VRAM) and 256GB system RAM
this was performing fine, over all the months.
Now, when running 4 tasks simultaneously, I notice that the 2 tasks running on "device 0" are about 3 times faster than the 2 tasks running on "device 1".
Which seems to indicate very clearly that the 2 tasks on "device 1" are NOT utilizing the GPU.
Since I made no changes to the hardware, the software, or any relevant settings compared to before, the reason for this behaviour must be related to the code of the new Pythons :-( All 4 tasks are "new" ones.
Or does anyone have any other ideas?
BTW: on the other host with the 2 RTX3070 inside, so far I got downloaded and started 3 Pythons, however they are from the "old" series. And all three are running with the same speed, i.e. utilizing the GPUs. |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
further on my posting above:
I just want to point out that the same problem exists even if only 2 of the new Pythons are crunched simultaneously (one on "device 0", the other on "device 1") - see my posting here:
https://www.gpugrid.net/forum_thread.php?id=5233&nowrap=true#59825 |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
thank you very much, folks, for your help with the Nvidia-SMI.
BTW, on my host it is located here: C:\Program Files\NVIDIA Corporation\NVSMI
However, what I get is "access denied", even when opening the command window as administrator. No idea what the problem is.
The access denied is obviously a permission issue. I don't know how to view the properties of a file in Windows. Maybe right-click? Does that show you who "owns" the file?
Windows probably has the same ownership options or close enough to Linux where a file has permissions at the system level, the group level and the user level.
Maybe the Windows version of nvidia-smi.exe belongs to a Nvidia group which the local user is not a member. Maybe investigate adding the user to the Nvidia group to see if that changes whether the file can be executed. |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
thank you, Keith, for your reply re the Nvidia-SMI. I will investigate further tomorrow.
However, by now, looking at the progress bars, it seems evident enough that the new Python version obviously has a problem with spoofed GPUs. Either by design or by accident. Maybe abouh can tell more |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
for the time being, I excluded "device 1" from GPUGRID via setting in the cc_config.xml
So, when downloading Pythons tasks next time, only 2 should come in and be processed by "device 0" (with app_config.xml setting "0.5 GPU usage").
Further, I guess I could not process 4 tasks (of the new type) simultaneously anyway, as I can see from the currently running 2 tasks that they are using 12,367 MB VRAM. So not even 1 additional task would work, with the GPU having 16 GB VRAM.
On the other host with the 2 RTX3070 (8 GB VRAM ea.) on which I ran 4 tasks in parallel before, I will now have the problem that only 1 task per GPU can be processed, due to the higher VRAM need. Which is a pity :-(
And I guess even GPUs with 12 GB VRAM may NOT be able to process 2 new Pythons simultaneously. |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
this is one of the Pythons which had only CPU utilization, but NOT GPU utilization. So I aborted it after several hours.
https://www.gpugrid.net/result.php?resultid=33271058
Does the stderr show by any chance why the GPU was not utilized?
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Hello Erich,
By design, an environment variable defines which GPU the task is supposed to use (in run.py line 429):
os.environ["CUDA_VISIBLE_DEVICES"] = os.environ["GPU_DEVICE_NUM"]
Then, the PyTorchRL package tries to detect the specified GPU, and otherwise uses the CPU. So if no GPU is detected, what you mention can happen: the CPU is used instead and the task progresses much more slowly.
What I can do is add an additional logging message to the run.py scripts that will display whether or not the GPU device was detected. So we will know for sure.
Furthermore, I have found a way to reduce the GPU memory requirements at least a bit. I will start using it in the newly submitted tasks.
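Put together, the device selection plus the proposed detection message could look something like this (a sketch built around the quoted run.py line; the torch check is guarded so the snippet also runs on hosts without PyTorch, and the function name is illustrative):

```python
import os
import sys

def select_and_report_gpu():
    # BOINC hands the assigned device index to the task via GPU_DEVICE_NUM;
    # restricting CUDA_VISIBLE_DEVICES makes it the only GPU PyTorch can see.
    os.environ["CUDA_VISIBLE_DEVICES"] = os.environ.get("GPU_DEVICE_NUM", "0")
    try:
        import torch  # only needed for the detection message
        n_gpus = torch.cuda.device_count() if torch.cuda.is_available() else 0
    except ImportError:
        n_gpus = 0
    # logged to stderr so it shows up in the task's stderr output on the site
    sys.stderr.write(f"Detected GPUs: {n_gpus}\n")
    return "cuda" if n_gpus > 0 else "cpu"
```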
____________
|
|
|
|
Thanks abouh!
____________
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
hello abouh,
thanks for your quick reply.
So, as it seems, the situation is such that the tasks from new Python version detect the "real" GPU ("device 0") but do NOT detect the spoofed GPU ("device 1"), for what reason ever.
In the former Python version, both GPUs were detected without any problem.
However, I now found a workaround which also works well:
I excluded "device 1" in the cc_config.xml of BOINC, and I set the GPU usage to "0.3" in the app_config.xml of GPUGRID.
This enables running 3 Pythons simultaneously. In theory, I could run even 4 tasks by setting the GPU usage to "0.25", but from what I can see now, with 3 tasks running, the VRAM is filled to 16,307 MB out of the 16,384 MB VRAM size.
The progress of the 3 tasks at this moment is 38%, 24% and 22% (they were downloaded at different times), so I can only hope that VRAM utilization will not increase any more in the course of task processing.
On another host, I have two Pythons running in parallel, with total VRAM use of 6,125 MB out of 6,144 MB available :-)
So, if you mention that you found a way to reduce VRAM requirements a little bit, this will definitely help :-) |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Abouh,
This task series and all its brethren have a configuration error and they all are failing very fast.
https://www.gpugrid.net/result.php?resultid=33273094
I've chalked up over 40 errors today and all the wingmen are failing the series in the same way.
File "run.py", line 97, in main
demo_dtypes={prl.OBS: np.uint8, prl.ACT: np.int8, prl.REW: np.float16, "StateEmbeddings": np.int16},
TypeError: create_factory() got an unexpected keyword argument 'state_embeds_to_avoid' |
|
|
|
Keith, you need to remove your "tweaking". it's trying to replace the run.py script workaround thing that we were doing before. the old run.py script is not compatible with the new tasks.
you must have forgotten to reset the project on this one host. your other hosts have run the new tasks OK.
i have many of the new tasks running just fine.
and memory use is improved. thanks abouh.
____________
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Keith, you need to remove your "tweaking". it's trying to replace the run.py script workaround thing that we were doing before. the old run.py script is not compatible with the new tasks.
you must have forgotten to reset the project on this one host. your other hosts have run the new tasks OK.
i have many of the new tasks running just fine.
and memory use is improved. thanks abouh.
Nope. Absolutely NOT the case. The run.py is the one provided by the project.
Look at the link I provided, every other wingman is failing the task also. Along with all the other failed tasks.
I'm damn sure I reset the project. Resetting again.
|
|
|
|
Your stderr output from the failed task in your link clearly indicates that it copied the run.py file. Or was still trying to.
13:00:27 (3925992): wrapper: running /bin/cp (/home/keith/Desktop/BOINC/projects/www.gpugrid.net/run.py run.py)
13:00:28 (3925992): /bin/cp exited; CPU time 0.000962
The only way it would be doing that is if you're still running my edited file; a project reset would have erased that and replaced it with the standard version.
The other hosts that failed, failed for different reasons. They got unlucky and hit hosts with incompatible GPUs.
____________
|
|
|
|
Is there any way to reduce the estimated remaining time showing in the manager on these?
I'm seeing 20+ days left when there is really 10 hours and I can't download the next task until the previous is well past 90%. That's around an hour or so in advance as they are finishing in 9-12 hrs. on my hosts.
Setting my task que longer only gets me the server message:
Tasks won't finish in time: BOINC runs 100.0% of the time; computation is enabled 99.9% of that
It appears that the server sets completion times based on the average among completed WU run times. Seeing that Pythons misreport the run time (which must be equal to or greater than the CPU time) it is logical that the estimated future completion times would reflect the inflated CPU time figures.
Is there a local manager config fix for that, anyone? Multi gratis |
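For illustration, when the client bases its remaining-time estimate purely on the reported fraction done, rather than on inflated historical averages, the arithmetic is simple. This is only a sketch of the idea, not BOINC's actual source:

```python
def remaining_time(elapsed_sec: float, fraction_done: float) -> float:
    """Estimate seconds left purely from the reported fraction done
    (a sketch of progress-based extrapolation, not BOINC's real code)."""
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_sec * (1.0 - fraction_done) / fraction_done

# A task 25% done after 3 hours extrapolates to ~9 hours remaining.
print(remaining_time(3 * 3600, 0.25) / 3600)  # prints 9.0
```

With an estimate driven by progress instead of the server's averaged (CPU-time-inflated) figures, the "20+ days" display collapses to something close to the real 9-12 hour run times.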
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Hello Keith,
I also think for some reason your machine ran the old run.py file. Maybe the error was on the server side and the old script was provided for some reason, but I went and checked the files in task e00004a01419-ABOU_rnd_ppod_expand_demos29_exp1-0-1-RND6419_2 and the error should not be present.
Also, as I mentioned in a previous post, I added an extra log message to check if a GPU is detected:
sys.stderr.write(f"Detected GPUs: {gpus_detected}\n")
This message is only printed in the new runs, and it is not present in the one you shared. For example, see this one:
https://www.gpugrid.net/result.php?resultid=33273691
____________
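The post does not show how gpus_detected is computed; as a hypothetical illustration (the task's actual run.py may do this differently), such a count could come from PyTorch, the project's main framework, falling back to zero when CUDA is unavailable:

```python
import sys

def count_gpus() -> int:
    """Return the number of CUDA GPUs visible to PyTorch, or 0 if
    PyTorch/CUDA is unavailable. Hypothetical sketch only; the real
    run.py may compute gpus_detected another way."""
    try:
        import torch
        return torch.cuda.device_count() if torch.cuda.is_available() else 0
    except ImportError:
        return 0

gpus_detected = count_gpus()
sys.stderr.write(f"Detected GPUs: {gpus_detected}\n")
```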
|
|
|
KAMasudSend message
Joined: 27 Jul 11 Posts: 137 Credit: 523,901,354 RAC: 0 Level
Scientific publications
|
Is there any way to reduce the estimated remaining time showing in the manager on these?
I'm seeing 20+ days left when there is really 10 hours and I can't download the next task until the previous is well past 90%. That's around an hour or so in advance as they are finishing in 9-12 hrs. on my hosts.
Setting my task queue longer only gets me the server message:
Tasks won't finish in time: BOINC runs 100.0% of the time; computation is enabled 99.9% of that
It appears that the server sets completion times based on the average among completed WU run times. Seeing that Pythons misreport the run time (which must be equal to or greater than the CPU time) it is logical that the estimated future completion times would reflect the inflated CPU time figures.
Is there a local manager config fix for that, anyone? Multi gratis
___________________
I will agree with Pop. The same thing is going on on my machine. |
|
|
|
abouh, are you planning to release another large batch of the new tasks?
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Yes! The experiment I am currently running has a population of 1000 agents (so it keeps the number of submitted tasks at 1000 by sending a new task every time one ends, until a certain global goal is reached).
I am planning to start another 1000 agent experiment, probably tomorrow.
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Unfortunately I think the best reference is the progress %. I don't know if that is of much help to calculate when a task will end, but the progress increase should be constant as long as the machine load is also constant throughout the task.
____________
|
|
|
|
Thanks abouh,
When there is a steady flow of tasks from Grosso the window of time a host needs to secure a new WU and maintain constant production shrinks drastically.
There is no need then to try to download the next task until the present task is almost finished, so the estimate of time remaining becomes of little concern to me. |
|
|
|
Unfortunately I think the best reference is the progress %. I don't know if that is of much help to calculate when a task will end, but the progress increase should be constant as long as the machine load is also constant throughout the task.
And that can be easily utilised by setting
fraction_done_exact
if set, base estimates of remaining time solely on the fraction done reported by the app.
in app_config.xml. It wobbles a bit at the beginning, but soon settles down. |
|
|
KAMasudSend message
Joined: 27 Jul 11 Posts: 137 Credit: 523,901,354 RAC: 0 Level
Scientific publications
|
Unfortunately I think the best reference is the progress %. I don't know if that is of much help to calculate when a task will end, but the progress increase should be constant as long as the machine load is also constant throughout the task.
And that can be easily utilised by setting
fraction_done_exact
if set, base estimates of remaining time solely on the fraction done reported by the app.
in app_config.xml. It wobbles a bit at the beginning but soon settles down.
______________
Richard, could you point me in the direction of app_config.xml. Where is it? Second, are we not playing around a bit too much? No other project requires us to play. Unless we are up to mischief and try to run multiple WU's at the same time and when they start crashing, blame others. |
|
|
|
It's documented in the User manual, specifically at:
https://boinc.berkeley.edu/wiki/Client_configuration#Project-level_configuration |
|
|
KAMasudSend message
Joined: 27 Jul 11 Posts: 137 Credit: 523,901,354 RAC: 0 Level
Scientific publications
|
It's documented in the User manual, specifically at:
https://boinc.berkeley.edu/wiki/Client_configuration#Project-level_configuration
_____________
That I can find but not on my computer unless it is hidden. |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
_____________
That I can find but not on my computer unless it is hidden.
the app_config.xml is not there automatically, that's why you won't find it.
You need to write it yourself, e.g. with a text editor such as Notepad, then save it as "app_config.xml" in the GPUGRID project folder within the BOINC data folder (contained in the ProgramData folder).
To put the app_config.xml into effect after the above steps, open the "Options" menu in the BOINC Manager and click "Read config files" once. |
|
|
KAMasudSend message
Joined: 27 Jul 11 Posts: 137 Credit: 523,901,354 RAC: 0 Level
Scientific publications
|
_____________
That I can find but not on my computer unless it is hidden.
the app_config.xml is not there automatically, that's why you won't find it.
You need to write it yourself, e.g. with a text editor such as Notepad, then save it as "app_config.xml" in the GPUGRID project folder within the BOINC data folder (contained in the ProgramData folder).
To put the app_config.xml into effect after the above steps, open the "Options" menu in the BOINC Manager and click "Read config files" once.
_______________________________
Is this Boinc in only one place?
OS(C)
Program files
Boinc
Locale
Skins
Boinc
boinc_logo_black
Boinccmd
Boincmgr
Boincscr
boincsvcctrl
boinctray
ca-bundle
COPYING
COPYRIGHT
liberationSans-Regular.
This is all I can find in the BOINC folder: no GPUGrid folder or any other project folder.
Unless, Boinc is in two places like in the old days. |
|
|
|
Thank you Richard Hazelgrove,
It's documented in the User manual, specifically at:
https://boinc.berkeley.edu/wiki/Client_configuration#Project-level_configuration
I was unaware of that info site. Will cure my ignorance.
Many thanks.
____________
"Together we crunch
To check out a hunch
And wish all our credit
Could just buy us lunch"
Piasa Tribe - Illini Nation |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
_______________________________
Is this Boinc in only one place?
OS(C)
Program files
Boinc
.,..
This is all I can find in the Boinc folder. no GPUGrid folder or any other project folder.
Unless, Boinc is in two places like in the old days.
You are in the wrong folder.
BOINC still is in two places.
You have to navigate to C:/ProgramData/BOINC/projects/GPUGRID
|
|
|
|
Hi KAMasud,
The folder mentioned by Erich56 here...
You have to navigate to C:/ProgramData/BOINC/projects/GPUGRID
...is a hidden folder in windows so you must choose to show hidden folders in the file preferences to access it. (just in case you might not know that and can't see it in the program manager)
[Edit] The target folder for the appconfig.xml file is actually
C:\ProgramData\BOINC\projects\www.gpugrid.net
on my hosts
Hope that helped. |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
[Edit] The target folder for the appconfig.xml file is actually
C:\ProgramData\BOINC\projects\www.gpugrid.net
yes, this is true. Sorry for the confusion.
However, on my system (Windows 10 Pro) this folder is NOT a hidden folder.
|
|
|
KAMasudSend message
Joined: 27 Jul 11 Posts: 137 Credit: 523,901,354 RAC: 0 Level
Scientific publications
|
Hi KAMasud,
The folder mentioned by Erich56 here...
You have to navigate to C:/ProgramData/BOINC/projects/GPUGRID
...is a hidden folder in windows so you must choose to show hidden folders in the file preferences to access it. (just in case you might not know that and can't see it in the program manager)
[Edit] The target folder for the appconfig.xml file is actually
C:\ProgramData\BOINC\projects\www.gpugrid.net
on my hosts
Hope that helped.
____________________
Thank you, Pop. After this update from Microsoft, Windows has become ____. It needs Administrator privileges for everything, even though it is my private computer. Yesterday I marked "show hidden folders" and it promptly hid them back. Today I unhid them and told it "stay", good doggy. I found the second folder, in which I found the projects folders. Thank you everyone.
Fat32 was better in some ways. I have done what you all wanted me to do but years ago. Maybe two decades ago.
Erich, Richard, thank you.
Apple products are user repair unfriendly. Laptops are becoming repair unfriendly and now, Microsoft, is going the same way. |
|
|
|
I've been away from GPUGrid for a while...
Is there a way to control the number of spawned threads?
I've tried to modify the line:
<setenv>NTHREADS=$NTHREADS</setenv> in linux_job.###########.xml file to
<setenv>NTHREADS=8</setenv> but it made no difference.
The task was started with the original NTHREADS setting.
Is it the reason for no change in the number of spawned threads, or I should modify something else? |
|
|
|
I've been away from GPUGrid for a while...
Is there a way to control the number of spawned threads?
I've tried to modify the line:
<setenv>NTHREADS=$NTHREADS</setenv> in linux_job.###########.xml file to
<setenv>NTHREADS=8</setenv> but it made no difference.
The task was started with the original NTHREADS setting.
Is it the reason for no change in the number of spawned threads, or I should modify something else?
there is no reason to do this anymore. they already fixed the overused CPU issue. it's now capped at 4x CPU threads and hard coded in the run.py script. but that is in addition to the 32 threads for the agents. there is no way to reduce that unless abouh wanted to use fewer agents, but i don't think he does at this time.
if you want to run python tasks, you need to account for this and just tell BOINC to reserve some extra CPU resources by setting a larger value for the cpu_usage in app_config. i use values between 8-10. but you can experiment with what you are happy with. on my python dedicated system, I stop all other CPU projects as that gives the best performance.
____________
|
|
|
|
Good to see Zoltan here again, welcome back!😀
~~~~~~~~~~~~
I need to correct what I reported on the program data folder to KAMasud earlier. The folder is not hidden (as Erich56 noted) but is a system folder, so in windows I've had to enable access to system files and folders on a new install in order to see it. Just in case you're still having trouble. |
|
|
|
they already fixed the overused CPU issue. it's now capped at 4x CPU threads and hard coded in the run.py script. but that is in addition to the 32 threads for the agents. there is no way to reduce that unless abouh wanted to use less agents, but i don't think he does at this time.
I am enjoying watching abouh gain prowess at scripting with each run, using less and less resources as they evolve. Real progress. Godspeed to abouh and crew. |
|
|
|
Is there a way to control the number of spawned threads?
there is no reason to do this anymore.
My reason to reduce their numbers is to run two tasks at the same time to increase GPU usage, because I need the full heat output of my GPUs to heat our apartment. As I saw in "Task Manager", the CPU usage of the spawned tasks drops when I start the second task (my CPU doesn't have that many threads).
Could the GPU usage be increased somehow?
it's now capped at 4x CPU threads and hard coded in the run.py script. but that is in addition to the 32 threads for the agents.
there is no way to reduce that ...
I confirm that. I looked into that script, though I'm not very familiar with python. I've even tried to modify num_env_processes in conf.yaml, but this file gets overwritten every time I restart the task, even though I removed the write permission of the boinc user and the boinc group on that file. :)
if you want to run python tasks, you need to account for this and just tell BOINC to reserve some extra CPU resources by setting a larger value for the cpu_usage in app_config. i use values between 8-10. but you can experiment with what you are happy with. on my python dedicated system, I stop all other CPU projects as that gives the best performance.
That's clear, I did that. |
|
|
KAMasudSend message
Joined: 27 Jul 11 Posts: 137 Credit: 523,901,354 RAC: 0 Level
Scientific publications
|
Good to see you Zoltan. |
|
|
KAMasudSend message
Joined: 27 Jul 11 Posts: 137 Credit: 523,901,354 RAC: 0 Level
Scientific publications
|
Good to see Zoltan here again, welcome back!😀
~~~~~~~~~~~~
I need to correct what I reported on the program data folder to KAMasud earlier. The folder is not hidden (as Erich56 noted) but is a system folder, so in windows, I've had to enable access to system files and folders on a new install in order to see it. Just in case you're still having trouble.
Pop, there used to be two Program folders as I remember. Program and Program 32. Now there is a hidden Program System folder. Three in all. |
|
|
|
Is there a way to control the number of spawned threads?
there is no reason to do this anymore.
My reason to reduce their numbers is to run two tasks at the same time to increase GPU usage, because I need the full heat output of my GPUs to heat our apartment. As I saw in "Task Manager", the CPU usage of the spawned tasks drops when I start the second task (my CPU doesn't have that many threads).
Could the GPU usage be increased somehow?
If you need the heat output of the GPU, then you need to run a different project. Or only run ACEMD3 tasks when they are available. You will not get it from the Python tasks in their current state.
You can increase the GPU use by adding more tasks concurrently. But not to the extent that you expect or need. I run 4x tasks on my A4000s but they still don’t even have full utilization. Usually only like 40% and ~100W avg power draw. Two tasks aren’t gonna cut it for increasing utilization by any substantial amount.
____________
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
Good to see you Zoltan.
+1
|
|
|
|
...I need the full heat output of my GPUs to heat our apartment...
It's been a bit chilly in my basement "computer lab/mancave" running these this winter, but I'm saving power($) so I'm bearing it. I just hope they last into summer so I can stay cool here in the humid Mississippi river valley of Illinois.
I've had some success running Einstein GPU tasks concurrently with Pythons and saw full GPU usage, although there is of course a longer completion time for both tasks.
|
|
|
|
If you need the heat output of the GPU, then you need to run a different project.
I came to that conclusion, again.
Or only run ACEMD3 tasks when they are available.
I caught 2 or 3, that's why I put 3 hosts back on GPUGrid.
You will not get it [the full GPU heat output] from the Python tasks in their current state.
That's regrettable, but it could be OK for me this spring.
My main issue with the python app is that I think there's no point running that many spawned (training) threads, as their total (combined) memory access operations cause a massive amount of CPU L3 cache misses, hindering each other's performance.
Before I put my i9-12900F host back on GPUGrid, I ran 7 TN-Grid tasks + 1 FAH GPU task simultaneously on that host; the average processing time was 4080-4200 sec for the TN-Grid tasks.
Now I run 1 GPUGrid task + 1 TN-Grid task simultaneously, and the processing time of the TN-Grid task went up to 4660-4770 sec. Compared to the 6 other TN-Grid tasks plus a FAH task, the GPUGrid python task causes a 14% performance loss.
You can see the change in processing times for yourself here.
If I run only 1 TN-Grid task (no GPU tasks) on that host, the processing time is 3800 seconds. Compared to that, running a GPUGrid python task causes a 22% performance loss.
Perhaps this app should do a short benchmark of the given CPU it's actually running on to establish the ideal number of training threads, or give some control of that number for the advanced users like me :) to do that benchmarking of their respective systems. |
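A quick calculation with the midpoints of the reported ranges reproduces figures of this magnitude (the quoted 22% corresponds to the lower end of the 4660-4770 range; midpoints give about 24%):

```python
def slowdown_pct(baseline_sec: float, observed_sec: float) -> float:
    """Percent increase in processing time relative to a baseline."""
    return (observed_sec / baseline_sec - 1.0) * 100.0

with_python = (4660 + 4770) / 2  # TN-Grid time alongside a GPUGrid python task
with_fah    = (4080 + 4200) / 2  # TN-Grid time alongside 6 TN-Grid + 1 FAH task
alone       = 3800               # TN-Grid time with no GPU task at all

print(round(slowdown_pct(with_fah, with_python)))  # prints 14
print(round(slowdown_pct(alone, with_python)))     # prints 24
```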
|
|
|
I don't think you understand what the intention of the researcher is here. he wants 32 agents and the whole experiment is designed around 32 agents. and agent training happens on the CPU, so each agent needs its own process. you can't just arbitrarily reduce this number without the researcher making the change for everyone. it would fundamentally change the research. you could only reduce the number of agents with a new/different experiment.
or make MASSIVE changes to the code to push it all into the GPU, but likely most GPUs wouldn't have enough VRAM to run it and everyone would be complaining about that instead.
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Hello everyone,
this is exactly correct: agents collect data from their interaction with the environment (running on the CPU), and the data is subsequently used to update the neural network that controls action selection (on the GPU).
Having multiple agents allows us to collect data in parallel, speeding up training.
____________
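As a toy illustration of that pattern (not the project's actual code; collect_rollout and the dummy environment are invented for this sketch), several worker processes can each step an environment on a CPU core and return transitions for a central learner:

```python
from multiprocessing import Pool
import random

def collect_rollout(seed: int, steps: int = 50):
    """One 'agent' interacting with a dummy environment on a CPU core.
    Stand-in for the real RL environments; hypothetical sketch only."""
    rng = random.Random(seed)
    # Each transition: (observation, action, reward)
    return [(rng.random(), rng.randrange(4), rng.random()) for _ in range(steps)]

if __name__ == "__main__":
    n_agents = 4  # the real tasks run 32 agent processes
    with Pool(n_agents) as pool:
        rollouts = pool.map(collect_rollout, range(n_agents))
    # The combined batch would feed one neural-network update (on the GPU).
    batch = [t for rollout in rollouts for t in rollout]
    print(len(batch))  # prints 200 (4 agents x 50 steps)
```

Because each agent process mostly waits on environment simulation, the GPU only works in bursts during the update step, which is consistent with the low GPU utilization people report in this thread.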
|
|
|
Ryan MunroSend message
Joined: 6 Mar 18 Posts: 33 Credit: 1,055,322,577 RAC: 8,390,045 Level
Scientific publications
|
I think I am going a bit mad. I set the app_config file to use 0.33 GPU to try to get more units running at the same time, then remembered 2 is the max. However, with this config, running 2 seemed to go faster: units completed 25% in about 3 hours, and normally I think the units take a lot longer than this.
I will need to take a week or so to double-check this though.
What's the optimal config at the moment? This is my current one:
<app_config>
<app>
<name>PythonGPU</name>
<gpu_versions>
<cpu_usage>8</cpu_usage>
<gpu_usage>0.5</gpu_usage>
</gpu_versions>
</app>
</app_config>
|
|
|
|
Ryan, here's what works for me:
<app_config>
<app>
<name>PythonGPU</name>
<max_concurrent>1</max_concurrent>
<fraction_done_exact/>
<gpu_versions>
<gpu_usage>0.5</gpu_usage>
<cpu_usage>1</cpu_usage>
</gpu_versions>
</app>
<app>
<name>acemd3</name>
<max_concurrent>2</max_concurrent>
<fraction_done_exact/>
<gpu_versions>
<gpu_usage>0.5</gpu_usage>
<cpu_usage>1</cpu_usage>
</gpu_versions>
</app>
<project_max_concurrent>2</project_max_concurrent>
<report_results_immediately/>
</app_config>
You can change the numbers whenever ACEMDs are available and allow them to run concurrent with a Python.
You will need to adjust the CPU figures to match your present appconfig.
(Many thanks Richard Hazelgrove, for helping me upthread) |
|
|
Ryan MunroSend message
Joined: 6 Mar 18 Posts: 33 Credit: 1,055,322,577 RAC: 8,390,045 Level
Scientific publications
|
Thanks. Is 1 CPU per python unit enough? What times are you getting per unit? When I run 8 threads per unit and other tasks on the spare threads, my CPU is always running at 100%. |
|
|
KAMasudSend message
Joined: 27 Jul 11 Posts: 137 Credit: 523,901,354 RAC: 0 Level
Scientific publications
|
It is not about how many threads your machine has, it is about how many tasks you can run alongside a Python. I have a six-core, twelve-thread CPU but can only run three Einstein WUs, and my CPU peaks at 82%. A fine balancing act is required, and sometimes a GPUGrid WU arrives and I have to suspend other work.
I have also (sometimes) reached the limit of my 16 GB RAM. These AI WUs seem to be outdoing us. Monitoring is also required. Pop will explain. |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Anybody else getting sent Python tasks for the old 1121 app? I have been using the newer 1131 app and it has worked fine on all tasks.
I don't even have the old 1121 app anymore since I did a project reset to use the new python job file for reduced cpu usage.
The 1121 app tasks are instant erroring out. |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
Anybody else getting sent Python tasks for the old 1121 app?
not so far
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Based on the number of _x issues of these tasks and everyone else erroring out, must be a scheduler issue. |
|
|
|
I've received some of them so far. they fail within like 10 seconds.
looks like someone at the project put the old v4.01 linux app up. these seem not compatible with the new experiment. I'm guessing someone enabled that application by accident.
abouh, you probably need to pull this app version back down to prevent it from being sent out. and leave the working v4.03 up.
____________
|
|
|
|
is 1 CPU per python unit enough?
Ryan, you have a professional market CPU so I can't tell you from experience. Also, I haven't experimented with the CPU figures much yet.
I run 1 Python at a time because my hosts are limited in comparison to yours.
Seeing your host it looks to me like you can run 2 Pythons simultaneously.
(Perhaps Erich56 might share how he manages his very capable i-9 windows host.)
what times are you getting per unit?
When left to run with no competition for CPU time, my hosts finish a Python task in somewhere between 9 and 12 hrs., depending on the host's CPU.
I've found that running either a CPU task or a second GPU task along side of a Python slows it down noticeably, adding an hour or two to the observed run time. This is quite acceptable in my opinion if running one of the ACEMD tasks concurrently, whenever they're available. |
|
|
|
Anybody else getting sent Python tasks for the old 1121 app?
...
The 1121 app tasks are instant erroring out. I had four. All have failed on my host, but one of them finished on the 7th resend.
Edit: because that was the 1131 app. |
|
|
|
Anybody else getting sent Python tasks for the old 1121 app?
...
The 1121 app tasks are instant erroring out. I had four. All have failed on my host, but one of them finished on the 7th resend.
notice that the host that finished it was with the working v4.03 app. not the troublesome v4.01.
the problem is the app that gets assigned to the task, not the task itself.
the v4.01 linux app needs to be pulled from the apps list so the scheduler stops trying to use it.
____________
|
|
|
|
i've aborted probably about 100 of these tasks getting assigned the bad 4.01 app.
hopefully someone from the project notices these posts to take it down soon.
____________
|
|
|
|
Does anyone have problems running gpugrid with latest windows update?
[Version 10.0.22621.1265]
I had to revert it. |
|
|
|
i've aborted probably about 100 of these tasks getting assigned the bad 4.01 app.
Ian, I've noticed that you had sent back a couple of the tasks I finished. I thought you were doing as I do and aborting those that won't finish in 24 hrs before they start.
I am guessing that the error somehow doesn't affect the app on Windows. I wish I knew why. |
|
|
|
i've aborted probably about 100 of these tasks getting assigned the bad 4.01 app.
Ian, I've noticed that you had sent back a couple of the tasks I finished. I thought you were doing as I do and aborting those that won't finish in 24 hrs before they start.
I am guessing that the error somehow doesn't affect the app on Windows. I wish I knew why.
the error is not with the script or task configuration at all.
the problem is the application version that the project is sending.
Windows only has one app version, v4.04. Windows hosts will not see a problem with this.
Linux used to have only one also, v4.03 which works fine. but something happened a few days ago where the project put up the old v4.01 app for linux from 2021. the scheduler will try to send this app randomly to compatible hosts (any app currently able to run cuda 1131 can also run 1121, so it will send one or the other by chance). this is the problem. it's randomly sending some tasks assigned with the v4.01 app which is not compatible with these newer tasks.
https://gpugrid.net/apps.php
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
It is so weird that suddenly jobs are sent to the wrong app version. But you are right: I checked some jobs, and for some reason they were sent to the wrong version... The error is the following, right?
application ./gpugridpy/bin/python missing
I did not change the run.py script's code in the last 2-3 weeks and definitely did not change the scheduler. I also asked the project admins and they said the scheduler had not been changed.
I know there has been some development recently and a new app has been deployed (ATM), but I would not expect this to affect the PythonGPU app. I will do some digging today; hopefully I can find what happened.
____________
|
|
|
|
I've been away for a few days, concentrating on another project, and came back to this. I still have the v4.03 files (although I'd reset away the v4.01 files).
So, experimentally, I allowed new work, and suspended the single task issued before it had finished downloading. I got task 33308822 - a _6 resend issued with a new copy of the v4.01 files.
So, I stopped BOINC, and carefully edited client_state.xml: the version number to 403 in both <workunit> and <result>, and the plan_class to 1131 in <result> (three changes in all). It's running normally now: we'll see what happens when it reports in about 8 hours time.
Edit: the _5 replication (task 33308656) was issued as version 4.03, but failed because file pythongpu_x86_64-pc-linux-gnu__cuda1131.tar.bz2 couldn't be found. That needs to be checked on the server - are the app_version files still there? |
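Richard's three manual edits could also be scripted. A hypothetical sketch, assuming the plan-class strings are cuda1121/cuda1131 as the file names in this thread suggest (always stop the BOINC client first and keep a backup of client_state.xml):

```python
def retarget_app_version(xml_text: str) -> str:
    """Rewrite v4.01/cuda1121 references to v4.03/cuda1131, mimicking the
    manual client_state.xml edit described above. Sketch only; the exact
    tags in your client_state.xml should be checked before editing."""
    replacements = {
        "<version_num>401</version_num>": "<version_num>403</version_num>",
        "<plan_class>cuda1121</plan_class>": "<plan_class>cuda1131</plan_class>",
    }
    for old, new in replacements.items():
        xml_text = xml_text.replace(old, new)
    return xml_text

# Toy snippet standing in for the relevant part of client_state.xml
snippet = ("<result><version_num>401</version_num>"
           "<plan_class>cuda1121</plan_class></result>")
print(retarget_app_version(snippet))
```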
|
|
Aurum Send message
Joined: 12 Jul 17 Posts: 401 Credit: 16,755,010,632 RAC: 220,113 Level
Scientific publications
|
I it [sic] so weird that suddenly jobs are sent to [sic] the wrong app version
I haven't run python WUs in a while, but when I started them today I first got a pair of 4.01s that both failed and had this message:
==> WARNING: A newer version of conda exists. <==
current version: 4.8.3
latest version: 23.1.0
Please update conda by running
$ conda update -n base -c defaults conda
The next WUs that replaced them were 4.03s and are running fine. Not sure how to check if I now have 23.1.0 installed. |
|
|
|
It is so weird that suddenly jobs are sent to the wrong app version. But you are right: I checked some jobs, and for some reason they were sent to the wrong version... The error is the following, right?
application ./gpugridpy/bin/python missing
I did not change the run.py script's code in the last 2-3 weeks and definitely did not change the scheduler. I also asked the project admins and they said the scheduler had not been changed.
I know there has been some development recently and a new app has been deployed (ATM) but I would not expect this to affect the PythonGPU app. I will do some digging today, hopefully I can find what happened.
It’s nothing wrong with your scripts.
You need to remove the app version 4.01 from the server apps list. So it’s not an option to choose.
____________
|
|
|
|
My second machine is coming free soon, so I've downloaded a task for that one, too.
That's arrived as v4.03, so no editing necessary. If the later app has now been given top priority (as it should have been all along), that's fine by me. I agree that v4.01 should be deprecated off the apps page, but it's a less urgent task - they may still need it as evidence for the post-mortem, while they're trying to work out what went wrong. |
|
|
|
task 33308822
has finished and has been deemed to be valid. So if it happens again, and you still have the v4.03 files, changing the version numbers is a valid option.
|
|
|
|
Good thing I checked. Just got allocated two brand new tasks, created today, and they both came allocated to v4.01
I didn't manage to reach the first in time, and it errored (as expected). I did catch the second, modified it as before, and it's running under v4.03
The beginnings of a suspicion are forming in my mind, and I'll check it when the second machine is ready for another fetch. |
|
|
|
probably would be more effective to just rename/replace the job setup files (jobs.xml, and zipped package). then set <dont_check_file_sizes>. this way it will call what it thinks is the 4.01 files, but it's really calling the 4.03 files. and you wont need to be constantly stopping BOINC to edit the client state each time.
but I'm just going to keep aborting stuff until the project figures out how to de-publish the bad app. I'm not sure what the hold up or confusion is there. they publish and remove apps all the time, and I've explained the issue several times. all they need to do is remove 4.01 from the apps list. they should know exactly how to do this.
____________
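For reference, the <dont_check_file_sizes> flag mentioned above is a client configuration option; a minimal cc_config.xml sketch (placed in the BOINC data directory, then re-read via "Read config files"):

```xml
<cc_config>
  <options>
    <!-- skip file-size verification so swapped-in files
         are not re-downloaded; use with care -->
    <dont_check_file_sizes>1</dont_check_file_sizes>
  </options>
</cc_config>
```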
|
|
|
|
It would be easier to simply delete the v4.01 <app_version> and clone the v4.03 section. Then it's just a couple of one-character changes to the version number and the plan class.
I'll try that when there's no GPUGrid task running, and I've got time to think. |
|
|
|
Well, no new Python tasks this morning, but I've got a couple of resends.
The first, on host 508381, came through as v4.03, and is running normally.
The second, on host 132158, came through as v4.01, so I tried the "cloned <app_version>" trick. That's running fine, too. But the scheduler sent a whole new <app_version> segment with the task, so I fear the cloning will be undone by the next task issued.
There seems to be no rhyme nor reason to it. Take a look at the tasks for the most recent host that failed for the first resend: host 602633. That one's been sent v4.01 and v4.03 seemingly at random - which blows the theory I was trying to dream up out of the water. If there's no coherent pattern to what should be a deterministic process, I'm not surprised the project team are stumped. But the answer has to stay the same: KILL OFF v4.01 FOR GOOD. |
|
|
|
The second, on host 132158, came through as v4.01, so I tried the "cloned <app_version>" trick. That's running fine, too. But the scheduler sent a whole new <app_version> segment with the task, so I fear the cloning will be undone by the next task issued
that's exactly why I suggested to replace the archive and job.xml files with the ones from the 4.03 app (along with the dont_check_file_sizes flag), so you don't have to keep editing the client state file. with replacing the package files instead, it thinks it already has the 4.01 files and uses them unaware that they are really the 4.03 files.
but yes, what really needs to happen is the removal of 4.01 from the project side.
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
I have asked the project admins to deprecate version 4.01 and 4.02. Sorry for the delay, I could not do it myself.
I am not sure what caused the sudden change, but I hope it is fixed now. Please let me know if the problem continues and I will try to solve it.
Happy weekend to everyone!
____________
|
|
|
|
Thanks abouh! I see that the v4.01 app is now gone from the applications page, so that should solve the issue for everyone :)
I see Python tasks are winding down. do you have another experiment lined up to last over the weekend?
____________
|
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Yes, over the weekend I will review the results of the 2 experiments that just finished and start new ones. The idea is to continue as until now, with two populations of 1000 agents (tasks) each.
____________
|
|
|
|
And thanks from me, too. That went very smoothly, and allocation of v4.03 hasn't been disturbed. Another resend has arrived for processing when this one finishes, without manual intervention. |
|
|
Aurum Send message
Joined: 12 Jul 17 Posts: 401 Credit: 16,755,010,632 RAC: 220,113 Level
Scientific publications
|
==> WARNING: A newer version of conda exists. <==
current version: 4.8.3
latest version: 23.1.0
Please update conda by running
$ conda update -n base -c defaults conda
Does anyone know if I need to install Miniconda and/or Anaconda to satisfy this error message?
E.g.: https://conda.io/projects/conda/en/latest/user-guide/install/linux.html
My Linux Mint Synaptic Package Manager can't find any program containing "conda."
Maybe this is just something for the server-side staff but then why post an error message to confuse crunchers? |
|
|
|
Maybe this is just something for the server-side staff but then why post an error message to confuse crunchers?
It's not an error, it's simply a warning - information, if you like.
The project supply the conda package (which is why Mint doesn't know about it), and they're obviously happy with the version they're using. You don't need to do anything. |
|
|
|
==> WARNING: A newer version of conda exists. <==
current version: 4.8.3
latest version: 23.1.0
Please update conda by running
$ conda update -n base -c defaults conda
Does anyone know if I need to install Miniconda and/or Anaconda to satisfy this error message?
E.g.: https://conda.io/projects/conda/en/latest/user-guide/install/linux.html
My Linux Mint Synaptic Package Manager can't find any program containing "conda."
Maybe this is just something for the server-side staff but then why post an error message to confuse crunchers?
even if you installed it, it wouldn't change anything and you'd get the same warning message. as Richard wrote, these tasks use their own environment. they do not use your locally installed conda at all, which is why they work on systems that have no conda installed at all. this is all by design, to avoid any version conflicts or dependencies on the local system. it has been this way from the beginning.
additionally, this message was only present when trying to run the old/incompatible 4.01 app. you do not get that message from the correct 4.03 app. 4.01 was re-published by accident and is an app version about 1.5 years old. it is not compatible with the design/structure/requirements of how these tasks function today. the project admins have removed this version, so you won't see this problem again.
____________
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
what catches my eye:
the Pythons which I got downloaded within the past 2 days seem to use a lot less system memory than the ones before.
Has Abouh made any changes to this effect? |
|
|
KAMasudSend message
Joined: 27 Jul 11 Posts: 137 Credit: 523,901,354 RAC: 0 Level
Scientific publications
|
❸ Name Abouh Meaning
Exceptional qualities that make this name special are negotiation skills and an amazing sense of tact. When developed these two assets will help you get what you want and achieve all goals.
Cooperation is a key aspect of your life!
Since you are far less successful in life if you do not find a level of unity with others. Most problems that people find too tricky to solve are often no match against your ingenuity.
Harmony in your surrounding is a key to happiness and feelings of relaxation. Having friends or family fight affects you greatly in a negative sense. That is why you have the reputation of being a peacemaker(only out of necessity).
At your best you become very kind hearted, charming and full of positive energy. An amazing person to spend time with. |
|
|
feriSend message
Joined: 31 Mar 20 Posts: 2 Credit: 139,952,008 RAC: 0 Level
Scientific publications
|
hi all,
...i've been contributing to GPUgrid with 1 gtx1080ti since 2020, on a 4-core non-HT cpu
since the mostly GPU-based acemd tasks, i see a lot has changed regarding the effective HW requirements,
so i'm wondering what is a somewhat optimal HW setup for the current python tasks, or what is potentially the biggest current bottleneck?
-CPU core/thread #?
-RAM size?
-RAM speed vs latency preference?
-SSD speed?
...i concur that ECC RAM is very much needed for long runtimes/nonstop usage.
___________
Frank from Slovakia, EU |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
For Python, you don't need to worry about anything other than having 8-10 cpu cores to support the task.
Enough system memory, at least 16GB.
Enough virtual swap space, at least 50GB.
Enough VRAM on the gpu, at least 4-6GB. |
|
|
|
feri, if I might add to Keith Myers' excellent synopsis: the speed at which these tasks run appears more dependent upon CPU ability than GPU ability. You might want to consider that if you are thinking about assembling a host dedicated to running pythons and have an old GTX 1060 6GB, or something else with sufficient VRAM (GTX 1650), lying around.
|
|
|
feriSend message
Joined: 31 Mar 20 Posts: 2 Credit: 139,952,008 RAC: 0 Level
Scientific publications
|
..a friend of mine actually has a gtx1060 6GB lying around
thanks for the insights |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Look at Richard Haselgrove's results with 6GB GTX 1060's |
|
|
|
Look at Richard Haselgrove's results with 6GB GTX 1060's
They're GTX 1660s, but 6 GB is right. They run fine on a setting of 3 CPUs + 1 GPU - a bit over 8 hours for the current jobs. |
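The "3 CPUs + 1 GPU" setting can be expressed in an app_config.xml in the project directory — a sketch, assuming the app name is PythonGPU (check the <app_name> in your client_state.xml):

```xml
<app_config>
  <app>
    <name>PythonGPU</name>
    <gpu_versions>
      <!-- one task per GPU, budgeted at 3 CPU threads each -->
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>3.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
```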
|
|
RellesSend message
Joined: 1 Nov 17 Posts: 2 Credit: 22,888,075 RAC: 121 Level
Scientific publications
|
I've noticed that on the same computer (with dual boot), tasks finish almost twice as fast on Ubuntu compared to Windows. I've tried running tasks on Linux only a few days ago and did so on Windows before.
Has there been any recent change or do tasks just compute more efficiently on Linux? |
|
|
|
They have always been faster on Linux
____________
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
... tasks finish almost twice as fast on Ubuntu compared to Windows. I've tried running tasks on Linux only a few days ago and did so on Windows before.
They have always been faster on Linux
that's correct. What surprises me though is that tasks finish almost twice as fast. I don't think that this was true before, was it?
|
|
|
RellesSend message
Joined: 1 Nov 17 Posts: 2 Credit: 22,888,075 RAC: 121 Level
Scientific publications
|
Close to 10 hours are needed on Windows and almost six on Linux. I also find the difference striking, that's why I asked |
|
|
|
Anyone having problems getting the ATM tasks to upload? I have 4 completed jobs on 3 machines trying to upload and have not been able to make contact for nearly a day now. Two tasks on one machine are making that device unable to get any more work. |
|
|
GregerSend message
Joined: 6 Jan 15 Posts: 76 Credit: 24,192,102,249 RAC: 13,992,829 Level
Scientific publications
|
Got several ATMs stuck in upload; there are now 2 days left to the deadline. |
|
|
|
I've been watching the ATMs on the linux hosts (since they won't run on my windoze machines) to find a stderr file of a finished WU to study the linux output (while I try to learn it).
I haven't found one. Only "in progress"; most show previous failures which vary from host to host. I'd be interested to see a completed one if anybody can post a link.
Thanks. |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
I had a couple of the ATM's finish successfully a week ago, but long cleared from the database for anyone to look at. |
|
|
GregerSend message
Joined: 6 Jan 15 Posts: 76 Credit: 24,192,102,249 RAC: 13,992,829 Level
Scientific publications
|
Here is a completed one, Pop Piasa:
https://www.gpugrid.net/result.php?resultid=33327466 |
|
|
|
Thanks Greger, it's good to have a successful example to compare with when examining errors. I appreciate it.
|
|
|
KAMasudSend message
Joined: 27 Jul 11 Posts: 137 Credit: 523,901,354 RAC: 0 Level
Scientific publications
|
Windows here. You know, sometimes these WUs go to sleep, then I click the mouse and it starts running again. Not all WUs.
task 33333635 |
|
|
|
Maybe you can change system power settings?
Disable spinning down hard drive for example? |
|
|
|
My recent results uploaded to GPUGRID often got "Error while computing" and lost all credit. I don't know why; what should I do?
33359888 27429785 604308
17 Mar 2023 | 13:14:49 UTC 19 Mar 2023 | 14:23:21 UTC
Error while computing 50,964.34 50,964.34 ---
Python apps for GPU hosts v4.04 (cuda1131)
19/3/2023 17:37:41 | | Starting BOINC client version 7.20.2 for windows_x86_64
19/3/2023 17:37:41 | | log flags: file_xfer, sched_ops, task
19/3/2023 17:37:41 | | Libraries: libcurl/7.84.0-DEV Schannel zlib/1.2.12
19/3/2023 17:37:41 | | Data directory: C:\ProgramData\BOINC
19/3/2023 17:37:41 | |
19/3/2023 17:37:41 | | CUDA: NVIDIA GPU 0: NVIDIA GeForce RTX 3060 (driver version 531.18, CUDA version 12.1, compute capability 8.6, 12288MB, 12288MB available, 12738 GFLOPS peak)
19/3/2023 17:37:41 | | OpenCL: NVIDIA GPU 0: NVIDIA GeForce RTX 3060 (driver version 531.18, device version OpenCL 3.0 CUDA, 12288MB, 12288MB available, 12738 GFLOPS peak)
19/3/2023 17:37:41 | | Windows processor group 0: 20 processors
19/3/2023 17:37:41 | | Host name: NGcomputer
19/3/2023 17:37:41 | | Processor: 20 GenuineIntel 12th Gen Intel(R) Core(TM) i7-12700F [Family 6 Model 151 Stepping 2]
19/3/2023 17:37:41 | | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush acpi mmx fxsr sse sse2 ss htt tm pni ssse3 fma cx16 sse4_1 sse4_2 movebe popcnt aes f16c rdrandsyscall nx lm avx avx2 tm2 pbe fsgsbase bmi1 smep bmi2
19/3/2023 17:37:41 | | OS: Microsoft Windows Vista: Home Premium x64 Edition, Service Pack 2, (06.00.6002.00)
19/3/2023 17:37:41 | | Memory: 15.76 GB physical, 63.76 GB virtual
19/3/2023 17:37:41 | | Disk: 952.93 GB total, 700.19 GB free
19/3/2023 17:37:41 | | Local time is UTC +8 hours
19/3/2023 17:37:41 | | No WSL found.
19/3/2023 17:37:41 | | VirtualBox version: 7.0.6
|
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
You have to look at the errored task results on the website to find why you errored.
Two of the tasks errored out because you don't have enough virtual memory available for the expansion phase where the task sets up its libraries.
On Windows it is advised to set up your system page file for at least 50GB size. |
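For those who prefer the command line over the System Properties dialog, one way is via wmic from an elevated command prompt — a sketch, not a definitive recipe; the 51200 MB figure follows the 50 GB advice above, and you should verify the pagefile path on your own system:

```
:: Disable automatic page file management, then pin C:\pagefile.sys at 50 GB.
:: Run from an administrator command prompt; sizes are in megabytes.
wmic computersystem where name="%computername%" set AutomaticManagedPagefile=False
wmic pagefileset where name="C:\\pagefile.sys" set InitialSize=51200,MaximumSize=51200
:: Reboot afterwards for the new size to take effect.
```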
|
|
|
You have to look at the errored task results on the website to find why you errored.
Two of the tasks errored out because you don't have enough virtual memory available for the expansion phase where the task sets up its libraries.
On Windows it is advised to set up your system page file for at least 50GB size.
Thank you very much
|
|
|
KAMasudSend message
Joined: 27 Jul 11 Posts: 137 Credit: 523,901,354 RAC: 0 Level
Scientific publications
|
The server status shows WU's are available but my machines have received no task since yesterday. |
|
|
abouhProject administrator Project developer Project tester Project scientist Send message
Joined: 31 May 21 Posts: 200 Credit: 0 RAC: 0 Level
Scientific publications
|
Hello!
The previous population experiment ended and I needed to analyse the results.
But I am starting a new experiment today.
____________
|
|
|
|
I don't understand why my task fail, why ?
Name e00002a04604-ABOU_rnd_ppod_expand_demos29_2_exp5-0-1-RND9901_1
Workunit 27434170
Created 20 Mar 2023 | 22:42:54 UTC
Sent 21 Mar 2023 | 0:31:58 UTC
Received 21 Mar 2023 | 11:05:25 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status 195 (0xc3) EXIT_CHILD_FAILED
Computer ID 604308
Report deadline 26 Mar 2023 | 0:31:58 UTC
Run time 13,385.72
CPU time 13,385.72
Validate state Invalid
Credit 0.00
Application version Python apps for GPU hosts v4.04 (cuda1131)
Stderr output
<core_client_version>7.20.2</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
08:32:00 (19880): wrapper (7.9.26016): starting
08:32:00 (19880): wrapper: running .\7za.exe (x pythongpu_windows_x86_64__cuda1131.txz -y)
7-Zip (a) 22.01 (x86) : Copyright (c) 1999-2022 Igor Pavlov : 2022-07-15
Scanning the drive for archives:
1 file, 1976180228 bytes (1885 MiB)
Extracting archive: pythongpu_windows_x86_64__cuda1131.txz
--
Path = pythongpu_windows_x86_64__cuda1131.txz
Type = xz
Physical Size = 1976180228
Method = LZMA2:22 CRC64
Streams = 1523
Blocks = 1523
Cluster Size = 4210688
Everything is Ok
Size: 6410311680
Compressed: 1976180228
08:34:12 (19880): .\7za.exe exited; CPU time 107.906250
08:34:12 (19880): wrapper: running C:\Windows\system32\cmd.exe (/C "del pythongpu_windows_x86_64__cuda1131.txz")
08:34:13 (19880): C:\Windows\system32\cmd.exe exited; CPU time 0.000000
08:34:13 (19880): wrapper: running .\7za.exe (x pythongpu_windows_x86_64__cuda1131.tar -y)
7-Zip (a) 22.01 (x86) : Copyright (c) 1999-2022 Igor Pavlov : 2022-07-15
Scanning the drive for archives:
1 file, 6410311680 bytes (6114 MiB)
Extracting archive: pythongpu_windows_x86_64__cuda1131.tar
--
Path = pythongpu_windows_x86_64__cuda1131.tar
Type = tar
Physical Size = 6410311680
Headers Size = 19965952
Code Page = UTF-8
Characteristics = GNU LongName ASCII
Everything is Ok
Files: 38141
Size: 6380353601
Compressed: 6410311680
08:35:04 (19880): .\7za.exe exited; CPU time 9.515625
08:35:04 (19880): wrapper: running C:\Windows\system32\cmd.exe (/C "del pythongpu_windows_x86_64__cuda1131.tar")
08:35:05 (19880): C:\Windows\system32\cmd.exe exited; CPU time 0.000000
08:35:05 (19880): wrapper: running python.exe (run.py)
Windows fix executed.
Detected GPUs: 1
Define environment factory
Define algorithm factory
Define storage factory
Define scheme
Created CWorker with worker_index 0
Created GWorker with worker_index 0
Created UWorker with worker_index 0
Created training scheme.
Define learner
Created Learner.
Detected memory leaks!
Dumping objects ->
..\api\boinc_api.cpp(309) : {16368} normal block at 0x0000025533011E30, 8 bytes long.
Data: < 4U > 00 00 D1 34 55 02 00 00
..\lib\diagnostics_win.cpp(417) : {15114} normal block at 0x000002553306C260, 1080 bytes long.
Data: < > D8 1D 00 00 CD CD CD CD 8C 01 00 00 00 00 00 00
..\zip\boinc_zip.cpp(122) : {550} normal block at 0x0000025532FFBE70, 260 bytes long.
Data: < > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
{536} normal block at 0x0000025532FFEC80, 52 bytes long.
Data: < r > 01 00 00 00 72 00 CD CD 00 00 00 00 00 00 00 00
{531} normal block at 0x0000025533009E00, 43 bytes long.
Data: < p > 01 00 00 00 70 00 CD CD 00 00 00 00 00 00 00 00
{526} normal block at 0x000002553300A940, 44 bytes long.
Data: < a 3U > 01 00 00 00 00 00 CD CD 61 A9 00 33 55 02 00 00
{521} normal block at 0x000002553300A080, 44 bytes long.
Data: < 3U > 01 00 00 00 00 00 CD CD A1 A0 00 33 55 02 00 00
Object dump complete.
16:26:14 (3936): wrapper (7.9.26016): starting
16:26:14 (3936): wrapper: running python.exe (run.py)
Windows fix executed.
Detected GPUs: 1
Define environment factory
Define algorithm factory
Define storage factory
Define scheme
Created CWorker with worker_index 0
Created GWorker with worker_index 0
Created UWorker with worker_index 0
Created training scheme.
Define learner
Created Learner.
Look for a progress_last_chk file - if exists, adjust target_env_steps
Define train loop
Traceback (most recent call last):
File "C:\ProgramData\BOINC\slots\16\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 196, in get_data
self.next_batch = self.batches.__next__()
StopIteration
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "run.py", line 471, in <module>
main()
File "run.py", line 136, in main
learner.step()
File "C:\ProgramData\BOINC\slots\16\lib\site-packages\pytorchrl\learner.py", line 46, in step
info = self.update_worker.step()
File "C:\ProgramData\BOINC\slots\16\lib\site-packages\pytorchrl\scheme\updates\u_worker.py", line 118, in step
self.updater.step()
File "C:\ProgramData\BOINC\slots\16\lib\site-packages\pytorchrl\scheme\updates\u_worker.py", line 259, in step
grads = self.local_worker.step(self.decentralized_update_execution)
File "C:\ProgramData\BOINC\slots\16\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 178, in step
self.get_data()
File "C:\ProgramData\BOINC\slots\16\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 211, in get_data
self.collector.step()
File "C:\ProgramData\BOINC\slots\16\lib\site-packages\pytorchrl\scheme\gradients\g_worker.py", line 490, in step
rollouts = self.local_worker.collect_data(listen_to=["sync"], data_to_cpu=False)
File "C:\ProgramData\BOINC\slots\16\lib\site-packages\pytorchrl\scheme\collection\c_worker.py", line 168, in collect_data
train_info = self.collect_train_data(listen_to=listen_to)
File "C:\ProgramData\BOINC\slots\16\lib\site-packages\pytorchrl\scheme\collection\c_worker.py", line 242, in collect_train_data
obs2, reward, done2, episode_infos = self.envs_train.step(clip_act)
File "C:\ProgramData\BOINC\slots\16\lib\site-packages\pytorchrl\agent\env\vec_envs\vec_env_base.py", line 85, in step
return self.step_wait()
File "C:\ProgramData\BOINC\slots\16\lib\site-packages\pytorchrl\agent\env\vec_envs\vector_wrappers.py", line 72, in step_wait
obs = torch.from_numpy(obs).float().to(self.device)
RuntimeError: [enforce fail at C:\cb\pytorch_1000000000000\work\c10\core\impl\alloc_cpu.cpp:81] data. DefaultCPUAllocator: not enough memory: you tried to allocate 3612672 bytes.
19:03:33 (3936): python.exe exited; CPU time 10494.140625
19:03:33 (3936): app exit status: 0x1
19:03:33 (3936): called boinc_finish(195)
0 bytes in 0 Free Blocks.
552 bytes in 9 Normal Blocks.
1144 bytes in 1 CRT Blocks.
0 bytes in 0 Ignore Blocks.
0 bytes in 0 Client Blocks.
Largest number used: 0 bytes.
Total allocations: 179414097 bytes.
Dumping objects ->
{16455} normal block at 0x000001D92CFBFBE0, 48 bytes long.
Data: <PSI_SCRATCH=C:\P> 50 53 49 5F 53 43 52 41 54 43 48 3D 43 3A 5C 50
{16414} normal block at 0x000001D92CFC08F0, 48 bytes long.
Data: <HOMEPATH=C:\Prog> 48 4F 4D 45 50 41 54 48 3D 43 3A 5C 50 72 6F 67
{16403} normal block at 0x000001D92CFBFF50, 48 bytes long.
Data: <HOME=C:\ProgramD> 48 4F 4D 45 3D 43 3A 5C 50 72 6F 67 72 61 6D 44
{16392} normal block at 0x000001D92CFC0790, 48 bytes long.
Data: <TMP=C:\ProgramDa> 54 4D 50 3D 43 3A 5C 50 72 6F 67 72 61 6D 44 61
{16381} normal block at 0x000001D92CFC0630, 48 bytes long.
Data: <TEMP=C:\ProgramD> 54 45 4D 50 3D 43 3A 5C 50 72 6F 67 72 61 6D 44
{16370} normal block at 0x000001D92CFC0160, 48 bytes long.
Data: <TMPDIR=C:\Progra> 54 4D 50 44 49 52 3D 43 3A 5C 50 72 6F 67 72 61
{16289} normal block at 0x000001D92CF9C280, 140 bytes long.
Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65
..\api\boinc_api.cpp(309) : {16286} normal block at 0x000001D92CFB20C0, 8 bytes long.
Data: < 8- > 00 00 38 2D D9 01 00 00
{15645} normal block at 0x000001D92CFAE470, 140 bytes long.
Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65
{15033} normal block at 0x000001D92CFB2840, 8 bytes long.
Data: <@ 7- > 40 18 37 2D D9 01 00 00
..\zip\boinc_zip.cpp(122) : {550} normal block at 0x000001D92CF9B820, 260 bytes long.
Data: < > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
{537} normal block at 0x000001D92CFAA2D0, 32 bytes long.
Data: < , P , > B0 A9 FA 2C D9 01 00 00 50 AF FA 2C D9 01 00 00
{536} normal block at 0x000001D92CFC0580, 52 bytes long.
Data: < r > 01 00 00 00 72 00 CD CD 00 00 00 00 00 00 00 00
{531} normal block at 0x000001D92CFAA0F0, 43 bytes long.
Data: < p > 01 00 00 00 70 00 CD CD 00 00 00 00 00 00 00 00
{526} normal block at 0x000001D92CFAAF50, 44 bytes long.
Data: < q , > 01 00 00 00 00 00 CD CD 71 AF FA 2C D9 01 00 00
{521} normal block at 0x000001D92CFAA9B0, 44 bytes long.
Data: < , > 01 00 00 00 00 00 CD CD D1 A9 FA 2C D9 01 00 00
{511} normal block at 0x000001D92CFBBDB0, 16 bytes long.
Data: < , > B0 AE FA 2C D9 01 00 00 00 00 00 00 00 00 00 00
{510} normal block at 0x000001D92CFAAEB0, 40 bytes long.
Data: < , input.zi> B0 BD FB 2C D9 01 00 00 69 6E 70 75 74 2E 7A 69
{503} normal block at 0x000001D92CFBCAA0, 16 bytes long.
Data: <h , > 68 F8 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{502} normal block at 0x000001D92CFBCA10, 16 bytes long.
Data: <@ , > 40 F8 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{501} normal block at 0x000001D92CFBCC50, 16 bytes long.
Data: < , > 18 F8 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{500} normal block at 0x000001D92CFBB0C0, 16 bytes long.
Data: < , > F0 F7 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{499} normal block at 0x000001D92CFBC980, 16 bytes long.
Data: < , > C8 F7 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{498} normal block at 0x000001D92CFBAFA0, 16 bytes long.
Data: < , > A0 F7 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{496} normal block at 0x000001D92CFBBD20, 16 bytes long.
Data: <X , > 58 E9 FA 2C D9 01 00 00 00 00 00 00 00 00 00 00
{495} normal block at 0x000001D92CFAAE10, 32 bytes long.
Data: <username=Compsci> 75 73 65 72 6E 61 6D 65 3D 43 6F 6D 70 73 63 69
{494} normal block at 0x000001D92CFBCBC0, 16 bytes long.
Data: <0 , > 30 E9 FA 2C D9 01 00 00 00 00 00 00 00 00 00 00
{493} normal block at 0x000001D92CF9C3B0, 64 bytes long.
Data: <PYTHONPATH=.\lib> 50 59 54 48 4F 4E 50 41 54 48 3D 2E 5C 6C 69 62
{492} normal block at 0x000001D92CFBCE90, 16 bytes long.
Data: < , > 08 E9 FA 2C D9 01 00 00 00 00 00 00 00 00 00 00
{491} normal block at 0x000001D92CFAAFF0, 32 bytes long.
Data: <PATH=.\Library\b> 50 41 54 48 3D 2E 5C 4C 69 62 72 61 72 79 5C 62
{490} normal block at 0x000001D92CFBC350, 16 bytes long.
Data: < , > E0 E8 FA 2C D9 01 00 00 00 00 00 00 00 00 00 00
{489} normal block at 0x000001D92CFBC1A0, 16 bytes long.
Data: < , > B8 E8 FA 2C D9 01 00 00 00 00 00 00 00 00 00 00
{488} normal block at 0x000001D92CFBC8F0, 16 bytes long.
Data: < , > 90 E8 FA 2C D9 01 00 00 00 00 00 00 00 00 00 00
{487} normal block at 0x000001D92CFBB420, 16 bytes long.
Data: <h , > 68 E8 FA 2C D9 01 00 00 00 00 00 00 00 00 00 00
{486} normal block at 0x000001D92CFBBA50, 16 bytes long.
Data: <@ , > 40 E8 FA 2C D9 01 00 00 00 00 00 00 00 00 00 00
{485} normal block at 0x000001D92CFBC110, 16 bytes long.
Data: < , > 18 E8 FA 2C D9 01 00 00 00 00 00 00 00 00 00 00
{484} normal block at 0x000001D92CFAA730, 32 bytes long.
Data: <SystemRoot=C:\Wi> 53 79 73 74 65 6D 52 6F 6F 74 3D 43 3A 5C 57 69
{483} normal block at 0x000001D92CFBC7D0, 16 bytes long.
Data: < , > F0 E7 FA 2C D9 01 00 00 00 00 00 00 00 00 00 00
{482} normal block at 0x000001D92CFAA370, 32 bytes long.
Data: <GPU_DEVICE_NUM=0> 47 50 55 5F 44 45 56 49 43 45 5F 4E 55 4D 3D 30
{481} normal block at 0x000001D92CFBC6B0, 16 bytes long.
Data: < , > C8 E7 FA 2C D9 01 00 00 00 00 00 00 00 00 00 00
{480} normal block at 0x000001D92CFAAC30, 32 bytes long.
Data: <NTHREADS=1 THREA> 4E 54 48 52 45 41 44 53 3D 31 00 54 48 52 45 41
{479} normal block at 0x000001D92CFBC620, 16 bytes long.
Data: < , > A0 E7 FA 2C D9 01 00 00 00 00 00 00 00 00 00 00
{478} normal block at 0x000001D92CFAE7A0, 480 bytes long.
Data: < , 0 , > 20 C6 FB 2C D9 01 00 00 30 AC FA 2C D9 01 00 00
{477} normal block at 0x000001D92CFBCE00, 16 bytes long.
Data: < , > 80 F7 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{476} normal block at 0x000001D92CFBCD70, 16 bytes long.
Data: <X , > 58 F7 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{475} normal block at 0x000001D92CFBB780, 16 bytes long.
Data: <0 , > 30 F7 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{474} normal block at 0x000001D92CFAE6F0, 48 bytes long.
Data: </C "del pythongp> 2F 43 20 22 64 65 6C 20 70 79 74 68 6F 6E 67 70
{473} normal block at 0x000001D92CFBC590, 16 bytes long.
Data: <x , > 78 F6 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{472} normal block at 0x000001D92CFBB150, 16 bytes long.
Data: <P , > 50 F6 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{471} normal block at 0x000001D92CFBC500, 16 bytes long.
Data: <( , > 28 F6 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{470} normal block at 0x000001D92CFBB300, 16 bytes long.
Data: < , > 00 F6 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{469} normal block at 0x000001D92CFBCCE0, 16 bytes long.
Data: < , > D8 F5 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{468} normal block at 0x000001D92CFBCB30, 16 bytes long.
Data: < , > B0 F5 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{467} normal block at 0x000001D92CFBC740, 16 bytes long.
Data: < , > 90 F5 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{466} normal block at 0x000001D92CFBBB70, 16 bytes long.
Data: <h , > 68 F5 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{465} normal block at 0x000001D92CFAA690, 32 bytes long.
Data: <C:\Windows\syste> 43 3A 5C 57 69 6E 64 6F 77 73 5C 73 79 73 74 65
{464} normal block at 0x000001D92CFBBAE0, 16 bytes long.
Data: <@ , > 40 F5 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{463} normal block at 0x000001D92CFA8510, 48 bytes long.
Data: <x pythongpu_wind> 78 20 70 79 74 68 6F 6E 67 70 75 5F 77 69 6E 64
{462} normal block at 0x000001D92CFBC860, 16 bytes long.
Data: < , > 88 F4 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{461} normal block at 0x000001D92CFBB810, 16 bytes long.
Data: <` , > 60 F4 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{460} normal block at 0x000001D92CFBB030, 16 bytes long.
Data: <8 , > 38 F4 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{459} normal block at 0x000001D92CFBC080, 16 bytes long.
Data: < , > 10 F4 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{458} normal block at 0x000001D92CFBB9C0, 16 bytes long.
Data: < , > E8 F3 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{457} normal block at 0x000001D92CFBE000, 16 bytes long.
Data: < , > C0 F3 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{456} normal block at 0x000001D92CFBEB40, 16 bytes long.
Data: < , > A0 F3 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{455} normal block at 0x000001D92CFBDF70, 16 bytes long.
Data: <x , > 78 F3 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{454} normal block at 0x000001D92CFBDAF0, 16 bytes long.
Data: <P , > 50 F3 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{453} normal block at 0x000001D92CFA8460, 48 bytes long.
Data: </C "del pythongp> 2F 43 20 22 64 65 6C 20 70 79 74 68 6F 6E 67 70
{452} normal block at 0x000001D92CFBDEE0, 16 bytes long.
Data: < , > 98 F2 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{451} normal block at 0x000001D92CFBD8B0, 16 bytes long.
Data: <p , > 70 F2 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{450} normal block at 0x000001D92CFBD790, 16 bytes long.
Data: <H , > 48 F2 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{449} normal block at 0x000001D92CFBDE50, 16 bytes long.
Data: < , > 20 F2 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{448} normal block at 0x000001D92CFBEAB0, 16 bytes long.
Data: < , > F8 F1 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{447} normal block at 0x000001D92CFBD9D0, 16 bytes long.
Data: < , > D0 F1 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{446} normal block at 0x000001D92CFBD700, 16 bytes long.
Data: < , > B0 F1 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{445} normal block at 0x000001D92CFBEA20, 16 bytes long.
Data: < , > 88 F1 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{444} normal block at 0x000001D92CFAA4B0, 32 bytes long.
Data: <C:\Windows\syste> 43 3A 5C 57 69 6E 64 6F 77 73 5C 73 79 73 74 65
{443} normal block at 0x000001D92CFBD820, 16 bytes long.
Data: <` , > 60 F1 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{442} normal block at 0x000001D92CFA38F0, 48 bytes long.
Data: <x pythongpu_wind> 78 20 70 79 74 68 6F 6E 67 70 75 5F 77 69 6E 64
{441} normal block at 0x000001D92CFBDA60, 16 bytes long.
Data: < , > A8 F0 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{440} normal block at 0x000001D92CFBE900, 16 bytes long.
Data: < , > 80 F0 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{439} normal block at 0x000001D92CFBE870, 16 bytes long.
Data: <X , > 58 F0 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{438} normal block at 0x000001D92CFBEBD0, 16 bytes long.
Data: <0 , > 30 F0 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{437} normal block at 0x000001D92CFBE6C0, 16 bytes long.
Data: < , > 08 F0 FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{436} normal block at 0x000001D92CFBE480, 16 bytes long.
Data: < , > E0 EF FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{435} normal block at 0x000001D92CFBD4C0, 16 bytes long.
Data: < , > C0 EF FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{434} normal block at 0x000001D92CFBDC10, 16 bytes long.
Data: < , > 98 EF FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{433} normal block at 0x000001D92CFBD430, 16 bytes long.
Data: <p , > 70 EF FB 2C D9 01 00 00 00 00 00 00 00 00 00 00
{432} normal block at 0x000001D92CFBEF70, 2976 bytes long.
Data: <0 , .\7za.ex> 30 D4 FB 2C D9 01 00 00 2E 5C 37 7A 61 2E 65 78
{69} normal block at 0x000001D92CFACC20, 16 bytes long.
Data: < ;* > 80 EA 3B 2A F6 7F 00 00 00 00 00 00 00 00 00 00
{68} normal block at 0x000001D92CFACA70, 16 bytes long.
Data: <@ ;* > 40 E9 3B 2A F6 7F 00 00 00 00 00 00 00 00 00 00
{67} normal block at 0x000001D92CFAC0E0, 16 bytes long.
Data: < W8* > F8 57 38 2A F6 7F 00 00 00 00 00 00 00 00 00 00
{66} normal block at 0x000001D92CFAC050, 16 bytes long.
Data: < W8* > D8 57 38 2A F6 7F 00 00 00 00 00 00 00 00 00 00
{65} normal block at 0x000001D92CFAC9E0, 16 bytes long.
Data: <P 8* > 50 04 38 2A F6 7F 00 00 00 00 00 00 00 00 00 00
{64} normal block at 0x000001D92CFAC680, 16 bytes long.
Data: <0 8* > 30 04 38 2A F6 7F 00 00 00 00 00 00 00 00 00 00
{63} normal block at 0x000001D92CFACB00, 16 bytes long.
Data: < 8* > E0 02 38 2A F6 7F 00 00 00 00 00 00 00 00 00 00
{62} normal block at 0x000001D92CFAC950, 16 bytes long.
Data: < 8* > 10 04 38 2A F6 7F 00 00 00 00 00 00 00 00 00 00
{61} normal block at 0x000001D92CFAC8C0, 16 bytes long.
Data: <p 8* > 70 04 38 2A F6 7F 00 00 00 00 00 00 00 00 00 00
{60} normal block at 0x000001D92CFAC710, 16 bytes long.
Data: < 6* > 18 C0 36 2A F6 7F 00 00 00 00 00 00 00 00 00 00
Object dump complete.
</stderr_txt>
]]>
|
|
|
|
it's right in your message:
"RuntimeError: [enforce fail at C:\cb\pytorch_1000000000000\work\c10\core\impl\alloc_cpu.cpp:81] data. DefaultCPUAllocator: not enough memory: you tried to allocate 3612672 bytes."
that's why.
this is a known problem with the windows app. you need to increase your virtual memory (page file) to like 50GB.
also it looks like your host only has 16GB system RAM. if you're running other things that use lots of memory (like rosetta or einstein GW CPU tasks) then you might be running out of system memory too. these python tasks need about 10GB of system memory for each one.
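as a rough sketch of that memory arithmetic (the ~10 GB/task figure is from this post; the reserve values are just illustrative examples):

```python
# Back-of-the-envelope check: how many ~10 GB Python tasks fit on a host?
# The ~10 GB per task figure comes from the post above; the RAM totals and
# reserve amounts below are made-up examples, not measured values.

def max_concurrent_tasks(total_ram_gb, reserved_gb, per_task_gb=10):
    """Tasks that fit after reserving memory for the OS and other projects."""
    return max(0, int((total_ram_gb - reserved_gb) // per_task_gb))

print(max_concurrent_tasks(16, 4))    # 16 GB host, 4 GB reserved -> 1 task
print(max_concurrent_tasks(128, 8))   # 128 GB host has far more headroom
```

so on a 16 GB host that is also feeding memory-hungry CPU projects, even one Python task can push the system into swap, which is why the big page file helps.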
____________
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
I am experiencing a strange problem on my PC with two RTX3070 inside, CPU Intel i9-10900KF (10 cores/20 threads), 128 GB RAM:
until about 2 weeks ago, I crunched 4 Python tasks concurrently (2 ea. GPU).
Then I processed ACEMD_3 and ATM tasks, the queues of which ran dry now.
So I changed back to Python - and surprise: after downloading 4 tasks, only 3 started, the fourth one stays in status "ready to start".
I had made no changes: neither to the hardware, nor the software, nor the settings.
Anyone any idea what I can do in order to get the fourth task to run? |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
A consequence of running the acemd3 and ATM tasks is that they dropped your APR (average processing rate) on the host, so the client now thinks you will not be able to finish the second Python task before the deadline.
You probably have the single Python task in EDF (earliest deadline first) mode now.
Try adding <fraction_done_exact/> into every app section in your app_config.xml
That helps produce more realistic progress percentages and may persuade the client to let you run that second task on that GPU.
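For reference, a minimal app_config.xml along those lines might look like the sketch below. The app name here is an assumption for illustration; check the <name> elements in your client_state.xml for the exact app names your client uses.

```xml
<!-- Minimal sketch of an app_config.xml with <fraction_done_exact/>.
     The app name "PythonGPU" is an assumption; verify it against
     client_state.xml before using. -->
<app_config>
  <app>
    <name>PythonGPU</name>
    <fraction_done_exact/>
  </app>
</app_config>
```

The file goes in the project's directory under the BOINC data folder, and the client picks it up after a "Read config files" or a restart.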
But you may just have to let the APR mechanism balance out again. One of the many flaws in BOINC. |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
thank you, Keith, for the explanation :-)
<fraction_done_exact/> has been in the app_config to begin with.
So I am afraid I just need to wait ...
|
|
|
KAMasudSend message
Joined: 27 Jul 11 Posts: 137 Credit: 523,901,354 RAC: 0 Level
Scientific publications
|
Absolutely no GPU usage, only CPU.
task 27464783 |
|
|
|
for the first 5 minutes or so, there will only be CPU use and no GPU use because the task is extracting the python environment to the designated slot. after this, the task will run and start using both GPU and CPU. GPU use will be low.
____________
|
|
|
KAMasudSend message
Joined: 27 Jul 11 Posts: 137 Credit: 523,901,354 RAC: 0 Level
Scientific publications
|
No, it was not at 5% but 29% and stuck. I exited BOINC and restarted. The WU is now normal at 34%. |
|
|
|
i said 5 minutes not 5%.
but sounds like an issue with your system, not the tasks. my tasks have never gotten stuck like that.
____________
|
|
|
|
It is 20 minutes on my hdd. |
|
|
KAMasudSend message
Joined: 27 Jul 11 Posts: 137 Credit: 523,901,354 RAC: 0 Level
Scientific publications
|
i said 5 minutes not 5%.
but sounds like an issue with your system, not the tasks. my tasks have never gotten stuck like that.
____________
Chill, bro. Completed and validated. |
|
|
|
It is 20 minutes on my hdd.
that makes sense for a slower device like an HDD.
____________
|
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
Is the Python project dead? |
|
|
Keith Myers Send message
Joined: 13 Dec 17 Posts: 1340 Credit: 7,652,488,294 RAC: 13,521,734 Level
Scientific publications
|
Is the Python project dead?
Could be. Haven't seen the researcher behind those task types around for quite a while.
Could be he has moved on, or maybe he's just taking a summer sabbatical or something. |
|
|
TofPeteSend message
Joined: 17 Mar 24 Posts: 7 Credit: 52,127,500 RAC: 208,643 Level
Scientific publications
|
Hi,
I'm receiving the following error message after about 700-800 sec running time:
09:33:09 (32292): Library/usr/bin/tar.exe exited; CPU time 0.000000
09:33:09 (32292): wrapper: running C:/Windows/system32/cmd.exe (/c call Scripts\activate.bat && Scripts\conda-unpack.exe && run.bat)
Could not find platform independent libraries <prefix>
Python path configuration:
PYTHONHOME = (not set)
PYTHONPATH = (not set)
program name = '\\?\D:\ProgramData\BOINC\slots\4\python.exe'
isolated = 0
environment = 1
user site = 1
safe_path = 0
import site = 1
is in build tree = 0
stdlib dir = 'D:\ProgramData\BOINC\slots\4\Lib'
sys._base_executable = '\\\\?\\D:\\ProgramData\\BOINC\\slots\\4\\python.exe'
sys.base_prefix = 'D:\\ProgramData\\BOINC\\slots\\4'
sys.base_exec_prefix = 'D:\\ProgramData\\BOINC\\slots\\4'
sys.platlibdir = 'DLLs'
sys.executable = '\\\\?\\D:\\ProgramData\\BOINC\\slots\\4\\python.exe'
sys.prefix = 'D:\\ProgramData\\BOINC\\slots\\4'
sys.exec_prefix = 'D:\\ProgramData\\BOINC\\slots\\4'
sys.path = [
'D:\\ProgramData\\BOINC\\slots\\4\\python311.zip',
'D:\\ProgramData\\BOINC\\slots\\4\\DLLs',
'D:\\ProgramData\\BOINC\\slots\\4\\Lib',
'\\\\?\\D:\\ProgramData\\BOINC\\slots\\4',
]
Fatal Python error: init_fs_encoding: failed to get the Python codec of the filesystem encoding
Python runtime state: core initialized
ModuleNotFoundError: No module named 'encodings'
Current thread 0x000058b0 (most recent call first):
<no Python frame>
09:33:10 (32292): C:/Windows/system32/cmd.exe exited; CPU time 0.000000
09:33:10 (32292): app exit status: 0x1
09:33:10 (32292): called boinc_finish(195)
Any idea why this error happens recently?
Thanks,
Peter |
|
|
Erich56Send message
Joined: 1 Jan 15 Posts: 1132 Credit: 10,205,482,676 RAC: 29,855,510 Level
Scientific publications
|
Hi,
I'm receiving the following error message after about 700-800 sec running time:
...
Any idea why this error happens recently?
When did you receive this task that you think is a Python one? Pythons have not been around for quite a while - just take a look at the server status page |
|
|
TofPeteSend message
Joined: 17 Mar 24 Posts: 7 Credit: 52,127,500 RAC: 208,643 Level
Scientific publications
|
I think it's a Python task because the error message points to a Python problem:
Fatal Python error: init_fs_encoding: failed to get the Python codec of the filesystem encoding
Python runtime state: core initialized
ModuleNotFoundError: No module named 'encodings'
I got these tasks today and in the recent days:
Task received at in UTC | Computing status text | Runtime | Application name
28 Aug 2024 8:31:51 UTC | Error while computing | 672.11 | ATMML: Free energy with neural networks v1.01 (cuda1121)
28 Aug 2024 8:03:54 UTC | Error while computing | 703.63 | ATMML: Free energy with neural networks v1.01 (cuda1121)
28 Aug 2024 7:34:52 UTC | Error while computing | 708.96 | ATMML: Free energy with neural networks v1.01 (cuda1121)
28 Aug 2024 7:20:30 UTC | Error while computing | 714.93 | ATMML: Free energy with neural networks v1.01 (cuda1121)
28 Aug 2024 8:17:39 UTC | Error while computing | 709.18 | ATMML: Free energy with neural networks v1.01 (cuda1121)
28 Aug 2024 7:49:20 UTC | Error while computing | 724.49 | ATMML: Free energy with neural networks v1.01 (cuda1121)
27 Aug 2024 9:35:49 UTC | Error while computing | 776.90 | ATMML: Free energy with neural networks v1.01 (cuda1121)
27 Aug 2024 1:24:00 UTC | Error while computing | 60.60 | ATMML: Free energy with neural networks v1.01 (cuda1121)
26 Aug 2024 9:41:56 UTC | Error while computing | 20.18 | ATMML: Free energy with neural networks v1.01 (cuda1121) |
|
|
|
Those describe themselves as ATMML tasks - the clue is in the name.
There's been a major problem with ATMML tasks in the last 24 hours - all workunits created since around 13:00 UTC yesterday have a systemic failure which causes them to fail very early.
That's the project's problem, not your problem. |
|
|
TofPeteSend message
Joined: 17 Mar 24 Posts: 7 Credit: 52,127,500 RAC: 208,643 Level
Scientific publications
|
Thank you
Those describe themselves as ATMML tasks - the clue is in the name.
There's been a major problem with ATMML tasks in the last 24 hours - all workunits created since around 13:00 UTC yesterday have a systemic failure which causes them to fail very early.
That's the project's problem, not your problem.
|
|
|