Why is using two Nvidia GTX 1080 cards a problem? I only compute one WU even when two WUs are sent, and Python uses more than 30% of the 3900X processor's capacity.
If I leave two WUs running, one of them "hangs" at 4%.
cc_config is set to:
<cc_config>
  <options>
    <use_all_gpus>1</use_all_gpus>
  </options>
</cc_config>
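(A quick way to double-check, outside of BOINC, that both GTX 1080s are visible to a PyTorch runtime like the one these Python tasks use. This is only a minimal sketch for a local Python install that has torch available; it is not part of the GPUGRID application.)

```python
# Hypothetical sanity check: confirm both GPUs are visible to PyTorch.
# Requires a local PyTorch install; not part of the GPUGRID task itself.
import torch

print("CUDA available:", torch.cuda.is_available())
print("GPUs visible:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  device {i}: {torch.cuda.get_device_name(i)}")
```

If this only reports one device, the problem would be at the driver/OS level rather than in the cc_config setting.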
____________
Keith Myers
Joined: 13 Dec 17 Posts: 1376 Credit: 8,057,443,420 RAC: 6,093,044
It's because these tasks are primarily CPU tasks, with small, infrequent bursts of GPU activity.
The reason your tasks fail is that you are running Windows, which has limitations for these tasks.
Your tasks fail with this error message:
DefaultCPUAllocator: not enough memory: you tried to allocate 3612672 bytes.
You need to increase your paging file to around 60 GB, and then you should be able to process two tasks concurrently.
They will use almost all of your CPU.
Please read through the main thread for these tasks for the reasons why.
https://www.gpugrid.net/forum_thread.php?id=5233
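(For reference, one way to see on the Windows side whether the commit limit, i.e. RAM plus pagefile, is actually large enough is to query GlobalMemoryStatusEx. This is a minimal sketch; the ~60 GB threshold is only an illustrative assumption taken from the advice above, not a project requirement.)

```python
# Minimal sketch: report the Windows commit limit (RAM + pagefile) via ctypes.
# The ~60 GiB figure below is only an illustrative assumption from this thread.
import ctypes

class MEMORYSTATUSEX(ctypes.Structure):
    _fields_ = [
        ("dwLength", ctypes.c_ulong),
        ("dwMemoryLoad", ctypes.c_ulong),
        ("ullTotalPhys", ctypes.c_ulonglong),
        ("ullAvailPhys", ctypes.c_ulonglong),
        ("ullTotalPageFile", ctypes.c_ulonglong),   # commit limit: RAM + pagefile
        ("ullAvailPageFile", ctypes.c_ulonglong),   # commit space still available
        ("ullTotalVirtual", ctypes.c_ulonglong),
        ("ullAvailVirtual", ctypes.c_ulonglong),
        ("ullAvailExtendedVirtual", ctypes.c_ulonglong),
    ]

status = MEMORYSTATUSEX()
status.dwLength = ctypes.sizeof(status)
ctypes.windll.kernel32.GlobalMemoryStatusEx(ctypes.byref(status))

gib = 1024 ** 3
print(f"Commit limit:     {status.ullTotalPageFile / gib:.1f} GiB")
print(f"Commit available: {status.ullAvailPageFile / gib:.1f} GiB")
if status.ullTotalPageFile < 60 * gib:
    print("Commit limit is below ~60 GiB; a larger pagefile may be needed.")
```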
Hi Keith,
Thanks for your reply. I changed the page file in Windows 11 from automatic to 60000 MB, and at first computing with both GPUs went fine, but both tasks crashed after 4% progress.
32 GB of RAM is available.
I noticed that Windows does not allocate the 60000 MB after being instructed to do so; the allocation seems to vary.
Now, as before, I run one task and stop the second one before 4%, when it wants to start.
All tasks complete successfully when running one task at a time.
Keith Myers
Joined: 13 Dec 17 Posts: 1376 Credit: 8,057,443,420 RAC: 6,093,044
The 3900X host is still erroring out with not enough memory in the stderr.txt outputs.
You need to bump the pagefile up some more; try 100000 MB. I'm assuming your storage actually has enough free space for a file of that size.
I don't know much about Windows, but you may need to restart Windows for the paging file change to take effect.
Keith,
I now have two page files of 100000 MB each. I watched two WUs start in Task Manager; both run Python at about 20% processor capacity until one WU disappears and stops at 4% progress.
I can't find the stderr.txt outputs in Windows 11.
Keith Myers
Joined: 13 Dec 17 Posts: 1376 Credit: 8,057,443,420 RAC: 6,093,044
[...]
I can't find the stderr.txt outputs in Windows 11.
The stderr.txt output is the result file listed for every returned task on the website. You can examine every one of your tasks in your browser there; just click on the task number in the left-most column.
For example, your latest errored task:
https://www.gpugrid.net/result.php?resultid=33155788
This looks like a bad task, however: it failed first because it couldn't get all its required file resources, and then it failed later, as usual, because of not enough virtual memory.
Error loading "C:\ProgramData\BOINC\slots\39\lib\site-packages\torch\lib\shm.dll" or one of its dependencies
DefaultCPUAllocator: not enough memory: you tried to allocate 3612672 bytes.
Maybe some Windows user can help further; I am out of suggestions. When I have helped other Windows users by explaining why these tasks are troublesome for the Windows OS and offered the same suggestion to increase the pagefile size, they have been successful.
I suggest returning to the main thread I linked and reading through it, or through other Windows users' posts, to glean some other pertinent information.
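(If clicking through each result page is tedious, a small script can fetch the public result pages and flag the memory error. This is just a sketch that greps the raw HTML; the result ID below is the one quoted above and should be replaced with your own, and the error string is assumed to appear verbatim in the page.)

```python
# Sketch: fetch GPUGRID result pages and flag the out-of-memory error.
# Result IDs are placeholders; the error string matches the one quoted above.
import urllib.request

RESULT_IDS = [33155788]  # replace with your own result IDs
ERROR_MARKER = "DefaultCPUAllocator: not enough memory"

for rid in RESULT_IDS:
    url = f"https://www.gpugrid.net/result.php?resultid={rid}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    flag = "OUT OF MEMORY" if ERROR_MARKER in html else "no memory error found"
    print(f"{rid}: {flag}")
```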
jjch
Joined: 10 Nov 13 Posts: 101 Credit: 15,760,461,122 RAC: 1,904,415
Matthias,
First thing, simplify your troubleshooting: configure only one Python task to run. After you get that working successfully, try adding the 2nd one; if that causes trouble, go back to just one and see how that works. With one task you can monitor things and see what the sizing looks like. If you are running other projects along with GPUgrid, you should stop those and get them cleared off.
Second, you probably don't need 2 page files; that could actually be complicating things. Set up one page file on your primary OS disk. Select Custom size and set the Initial and Max size; for example, with one Python task running, mine is set to 24576 and 51200. You can also see how much is currently allocated, and that will help you find out where your resources are limited. Mine currently says 48535 MB with one task running. Remove the 2nd page file too.
The stderr.txt files are located in the slots directory wherever your BOINC Program Data folder is. You need to find the slot folder for the GPUGRID task by viewing the Properties. Once you open that you will find the file. Take a look at that when a job is running and see what it says. When a job is running correctly it should say "Created Learner" and it will stay there for several hours until it finishes or fails.
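(If hunting through the slot folders by hand is awkward, something like the following can list each slot's stderr.txt tail and whether "Created Learner" has appeared yet. It is only a sketch and assumes the default BOINC data directory C:\ProgramData\BOINC; adjust the path if yours differs.)

```python
# Sketch: tail stderr.txt in every BOINC slot and check for "Created Learner".
# Assumes the default Windows data directory; adjust BOINC_DATA if needed.
from pathlib import Path

BOINC_DATA = Path(r"C:\ProgramData\BOINC")  # assumption: default install location

for stderr in sorted(BOINC_DATA.glob("slots/*/stderr.txt")):
    text = stderr.read_text(errors="replace")
    status = "Created Learner seen" if "Created Learner" in text else "not yet / failed"
    last_lines = "\n    ".join(text.splitlines()[-3:])
    print(f"{stderr.parent.name}: {status}\n    {last_lines}")
```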
BOINC can also be a little touchy when it comes to how much disk space and memory it is allocating. This could actually be related to your problem. The default settings don't always work right for what you need. First look at the Disk tab and see what the Total disk usage looks like. Pay attention to the free, available to BOINC size. Then look at what GPUgrid is using. If you are running other projects you will need to compare the total size to what is available and make sure it is enough for everything.
You can make changes to these settings under Options > Computing preferences > Disk and memory tab. For the Disk section look and see if it is giving you enough disk space. You might only need 3-4 GB more so make an adjustment there as needed. You can lock it to a fixed size if you would like to do that too. Also, under the Memory section the "When computer is (not) in use ..." might need to be increased a bit. Make sure the Page/swap file setting is 100.00%
Final thoughts: I don't know how successful Win11 is for GPUgrid yet; there could be other issues there. I recommend you tune it up as best you can as well: check for Windows updates, update GPU drivers, clean up disk space, etc. Don't run a lot of other programs at the same time you are running GPUgrid either; there could be a conflict over resources there too. GLHF
gemini8
Joined: 3 Jul 16 Posts: 31 Credit: 2,234,559,169 RAC: 60,076
[...]
Select Custom size and set the Initial and Max size. For example with one Python task running mine is set 24576 and 51200.
[...]
Make sure the Page/swap file setting is 100.00%
I'd go for a fixed-size page file.
Just set it to 51200, or whatever, on initial AND max size. That way the space is always reserved, and adding more space fast enough can't become a problem.
My page file setting is 1% on all my machines, including Debian, Mac OS, Ubuntu and Win7 crunchers. No problems with that so far.
I don't know too much about fragmentation on recent Windows machines. My Macs defragment themselves quite nicely, so Windows might be able to do so as well nowadays, and the next step might not be necessary anymore: if the page file is on a rotating disk, try disabling it, rebooting, defragmenting your drive, and then re-enabling the page file at the size you want.
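(One way to confirm that a fixed size actually stuck, rather than Windows quietly staying on system-managed sizing, is to read the PagingFiles value from the registry. A minimal read-only sketch follows; it uses the standard Memory Management registry key and only inspects the setting, it does not change anything.)

```python
# Read-only sketch: show the configured page file entries from the registry.
# Entries typically look like "C:\pagefile.sys 51200 51200" (initial and max size in MB).
import winreg

KEY_PATH = r"SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management"

with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH) as key:
    entries, _ = winreg.QueryValueEx(key, "PagingFiles")

for entry in entries:
    print(entry)
```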
____________
- - - - - - - - - -
Greetings, Jens
Hi Keith, jjch and gemini8,
The funny, mysterious thing is that today it works for both (2) WUs, no crash.
The paging file is now 81845 MB as allocated by Windows 11 (I fixed the size, but Windows ignores it).
jjch: GLHF is funny. I looked it up: good luck, have fun. Thanks for sharing.
I have had fun running BOINC for years now, and I need to keep buying new (faster, more cores) hardware.
We could/should ask some people to return to contributing to GPUGRID; I don't know why they stopped (Python?).
Thanks again all ~ Matthias-Poortvliet-Netherlands
Keith Myers
Joined: 13 Dec 17 Posts: 1376 Credit: 8,057,443,420 RAC: 6,093,044
I just became a pioneer with arrows in my back.
Just upgraded one host to a new AM5 platform with a 7950X CPU and DDR5-6000 memory.
Lots of stuff to figure out now, like the fact that absolutely no sensors are available in Linux except for the GPUs and the NVMe stick temps. No fan speeds, no temps, no voltages are available.
Too new a platform for Ubuntu 22.04.1 LTS.
My planned upgrade within a few weeks is to an AM4 5950X.
I'm collecting second-hand hardware when I can.
And Keith ~ arrows > you're not dead yet.
I think AM5 is a bit over the top in energy usage / efficiency.
Still, I wish you GLHF.
(:-)
Keith Myers
Joined: 13 Dec 17 Posts: 1376 Credit: 8,057,443,420 RAC: 6,093,044
So far no difference in energy usage or temps.
Benefit of being able to run my PCIe Gen 4 cards at Gen 4 speeds now.
Benefit of having Gen 4 M.2 speeds with a Gen 4 storage device now.
Benefit of running CPU tasks 800-1000 MHz faster than previously on the 5950X.
Some projects' CPU tasks scale linearly with clock speed.
Haven't run any projects that can make use of AVX-512 SIMD instructions yet.