Message boards : Server and website : python tasks get to 2.00% and hang

Sanford
Joined: 23 Nov 09
Posts: 5
Credit: 382,298,193
RAC: 0
Message 58899 - Posted: 11 Jun 2022 | 4:50:16 UTC

The remaining estimate is over 45 days on an 8th-gen Intel system with an NVIDIA 1060 GPU. I also see some crash popups mentioning Python, but the task in BOINC still shows Active and never seems to progress. I have suspended and even rebooted, but the task is still stuck at 2%. Perhaps my mix of software (Android development/emulators, etc., that use VT-d modes) is causing problems? I guess it would be nice if you could roll everything up into an .EXE without needing to run a VirtualBox VM.

captainjack
Joined: 9 May 13
Posts: 171
Credit: 2,323,729,288
RAC: 2,350,800
Message 58903 - Posted: 11 Jun 2022 | 18:53:25 UTC

AFAIK, GPUGRID does not use a VirtualBox VM. At least it doesn't on either of my machines.

The remaining estimate always starts out really high; once several tasks have completed successfully, it will start being more accurate.

A Python task will use about 65% of the available CPU threads. For example, on my 6-core/12-thread CPU, Python tasks will use 7-8 threads.

On all of your tasks that I checked, the error message is:

OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "C:\ProgramData\BOINC\slots\13\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll" or one of its dependencies.
Traceback (most recent call last):

I suggest that you increase the size of your paging file, then run a Python task by itself while watching system usage, and let it run through to completion without interruption.
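
If you want to put numbers on it first, here's a quick standard-library sketch (Windows-only, purely illustrative) that prints the commit limit, which is roughly RAM plus the current pagefile size, and how much of it is still free:

    # Rough check of the Windows commit limit (RAM + pagefile) vs. free commit space.
    import ctypes

    class MEMORYSTATUSEX(ctypes.Structure):
        _fields_ = [
            ("dwLength", ctypes.c_ulong),
            ("dwMemoryLoad", ctypes.c_ulong),
            ("ullTotalPhys", ctypes.c_ulonglong),        # physical RAM
            ("ullAvailPhys", ctypes.c_ulonglong),
            ("ullTotalPageFile", ctypes.c_ulonglong),    # commit limit: RAM + pagefile
            ("ullAvailPageFile", ctypes.c_ulonglong),    # commit space still available
            ("ullTotalVirtual", ctypes.c_ulonglong),
            ("ullAvailVirtual", ctypes.c_ulonglong),
            ("ullAvailExtendedVirtual", ctypes.c_ulonglong),
        ]

    status = MEMORYSTATUSEX()
    status.dwLength = ctypes.sizeof(MEMORYSTATUSEX)
    ctypes.windll.kernel32.GlobalMemoryStatusEx(ctypes.byref(status))

    gib = 1024 ** 3
    print(f"RAM:          {status.ullTotalPhys / gib:.1f} GiB")
    print(f"Commit limit: {status.ullTotalPageFile / gib:.1f} GiB")
    print(f"Commit free:  {status.ullAvailPageFile / gib:.1f} GiB")

If the "commit free" figure is only a few GB, the Python tasks will likely trip WinError 1455 as soon as the PyTorch DLLs load.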

Let us know if that helps.

Keith Myers
Joined: 13 Dec 17
Posts: 1284
Credit: 4,928,056,959
RAC: 6,470,965
Message 58908 - Posted: 11 Jun 2022 | 20:56:12 UTC

The problem is that Windows guarantees every memory reservation it grants rather than overcommitting. And the Python tasks request a ton of memory, more than what the automatically sized paging file in Windows can handle.

The solution is to deselect automatic sizing and either set a system-managed size or a custom size with a very large value, on the order of tens of gigabytes.
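
(For anyone hunting for the setting, it's under System Properties -> Advanced -> Performance Settings -> Advanced -> Virtual memory -> Change; untick "Automatically manage paging file size for all drives" first. That's from memory on Windows 10, so the exact wording may differ slightly on your version.)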

From a good GitHub reply on the problem that concisely sums up the issue:


The issue is with how multi-process Python works on Windows with the pytorch/cuda DLLs. The number of workers you set in the DataLoader directly relates to how many Python processes are created.

Each time a Python process imports pytorch it loads several DLLs. These DLLs have very large sections of data in them that aren't really used, but space is reserved for them in memory anyway. We're talking in the range of hundreds of megabytes to a couple of gigabytes, per DLL.

When Windows is asked to reserve memory, if it says that it returned memory then it guarantees that memory will be available to you, even if you never end up using it.

Linux allows overcommitting. By default on Linux, when you ask it to reserve memory, it says "Yeah sure, here you go" and tells you that it reserved the memory. But it hasn't actually done this. It will reserve it when you try to use it, and hopes that there is something available at that time.

So, if you allocate memory on Windows, you can be sure you can use that memory. If you allocate memory on Linux, it is possible that when you actually try to use the memory that it will not be there, and your program will crash.

On Linux, when it spawns num_workers processes and each one reserves several gigabytes of data, Linux is happy to say it reserved this, even though it didn't. Since this "reserved memory" is never actually used, everything is good. You can create tons of worker processes. Even if pytorch allocates 50GB of memory, as long as it never actually uses it, it won't be a problem. (Note: I haven't actually run pytorch on Linux. I am just describing how Linux would not have this crash even if it attempted to allocate the same amount of memory. I do not know for a fact that pytorch/CUDA overallocate on Linux.)

On Windows, when you spawn num_workers processes and each one reserves several gigabytes of data, Windows insists that it can actually satisfy this request should the memory be used. So, if Python tries to allocate 50GB of memory, then your total RAM + page file size must have space for 50GB.

So, on Windows, NumPythonProcesses * MemoryPerProcess < RAM + PageFileSize must be true or you will hit this error.
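
Just to make the pattern concrete, here is a minimal sketch of the kind of code that reply is talking about. It is not GPUGRID's actual app; the dataset, batch size and worker count are made up for illustration:

    # Each DataLoader worker is a separate Python process. On Windows (spawn
    # start method) every worker re-imports torch and re-loads the CUDA/cuDNN
    # DLLs, so the committed memory grows roughly linearly with num_workers.
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    def main():
        dataset = TensorDataset(torch.zeros(1024, 8))
        loader = DataLoader(dataset, batch_size=32, num_workers=8)
        for (batch,) in loader:
            pass  # the 8 worker processes each carry their own torch import

    if __name__ == "__main__":  # required on Windows for multiprocessing workers
        main()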


The number of workers for the Python on GPU tasks is 32.

So the inequality is going to be 32 * MemoryPerProcess < RAM + PageFileSize.

And many of the PyTorch DLLs request a couple of GB of memory allocation each.

This is why it is difficult to run the tasks on GPUs: the system memory + pagefile size is most often inadequate. The pagefile size needs to be greatly increased, and doing that means you need a large piece of storage real estate for the pagefile.
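
As a rough worked example with those numbers: if each of the 32 workers ends up committing around 2 GB once the PyTorch DLLs are loaded, that is already about 64 GB of commit charge, and with several such DLLs per worker it can be more. A box with 16 or 32 GB of RAM and a default-sized pagefile simply can't satisfy that inequality, hence the WinError 1455.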

lukeu
Joined: 14 Oct 11
Posts: 31
Credit: 75,720,504
RAC: 0
Message 59273 - Posted: 19 Sep 2022 | 9:21:47 UTC - in response to Message 58908.

Hi, and thanks for the clear info. I was just rejoining to crunch for the winter season, and my first thought (after frantically increasing the swap size) was indeed: "is something assuming Linux-style overcommit?"

Out of curiosity I also fired up VMMap, which led me to this Stack Overflow answer that did the same: https://stackoverflow.com/a/69489193/932359

The frustrating bit is that it seems to just be an incorrect flag set by NVIDIA on their embedded "fat binaries": setting copy-on-write means Windows's rigorous memory accounting has to commit space for each instance. If it isn't writable _data_, it should be marked read-only so it can be shared (memory-mapped) between all processes. It reads to me that this probably includes the binary code for all the various GPUs we don't own! :-D
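
If anyone wants to check their own copy, here's a small sketch along the lines of that Stack Overflow answer. It needs the third-party pefile package, and the DLL path is just the one from the error message earlier in this thread; point it wherever your slot directory lives:

    # Report whether the nv_fatb section in a CUDA/cuDNN DLL is still marked
    # writable (copy-on-write), which is what forces Windows to commit space
    # for it in every process. Requires: pip install pefile
    import pefile

    IMAGE_SCN_MEM_WRITE = 0x80000000

    dll = r"C:\ProgramData\BOINC\slots\13\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll"
    pe = pefile.PE(dll, fast_load=True)

    for section in pe.sections:
        name = section.Name.rstrip(b"\x00").decode(errors="replace")
        if name == "nv_fatb":
            writable = bool(section.Characteristics & IMAGE_SCN_MEM_WRITE)
            size_mib = section.Misc_VirtualSize / (1024 * 1024)
            print(f"{name}: {size_mib:.0f} MiB, writable={writable}")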

Notable quote:

edit 2022-01-20: Per NVIDIA: "We have gone ahead and marked the nv_fatb section as read-only; this change will be targeting the next major CUDA release, 11.7. We are not changing the ASLR, as that is considered a safety feature."


I take it, just from the filenames, that GPUGrid is using CUDA 10.x? (Their DLLs don't seem to embed version info.)

(The problem for me is that I'm trying to use an old 64 GB SSD as a scratch disk for all swap & temp files. But not to worry, I'll see if I can scrape by with 48 GB and will add another swap file on another disk if need be.)

Keith Myers
Joined: 13 Dec 17
Posts: 1284
Credit: 4,928,056,959
RAC: 6,470,965
Message 59276 - Posted: 19 Sep 2022 | 23:52:56 UTC - in response to Message 59273.

The Nvidia drivers are already up to CUDA 11.7 in the 515 series.

So maybe you can ping the developer abouh and see whether he can drop the lower-compatibility CUDA 10.2 and 11.3 versions he is compiling the Windows apps with and move to the CUDA 11.7 SDK, so that the nv_fatb sections in the DLLs will be marked read-only.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1031
Credit: 35,675,507,483
RAC: 77,310,474
Message 59277 - Posted: 20 Sep 2022 | 0:29:00 UTC - in response to Message 59273.

I take it, just from the filenames, that GPUGrid is using CUDA 10.x? (Their DLLs don't seem to embed version info.)


The app is CUDA 11.3.1 (11.3 Update 1) or CUDA 10.2. Which version you get probably depends on which drivers you have. Since your 512 drivers support 11.3+, you got the cuda1131 app. You can see it here in your list of tasks: http://www.gpugrid.net/results.php?hostid=470942

lukeu
Joined: 14 Oct 11
Posts: 31
Credit: 75,720,504
RAC: 0
Message 59278 - Posted: 20 Sep 2022 | 6:45:35 UTC - in response to Message 59276.

The Nvidia drivers are already up to CUDA 11.7 in the 515 series. (...)


Ah, right. I didn't realise this would also put a minimum version constraint on the local graphics drivers. Mine was at 512.15 from Mar 2022 via Windows automatic updates. (I'm currently downloading the latest to install manually.)
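
(Side note for anyone checking their own setup: running nvidia-smi from a command prompt prints the driver version, and on reasonably recent drivers the banner also shows the highest CUDA version that driver supports.)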

I guess this catches users between:

    - "a rock": runs with larger-than-expected memory consumption, and
    - "a hard place": doesn't run at all due to driver incompatibility


So in sum, it's probably premature for them to jump to 11.7 right now: "the rock" probably works out of the box for more users. Maybe once a sufficient version has come through Windows Update, it could be time to consider it.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1031
Credit: 35,675,507,483
RAC: 77,310,474
Message 59284 - Posted: 20 Sep 2022 | 14:47:49 UTC - in response to Message 59278.

So in sum, it's probably premature for them to jump to 11.7 right now: "the rock" probably works out of the box for more users. (...)

Since your 1060 supports old drivers, you could roll the drivers back to a CUDA 10.2-level version and get the 10.2 app; maybe that will use less memory, since it wouldn't contain any CUDA 11+ code or binaries for CUDA 11 cards.
