
Message boards : Server and website : python tasks get to 2.00% and hang

Sanford
Joined: 23 Nov 09
Posts: 5
Credit: 382,298,193
RAC: 225
Message 58899 - Posted: 11 Jun 2022 | 4:50:16 UTC

Remaining estimate is over 45 days on an 8th-gen Intel system with an NVIDIA 1060 GPU. I also see some crash popups mentioning Python, yet the task in BOINC still shows Active but never seems to progress. I have suspended it and even rebooted, but the task is still stuck at 2%. Perhaps my mix of software (Android development, emulators, etc. that use VT-d modes) is causing problems? I guess it would be nice if you could roll everything up into an .EXE without needing to run a VirtualBox VM.

captainjack
Joined: 9 May 13
Posts: 170
Credit: 1,273,603,256
RAC: 85,220
Message 58903 - Posted: 11 Jun 2022 | 18:53:25 UTC

AFAIK, GPUGRID does not use a VirtualBox VM. At least it doesn't on either of my machines.

The remaining estimate always starts out very high until several tasks have completed successfully; after that it becomes more accurate.

A Python task will use about 65% of the available CPU threads. For example, on my 6-core, 12-thread CPU, Python tasks use 7-8 threads.

On all of your tasks that I checked, they show this error message:

    OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "C:\ProgramData\BOINC\slots\13\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll" or one of its dependencies.
    Traceback (most recent call last):

My suggestion: increase the size of your paging file, run a Python task by itself while watching system usage, and let the task run through to completion without interruption.
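
If you want an easy way to watch that usage while a task runs, a rough sketch along these lines will print RAM and pagefile use every few seconds (it assumes the third-party psutil package is installed, which is not part of BOINC or the GPUGRID tasks):

    # Rough memory watcher: prints RAM and pagefile (swap) usage every 10 seconds.
    # Requires the third-party psutil package: pip install psutil
    import time

    import psutil

    while True:                        # Ctrl+C to stop
        vm = psutil.virtual_memory()   # physical RAM
        sw = psutil.swap_memory()      # on Windows this reflects the pagefile
        print(f"RAM used: {vm.percent:5.1f}%  |  pagefile used: {sw.percent:5.1f}%")
        time.sleep(10)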

Let us know if that helps.

Keith Myers
Joined: 13 Dec 17
Posts: 981
Credit: 1,330,904,414
RAC: 1,484,419
Message 58908 - Posted: 11 Jun 2022 | 20:56:12 UTC

The problem is that Windows fully commits every memory reservation it grants, with no overcommitting, and the Python tasks request a ton of memory, more than the automatically sized paging file in Windows can cover.

The solution is to deselect automatic sizing and either choose "System managed size" or choose "Custom size" with a very large value, on the order of tens of gigabytes.
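
If you want to confirm the new setting took effect, you can read the commit limit (RAM plus pagefile) straight from Windows; here is a minimal standard-library sketch using the Win32 GlobalMemoryStatusEx call (Windows only):

    # Sketch: read the commit limit (RAM + pagefile) via GlobalMemoryStatusEx.
    import ctypes

    class MEMORYSTATUSEX(ctypes.Structure):
        _fields_ = [
            ("dwLength", ctypes.c_ulong),
            ("dwMemoryLoad", ctypes.c_ulong),
            ("ullTotalPhys", ctypes.c_ulonglong),
            ("ullAvailPhys", ctypes.c_ulonglong),
            ("ullTotalPageFile", ctypes.c_ulonglong),   # RAM + pagefile (commit limit)
            ("ullAvailPageFile", ctypes.c_ulonglong),   # commit still available
            ("ullTotalVirtual", ctypes.c_ulonglong),
            ("ullAvailVirtual", ctypes.c_ulonglong),
            ("ullAvailExtendedVirtual", ctypes.c_ulonglong),
        ]

    status = MEMORYSTATUSEX()
    status.dwLength = ctypes.sizeof(MEMORYSTATUSEX)
    ctypes.windll.kernel32.GlobalMemoryStatusEx(ctypes.byref(status))
    gib = 1024 ** 3
    print(f"Commit limit (RAM + pagefile): {status.ullTotalPageFile / gib:.1f} GiB")
    print(f"Commit still available:        {status.ullAvailPageFile / gib:.1f} GiB")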

Here is a good GitHub reply on the problem that concisely sums up the issue:


The issue is with how multi-process Python works on Windows with the pytorch/cuda DLLs. The number of workers you set in the DataLoader directly relates to how many Python processes are created.

Each time a Python process imports pytorch it loads several DLLs. These DLLs have very large sections of data in them that aren't really used, but space is reserved for them in memory anyway. We're talking in the range of hundreds of megabytes to a couple of gigabytes, per DLL.

When Windows is asked to reserve memory, if it says that it returned memory then it guarantees that memory will be available to you, even if you never end up using it.

Linux allows overcommitting. By default on Linux, when you ask it to reserve memory, it says "Yeah sure, here you go" and tells you that it reserved the memory. But it hasn't actually done this. It will reserve it when you try to use it, and hopes that there is something available at that time.

So, if you allocate memory on Windows, you can be sure you can use that memory. If you allocate memory on Linux, it is possible that when you actually try to use the memory that it will not be there, and your program will crash.

On Linux, when it spawns num_workers processes and each one reserves several gigabytes of data, Linux is happy to say it reserved this, even though it didn't. Since this "reserved memory" is never actually used, everything is good. You can create tons of worker processes. Just because pytorch allocated 50GB of memory, as long as it never actually uses it, it won't be a problem. (Note: I haven't actually run pytorch on Linux. I am just describing how Linux would not have this crash even if it attempted to allocate the same amount of memory. I do not know for a fact that pytorch/CUDA overallocate on Linux.)

On Windows, when you spawn num_workers processes and each one reserves several gigabytes of data, Windows insists that it can actually satisfy this request should the memory be used. So, if Python tries to allocate 50GB of memory, then your total RAM + page file size must have space for 50GB.

So, on Windows, NumPythonProcesses * MemoryPerProcess < RAM + PageFileSize must be true or you will hit this error.


The Python-on-GPU tasks spawn 32 worker processes.
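
For context, this is roughly what that looks like in PyTorch. It is only an illustrative sketch with a made-up dataset, not the actual GPUGRID task code; the point is that on Windows each DataLoader worker is a freshly spawned Python process that imports torch again and so reloads those big CUDA/cuDNN DLLs:

    # Illustrative only: a toy DataLoader showing how num_workers maps to processes.
    import torch
    from torch.utils.data import DataLoader, Dataset

    class ToyDataset(Dataset):
        """Made-up stand-in dataset, not GPUGRID's real workload."""
        def __len__(self):
            return 1024
        def __getitem__(self, idx):
            return torch.randn(8)

    if __name__ == "__main__":   # required on Windows, where workers are spawned
        # Each worker is a separate Python process that re-imports torch,
        # so its memory reservation counts against RAM + pagefile.
        loader = DataLoader(ToyDataset(), batch_size=32, num_workers=8)
        for batch in loader:
            pass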

So the equation is going to be: 32 * MemoryPerProcess < RAM + PageFileSize

And many of the PyTorch DLLs request a couple of gigabytes of memory allocation each.
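
As a back-of-the-envelope example (the per-worker figure and the RAM size here are assumptions for illustration, not measured values):

    # Rough arithmetic for the inequality above; numbers are illustrative guesses.
    workers = 32                  # worker processes spawned by the task
    commit_per_worker_gb = 2      # assumed reservation per process from the DLLs
    ram_gb = 32                   # example system RAM

    required_commit_gb = workers * commit_per_worker_gb        # 64 GB
    min_pagefile_gb = max(0, required_commit_gb - ram_gb)      # 32 GB in this example
    print(f"Pagefile needs to be at least ~{min_pagefile_gb} GB on this system")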

This is why it is difficult to run these tasks on GPUs: system RAM plus the pagefile is most often inadequate. The pagefile needs to be greatly increased, and doing that means you need a large chunk of storage real estate for the pagefile.


