Message boards : Number crunching : Recently Python app work unit failed with computing error

Author Message
Jari Kosonen
Joined: 5 May 22
Posts: 22
Credit: 8,923,305
RAC: 779
Message 59125 - Posted: 18 Aug 2022 | 11:01:06 UTC
Last modified: 18 Aug 2022 | 11:05:57 UTC

With this work unit:
e00009a00848-ABOU_rnd_ppod_expand_demos23-0-1-RND9304_0
this happened:
Error while computing 262,081.10

What does this mean, and what might be causing the error?

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1451
Credit: 3,575,929,351
RAC: 345,202
Message 59127 - Posted: 18 Aug 2022 | 11:31:56 UTC - in response to Message 59125.

Result 33000731 (easier that way)

It looks like it's failed and attempted a restart multiple times. The number (262,081.10) is the number of seconds it's wasted doing all that - not a happy bunny.
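
To put that figure in perspective, here is a minimal Python sketch that converts the reported seconds into hours and days. The variable names are illustrative only; the input value is the one shown in the error line above.

```python
# Convert the elapsed time reported for the failed task into hours and days.
elapsed_seconds = 262_081.10  # value shown next to "Error while computing"

elapsed_hours = elapsed_seconds / 3600
elapsed_days = elapsed_seconds / 86_400

# prints: 72.8 hours, or about 3.03 days
print(f"{elapsed_hours:.1f} hours, or about {elapsed_days:.2f} days")
```

So the task occupied the host for roughly three days before it was finally marked as failed.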

It's not immediately obvious what ultimately went wrong, but I'll keep looking.

Your GPU (NVIDIA GeForce MX250 (4042MB) driver: 510.73) is unusual and may be a little short on memory (6 GB is recommended). I'm also not an expert on the Debian drivers.

Keith Myers
Joined: 13 Dec 17
Posts: 1070
Credit: 1,450,990,714
RAC: 426,047
Message 59130 - Posted: 18 Aug 2022 | 17:24:14 UTC - in response to Message 59127.

That's a mobile GPU in a laptop. It's not usually recommended to even try running GPU tasks on a laptop.

Jari Kosonen
Joined: 5 May 22
Posts: 22
Credit: 8,923,305
RAC: 779
Message 59132 - Posted: 19 Aug 2022 | 0:29:13 UTC - in response to Message 59127.

If you find out what is causing the problem, please let me know as well.

Jari Kosonen
Joined: 5 May 22
Posts: 22
Credit: 8,923,305
RAC: 779
Message 59156 - Posted: 24 Aug 2022 | 9:34:43 UTC - in response to Message 59130.
Last modified: 24 Aug 2022 | 9:35:07 UTC

It looks like overheating, after which the GPU driver dropped out of operation:

Wed Aug 24 17:28:52 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05    Driver Version: 510.73.05    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:02:00.0 Off |                  N/A |
| N/A   93C    P0    N/A /  N/A |   3267MiB /  4096MiB |     94%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1491      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A      3604      C   bin/python                       3261MiB |
+-----------------------------------------------------------------------------+
Wed Aug 24 17:28:57 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05    Driver Version: 510.73.05    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:02:00.0 Off |                  N/A |
| N/A   90C    P0    N/A /  N/A |   3267MiB /  4096MiB |     11%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1491      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A      3604      C   bin/python                       3261MiB |
+-----------------------------------------------------------------------------+
Wed Aug 24 17:29:02 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05    Driver Version: 510.73.05    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  ERR! | 00000000:02:00.0 Off |                  N/A |
|ERR!  ERR! ERR!    N/A /  ERR! |     GPU is lost      |     ERR!        ERR! |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
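
Snapshots like these can also be collected in a machine-readable form. The following is a rough Python sketch, not a definitive monitor: it uses nvidia-smi's `--query-gpu` and `--format=csv,noheader,nounits` options, while the function names and the 85C warning threshold are assumptions chosen for illustration. Any failure to query or parse is treated as "GPU unavailable", which is also an assumption about how a lost GPU shows up.

```python
import subprocess

# Fields requested from nvidia-smi via --query-gpu.
QUERY_FIELDS = "temperature.gpu,utilization.gpu,memory.used,memory.total"


def parse_gpu_sample(line):
    """Parse one CSV line produced with --format=csv,noheader,nounits."""
    temp_c, util_pct, mem_used, mem_total = (f.strip() for f in line.split(","))
    return {
        "temp_c": int(temp_c),
        "util_pct": int(util_pct),
        "mem_used_mib": int(mem_used),
        "mem_total_mib": int(mem_total),
    }


def read_gpu_sample():
    """Query the first GPU once; return None if it cannot be read."""
    try:
        out = subprocess.run(
            ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        return parse_gpu_sample(out.splitlines()[0])
    except (OSError, subprocess.CalledProcessError, ValueError, IndexError):
        # Assumption: any failure here means the GPU is unavailable.
        return None


def is_too_hot(sample, limit_c=85):
    """limit_c is an illustrative threshold, not a vendor specification."""
    return sample["temp_c"] >= limit_c


# Values taken from the first snapshot above (93C, 94%, 3267MiB / 4096MiB).
sample = parse_gpu_sample("93, 94, 3267, 4096")
print(sample["temp_c"], is_too_hot(sample))
```

Calling `read_gpu_sample()` in a loop every few seconds would give a log similar to the one above, with `None` marking the moment the GPU drops out.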

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1451
Credit: 3,575,929,351
RAC: 345,202
Message 59157 - Posted: 24 Aug 2022 | 10:24:44 UTC - in response to Message 59156.

Putting the laptop on a cooling stand, with fans blowing air directly into the laptop's ventilation inlet slots, may help. The slots are often underneath, but check your particular machine.

And give them a good clean while you're down there!

KAMasud
Joined: 27 Jul 11
Posts: 59
Credit: 194,307,986
RAC: 281,159
Message 59158 - Posted: 24 Aug 2022 | 12:29:58 UTC

I run laptop GPUs. Take a small air blower to the slots and blow the dust out. I do it daily.

JohnMD
Joined: 4 Dec 10
Posts: 5
Credit: 3,672,356
RAC: 0
Message 59168 - Posted: 28 Aug 2022 | 20:05:27 UTC - in response to Message 59127.

It is clear that these GPU tasks fail on MX-series GPUs because of insufficient GPU memory.
I am unable to find such "requirements" (for example, the recommended 6 GB mentioned earlier); I can't even find details of the applications. Can anyone help?

Ian&Steve C.
Joined: 21 Feb 20
Posts: 744
Credit: 4,943,798,494
RAC: 524,854
Message 59169 - Posted: 28 Aug 2022 | 20:21:13 UTC - in response to Message 59168.

I can't speak for Windows behavior, but the last several Linux tasks I processed used a little more than 3GB per task when running 1 task per GPU.

With the help of a CUDA MPS server, I can run 2 tasks concurrently on a 6GB GTX 1060 card, as some of the memory gets shared.

I would say a minimum of 3GB is needed per task, and at least 4GB to be comfortable.

System memory use is also quite high: about 8GB of system memory per task.

But you should monitor it; the project could change the requirements at any time if they want to run larger jobs or change the direction of their research.
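
As a rough illustration of those rules of thumb, the following Python sketch classifies a host. The thresholds are the informal figures from this post (3GB minimum, 4GB comfortable, about 8GB of system memory per task), not official project requirements, and the function and constant names are hypothetical.

```python
# Informal per-task figures quoted in this thread; not official requirements.
GPU_MIN_GB = 3.0
GPU_COMFORTABLE_GB = 4.0
RAM_PER_TASK_GB = 8.0


def assess_host(gpu_mem_gb, system_ram_gb):
    """Return (GPU memory verdict, whether system RAM covers one task)."""
    if gpu_mem_gb < GPU_MIN_GB:
        gpu_verdict = "too small"
    elif gpu_mem_gb < GPU_COMFORTABLE_GB:
        gpu_verdict = "tight"
    else:
        gpu_verdict = "comfortable"
    return gpu_verdict, system_ram_gb >= RAM_PER_TASK_GB


# Hypothetical hosts, for illustration only.
print(assess_host(6.0, 16.0))
print(assess_host(3.0, 8.0))
print(assess_host(2.0, 4.0))
```

The check is deliberately simple: it covers a single task per GPU and ignores the memory sharing that an MPS server makes possible.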

goldfinch
Joined: 5 May 19
Posts: 14
Credit: 184,864,685
RAC: 286,558
Message 59173 - Posted: 29 Aug 2022 | 11:37:33 UTC
Last modified: 29 Aug 2022 | 11:46:50 UTC

I have 2 laptops, the older one running an Nvidia GTX 1060 and the newer one an RTX 3060. The older laptop, though reaching a GPU temperature of ~90C, completes most Python tasks, though it may take up to 40 hours. The newer laptop, however, fails the vast majority of Python tasks, though that started quite recently. For example:

29/08/2022 9:32:39 PM | GPUGRID | Computation for task e00008a01599-ABOU_rnd_ppod_expand_demos24_3-0-1-RND7566_0 finished
29/08/2022 9:32:39 PM | GPUGRID | [task] result state=COMPUTE_ERROR for e00008a01599-ABOU_rnd_ppod_expand_demos24_3-0-1-RND7566_0 from CS::app_finished

or another task, with more logging details:
29/08/2022 9:44:40 PM | GPUGRID | [task] Process for e00014a00897-ABOU_rnd_ppod_expand_demos24_3-0-1-RND0029_3 exited, exit code 195, task state 1
29/08/2022 9:44:40 PM | GPUGRID | [task] task_state=EXITED for e00014a00897-ABOU_rnd_ppod_expand_demos24_3-0-1-RND0029_3 from handle_exited_app
29/08/2022 9:44:40 PM | GPUGRID | [task] result state=COMPUTE_ERROR for e00014a00897-ABOU_rnd_ppod_expand_demos24_3-0-1-RND0029_3 from CS::report_result_error
29/08/2022 9:44:40 PM | GPUGRID | [task] Process for e00014a00897-ABOU_rnd_ppod_expand_demos24_3-0-1-RND0029_3 exited
29/08/2022 9:44:40 PM | GPUGRID | [task] exit code 195 (0xc3): (unknown error)
29/08/2022 9:44:43 PM | GPUGRID | Computation for task e00014a00897-ABOU_rnd_ppod_expand_demos24_3-0-1-RND0029_3 finished
29/08/2022 9:44:43 PM | GPUGRID | [task] result state=COMPUTE_ERROR for e00014a00897-ABOU_rnd_ppod_expand_demos24_3-0-1-RND0029_3 from CS::app_finished

What could be the reason? The GPU has enough graphics memory, and the laptop has 16 GB of RAM, the same as the older one. There is enough disk space, and the temperature doesn't rise above 55-60C.

Acemd3 tasks never fail on the newer laptop (at least, I don't remember such failures) and usually finish in under 15 hours. It's only Python, and only recently. Why could that be? What should I enable in the logs to diagnose better?
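
When many tasks fail, pulling the exit codes out of the event log by hand gets tedious. Here is a minimal Python sketch that does it, assuming log lines shaped like the ones pasted above (timestamp, project, message, separated by " | "). The regular expressions and the function name are assumptions based only on those sample lines.

```python
import re

# Pattern for lines such as "29/08/2022 9:44:40 PM | GPUGRID | [task] ...".
LINE_RE = re.compile(
    r"^(?P<when>\S+ \S+ [AP]M) \| (?P<project>[^|]+?) \| (?P<message>.*)$"
)
# Matches "exit code 195 (0xc3)" inside the message part.
EXIT_RE = re.compile(r"exit code (?P<code>\d+) \((?P<hex>0x[0-9a-fA-F]+)\)")


def find_exit_codes(log_text):
    """Return (timestamp, decimal exit code, hex string) for each match."""
    results = []
    for line in log_text.splitlines():
        line_match = LINE_RE.match(line.strip())
        if not line_match:
            continue
        exit_match = EXIT_RE.search(line_match.group("message"))
        if exit_match:
            results.append((
                line_match.group("when"),
                int(exit_match.group("code")),
                exit_match.group("hex"),
            ))
    return results


sample_log = """\
29/08/2022 9:44:40 PM | GPUGRID | [task] Process for e00014a00897-ABOU_rnd_ppod_expand_demos24_3-0-1-RND0029_3 exited, exit code 195, task state 1
29/08/2022 9:44:40 PM | GPUGRID | [task] exit code 195 (0xc3): (unknown error)
"""
print(find_exit_codes(sample_log))
```

The exit code alone only says that the app ended abnormally; the actual Python exception is in the task's stderr output on the result page.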

Ian&Steve C.
Joined: 21 Feb 20
Posts: 744
Credit: 4,943,798,494
RAC: 524,854
Message 59174 - Posted: 29 Aug 2022 | 12:16:32 UTC - in response to Message 59173.

This is the more specific error you are getting on that system:
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution


And I've also seen this in your errors:
RuntimeError: [enforce fail at C:\cb\pytorch_1000000000000\work\c10\core\impl\alloc_cpu.cpp:81] data. DefaultCPUAllocator: not enough memory: you tried to allocate 3612672 bytes.


Researching your first error, and considering the second, it's likely that memory is your problem. Are you trying to run multiple tasks at a time? If not, others with Windows systems have mentioned that increasing the pagefile size solved their issues. Have you tried that as well?
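
One detail worth noting in the second error is how small the failed request was. A short Python sketch (variable names are illustrative) extracts the byte count from the message and converts it:

```python
import re

# Tail of the second error message quoted above.
error_text = (
    "DefaultCPUAllocator: not enough memory: "
    "you tried to allocate 3612672 bytes."
)

match = re.search(r"tried to allocate (\d+) bytes", error_text)
requested_bytes = int(match.group(1))
requested_mib = requested_bytes / 1024**2

# prints: 3.45 MiB
print(f"{requested_mib:.2f} MiB")
```

A request of about 3.4 MiB failing is consistent with the system having run out of memory to commit overall, rather than with one oversized allocation.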

goldfinch
Joined: 5 May 19
Posts: 14
Credit: 184,864,685
RAC: 286,558
Message 59175 - Posted: 29 Aug 2022 | 21:52:41 UTC - in response to Message 59174.

Thank you so much! Well, I suspected the memory, but both laptops have a Windows-managed pagefile, which is set to 40+GB. Is that enough, or should I increase it even more on the newer system? Besides, if it's Windows-managed, doesn't the pagefile size increase automatically on demand, though that may cause issues for processes that request memory allocation? Anyway, I have now set the pagefile to 64GB, as someone did in a similar situation. I expect it to be enough for the WUs to complete without issues. Thanks again.

Keith Myers
Joined: 13 Dec 17
Posts: 1070
Credit: 1,450,990,714
RAC: 426,047
Message 59176 - Posted: 30 Aug 2022 | 1:34:08 UTC

The problem with a Windows-managed pagefile is that it probably doesn't increase its size immediately when the Python application starts loading all its spawned processes.

It probably responds initially to the initiating BOINC wrapper app, but that then loads the Python libraries, which have huge memory footprints on Windows.

So the pagefile might not be large enough at the time that the Python-dependent libraries are requesting lots of memory allocation space.

I have been recommending that Windows users just set a custom static-sized pagefile of 32GB min - 64GB max, and that seems to cover the Python application so that tasks complete successfully.

But with your current 64GB size, you have probably resolved the issue.
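
To see how much memory Windows is currently willing to commit (physical RAM plus pagefile), one option is the Win32 `GlobalMemoryStatusEx` call. The sketch below uses it through `ctypes` from the Python standard library. The function name `commit_status_gb` is hypothetical, and on non-Windows systems it simply returns `None`.

```python
import ctypes
import sys


class MEMORYSTATUSEX(ctypes.Structure):
    """Field layout of the Win32 MEMORYSTATUSEX structure."""
    _fields_ = [
        ("dwLength", ctypes.c_uint32),
        ("dwMemoryLoad", ctypes.c_uint32),
        ("ullTotalPhys", ctypes.c_uint64),
        ("ullAvailPhys", ctypes.c_uint64),
        ("ullTotalPageFile", ctypes.c_uint64),
        ("ullAvailPageFile", ctypes.c_uint64),
        ("ullTotalVirtual", ctypes.c_uint64),
        ("ullAvailVirtual", ctypes.c_uint64),
        ("ullAvailExtendedVirtual", ctypes.c_uint64),
    ]


def commit_status_gb():
    """Return (commit limit, available commit) in GB, or None off Windows."""
    if sys.platform != "win32":
        return None
    status = MEMORYSTATUSEX()
    status.dwLength = ctypes.sizeof(MEMORYSTATUSEX)
    if not ctypes.windll.kernel32.GlobalMemoryStatusEx(ctypes.byref(status)):
        raise ctypes.WinError()
    gb = 1024**3
    # ullTotalPageFile is the commit limit, not just the pagefile size.
    return status.ullTotalPageFile / gb, status.ullAvailPageFile / gb


print(commit_status_gb())
```

Checking this while a Python task is starting up would show whether the available commit drops close to zero, which is the situation a larger static pagefile is meant to avoid.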

Luigi R.
Joined: 15 Jul 14
Posts: 5
Credit: 53,289,148
RAC: 450
Message 59177 - Posted: 30 Aug 2022 | 6:59:55 UTC - in response to Message 59169.

On Linux I can still successfully run Python tasks on a GTX 1060 3GB.
Using Intel graphics for the display helps.
I noticed that the Python app's memory usage also decreased by ~100MiB when I moved the Xorg process to the Intel GPU. I don't know exactly why.

Tue Aug 30 08:22:46 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:02:00.0 Off |                  N/A |
| 28%   60C    P2    47W /  60W |   2740MiB /  3019MiB |     95%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   1672379      C   bin/python                       2731MiB |
+-----------------------------------------------------------------------------+


Before I switched to Intel graphics, it was about:
Xorg 169 MiB
xfwm4 2 MiB
Python 2835 MiB
If I'm not wrong.
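
Putting those approximate figures side by side shows how little headroom there was on the 3GB card. A small Python sketch of the arithmetic, using the per-process numbers quoted in this post (the "before" values are from memory, so treat the result as indicative only):

```python
TOTAL_MIB = 3019  # total memory reported by nvidia-smi for this card

# Per-process GPU memory in MiB, as quoted in this post.
before = {"Xorg": 169, "xfwm4": 2, "python": 2835}
after = {"python": 2731}

headroom_before = TOTAL_MIB - sum(before.values())
headroom_after = TOTAL_MIB - sum(after.values())
python_saving = before["python"] - after["python"]

# prints: 13 288 104
print(headroom_before, headroom_after, python_saving)
```

By these numbers, moving the display to the Intel GPU took the free memory from roughly 13 MiB to roughly 288 MiB.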

goldfinch
Joined: 5 May 19
Posts: 14
Credit: 184,864,685
RAC: 286,558
Message 59179 - Posted: 31 Aug 2022 | 1:50:14 UTC - in response to Message 59176.

Thanks for the details. I was thinking about setting the min size of the pagefile, but then decided to avoid system hiccups during pagefile size increases. After all, that's Windows (:
Since then my 2nd laptop has been processing tasks without issues. Once again, thanks to all who pointed me in the right direction.

[CSF] Aleksey Belkov
Joined: 26 Dec 13
Posts: 72
Credit: 913,982,022
RAC: 57,086
Message 59189 - Posted: 3 Sep 2022 | 0:17:11 UTC

Can anyone give some insight into the reason for these WU crashes?

https://www.gpugrid.net/result.php?resultid=33020923
https://www.gpugrid.net/result.php?resultid=33021416
https://www.gpugrid.net/result.php?resultid=33020419

I couldn't find any understandable clues for myself : /

Ian&Steve C.
Joined: 21 Feb 20
Posts: 744
Credit: 4,943,798,494
RAC: 524,854
Message 59190 - Posted: 3 Sep 2022 | 0:45:40 UTC - in response to Message 59189.
Last modified: 3 Sep 2022 | 0:46:07 UTC

Look at the error below the traceback section.

First one:
ValueError: Object arrays cannot be loaded when allow_pickle=False


Second and third ones:
BrokenPipeError: [WinError 232] The pipe is being closed


In at least two of these cases, other people who ran the same WUs also had errors. Likely to just be a problem with the task itself and not your system.
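
Finding "the error below the traceback" can be automated when there are many results to check. Here is a minimal Python sketch that returns the last exception line from a block of stderr text. The sample text is a shortened, hypothetical traceback (the file name `run.py` is invented), and the pattern only recognises names ending in `Error` or `Exception`.

```python
import re

# Matches lines such as "ValueError: some message".
EXC_RE = re.compile(
    r"^(?P<type>[A-Za-z_][\w.]*(?:Error|Exception)): (?P<msg>.*)$"
)


def last_exception(stderr_text):
    """Return (exception type, message) for the last matching line, or None."""
    found = None
    for line in stderr_text.splitlines():
        match = EXC_RE.match(line.strip())
        if match:
            found = (match.group("type"), match.group("msg"))
    return found


# Hypothetical, shortened stderr; only the last line is from this thread.
sample_stderr = """\
Traceback (most recent call last):
  File "run.py", line 1, in <module>
BrokenPipeError: [WinError 232] The pipe is being closed
"""
print(last_exception(sample_stderr))
```

Comparing that final line across everyone who ran the same WU is a quick way to tell a bad task from a host problem.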

[CSF] Aleksey Belkov
Joined: 26 Dec 13
Posts: 72
Credit: 913,982,022
RAC: 57,086
Message 59191 - Posted: 3 Sep 2022 | 1:44:34 UTC - in response to Message 59190.

Thanks a lot!

[CSF] Aleksey Belkov
Joined: 26 Dec 13
Posts: 72
Credit: 913,982,022
RAC: 57,086
Message 59194 - Posted: 5 Sep 2022 | 9:13:17 UTC

There are more and more problem WUs -_-

https://www.gpugrid.net/result.php?resultid=33022242

BrokenPipeError: [WinError 232] The pipe is being closed

https://www.gpugrid.net/result.php?resultid=33022703
BrokenPipeError: [WinError 109] The pipe has been ended

jjch
Joined: 10 Nov 13
Posts: 91
Credit: 15,040,000,871
RAC: 1,015,809
Message 59195 - Posted: 5 Sep 2022 | 19:16:20 UTC - in response to Message 59194.

I have been seeing a number of the same failures for quite some time as well.

https://www.gpugrid.net/result.php?resultid=33024302
https://www.gpugrid.net/result.php?resultid=33024910
https://www.gpugrid.net/result.php?resultid=33025156

It works out to about a 26% error rate.

There really isn't a good indication of why these are failing, but I would expect it's either a Python error or something inherent in the simulations that doesn't work.

The scientists will need to figure out these issues and let us know.


Jeff

mrchips
Joined: 9 May 21
Posts: 6
Credit: 6,820,500
RAC: 2,411
Message 59572 - Posted: 9 Nov 2022 | 12:22:03 UTC

My tasks are still failing with computation errors.
How long before the bad WUs get out of the queue?

jjch
Joined: 10 Nov 13
Posts: 91
Credit: 15,040,000,871
RAC: 1,015,809
Message 59574 - Posted: 10 Nov 2022 | 23:54:40 UTC - in response to Message 59572.

The WUs that you have processed are failing with a CUDA error. Your GPU driver version, 512.15, is somewhat outdated. I would suggest updating to the latest driver version, 526.86.

Download the driver directly from NVIDIA https://www.nvidia.com/en-us/drivers/results/194176/

I would suggest you fully uninstall the driver using the DDU utility. https://www.guru3d.com/files-get/display-driver-uninstaller-download,9.html
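
For anyone scripting a driver check across several hosts, a minimal Python sketch of the version comparison follows. The two version strings are the ones mentioned in this post; the helper name is hypothetical. Versions are compared as tuples of integers because plain string comparison would order something like "96.43" after "512.15".

```python
def parse_version(text):
    """Turn a dotted driver version such as '512.15' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


installed = parse_version("512.15")
suggested = parse_version("526.86")
needs_update = installed < suggested

# prints: True
print(needs_update)
```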
