Message boards : Number crunching : GPU-Utilization low (& variating)

Jari Kosonen
Joined: 5 May 22
Posts: 22
Credit: 8,923,305
RAC: 779
Message 58902 - Posted: 11 Jun 2022 | 17:36:19 UTC

Sun Jun 12 01:33:47 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07 Driver Version: 515.48.07 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:02:00.0 Off | N/A |
| N/A 69C P0 N/A / N/A | 3267MiB / 4096MiB | 55% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2666 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 12881 C bin/python 3261MiB |
+-----------------------------------------------------------------------------+
Sun Jun 12 01:33:57 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07 Driver Version: 515.48.07 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:02:00.0 Off | N/A |
| N/A 68C P0 N/A / N/A | 3267MiB / 4096MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2666 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 12881 C bin/python 3261MiB |
+-----------------------------------------------------------------------------+
Sun Jun 12 01:34:07 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07 Driver Version: 515.48.07 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:02:00.0 Off | N/A |
| N/A 67C P0 N/A / N/A | 3267MiB / 4096MiB | 6% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2666 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 12881 C bin/python 3261MiB |
+-----------------------------------------------------------------------------+
Sun Jun 12 01:34:17 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07 Driver Version: 515.48.07 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:02:00.0 Off | N/A |
| N/A 66C P0 N/A / N/A | 3267MiB / 4096MiB | 11% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2666 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 12881 C bin/python 3261MiB |
+-----------------------------------------------------------------------------+
Sun Jun 12 01:34:27 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07 Driver Version: 515.48.07 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:02:00.0 Off | N/A |
| N/A 69C P0 N/A / N/A | 3267MiB / 4096MiB | 4% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2666 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 12881 C bin/python 3261MiB |
+-----------------------------------------------------------------------------+


In different samples the GPU utilization shows low, even below 10%, and it fluctuates.
Is this normal?
Some other applications (such as Collatz) were using the NVIDIA GPU at 100% utilization.
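For anyone who wants to reproduce this kind of sampling without eyeballing full nvidia-smi dumps, here is a minimal sketch (my own, not from the post) that parses nvidia-smi's CSV query output and summarizes the utilization samples. It assumes `nvidia-smi` is on the PATH; the query flags used are the standard `--query-gpu`/`--format` options.

```python
import statistics
import subprocess
import time


def parse_util(csv_line: str) -> int:
    """Parse one line of `nvidia-smi --query-gpu=utilization.gpu
    --format=csv,noheader,nounits` output, e.g. '55' or '55 %'."""
    return int(csv_line.strip().rstrip("%").strip())


def summarize(samples):
    """Return (min, max, mean) of a list of utilization percentages."""
    return min(samples), max(samples), statistics.mean(samples)


def sample_gpu(interval_s: float = 10.0, count: int = 5):
    """Poll GPU 0 `count` times, `interval_s` apart (requires nvidia-smi)."""
    samples = []
    for _ in range(count):
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=utilization.gpu",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True).stdout
        samples.append(parse_util(out.splitlines()[0]))
        time.sleep(interval_s)
    return samples
```

Running `summarize(sample_gpu())` over the five samples shown above would report utilization swinging between 0% and 55%, which is the "variating" behaviour being asked about.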

Ian&Steve C.
Joined: 21 Feb 20
Posts: 744
Credit: 4,943,798,494
RAC: 524,854
Message 58904 - Posted: 11 Jun 2022 | 18:56:07 UTC - in response to Message 58902.

Yes, this is normal for the Python application.

Jari Kosonen
Message 58912 - Posted: 12 Jun 2022 | 7:35:06 UTC - in response to Message 58904.

With another application (acemd3) the GPU utilization is continuously 100%, but it looks like the run time would be 5-6 days per work unit:

Sun Jun 12 15:34:45 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07 Driver Version: 515.48.07 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:02:00.0 Off | N/A |
| N/A 71C P0 N/A / N/A | 180MiB / 4096MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2666 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 40431 C bin/acemd3 174MiB |
+-----------------------------------------------------------------------------+

Ian&Steve C.
Message 58914 - Posted: 12 Jun 2022 | 12:56:34 UTC - in response to Message 58912.

Yes, it's normal. acemd3 has full utilization; Python has low/intermittent utilization.

Jari Kosonen
Message 58938 - Posted: 18 Jun 2022 | 3:13:04 UTC - in response to Message 58914.
Last modified: 18 Jun 2022 | 3:13:46 UTC

It looks like the GPU/driver gets kicked out while running acemd3:

Sat Jun 18 11:10:09 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07 Driver Version: 515.48.07 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:02:00.0 Off | N/A |
| N/A 93C P0 N/A / N/A | 180MiB / 4096MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2745 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 5433 C bin/acemd3 174MiB |
+-----------------------------------------------------------------------------+
Sat Jun 18 11:10:14 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07 Driver Version: 515.48.07 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... ERR! | 00000000:02:00.0 Off | N/A |
|ERR! ERR! ERR! N/A / ERR! | GPU is lost | ERR! ERR! |
| | | ERR! |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+


Is it related to overheating or to insufficient power?

Keith Myers
Joined: 13 Dec 17
Posts: 1070
Credit: 1,450,990,714
RAC: 426,047
Message 58939 - Posted: 18 Jun 2022 | 6:38:48 UTC

Looks like the card fell off the bus.

Have you looked at the system logs to determine why?

Normally you can find the reason for memory corruption in the system logs and for communications issues in the dmesg logs.

What about backleveling to the stable 510 drivers?
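As an illustration of what to look for (my sketch, not Keith's exact procedure), kernel-log lines from `dmesg` or `/var/log/syslog` can be scanned for NVRM Xid events; Xid 79 is the "GPU has fallen off the bus" code that shows up later in this thread.

```python
import re

# NVRM Xid lines look like:
#   NVRM: Xid (PCI:0000:02:00): 79, pid=..., name=..., GPU has fallen off the bus.
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+), (.*)")


def find_xid_events(log_lines):
    """Return (bus_id, xid_number, detail) tuples for every NVRM Xid line."""
    events = []
    for line in log_lines:
        m = XID_RE.search(line)
        if m:
            events.append((m.group(1), int(m.group(2)), m.group(3).strip()))
    return events
```

Typical use would be `find_xid_events(open("/var/log/syslog"))`, then checking the Xid numbers against NVIDIA's Xid error table.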

Jari Kosonen
Message 58940 - Posted: 18 Jun 2022 | 6:59:01 UTC - in response to Message 58939.
Last modified: 18 Jun 2022 | 7:01:10 UTC

Jun 18 11:02:31 mx kernel: [130120.250658] NVRM: GPU at PCI:0000:02:00: GPU-793994bc-1295-4395-dc48-7dd3d7b431e2
Jun 18 11:02:31 mx kernel: [130120.250663] NVRM: Xid (PCI:0000:02:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Jun 18 11:02:31 mx kernel: [130120.250665] NVRM: GPU 0000:02:00.0: GPU has fallen off the bus.
Jun 18 11:02:31 mx kernel: [130120.250687] NVRM: A GPU crash dump has been created. If possible, please run
Jun 18 11:03:58 mx kernel: [ 7.229407] [drm] [nvidia-drm] [GPU ID 0x00000200] Loading driver
Jun 18 11:10:14 mx kernel: [ 396.557669] NVRM: GPU at PCI:0000:02:00: GPU-793994bc-1295-4395-dc48-7dd3d7b431e2
Jun 18 11:10:14 mx kernel: [ 396.557674] NVRM: Xid (PCI:0000:02:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Jun 18 11:10:14 mx kernel: [ 396.557676] NVRM: GPU 0000:02:00.0: GPU has fallen off the bus.
Jun 18 11:10:14 mx kernel: [ 396.557697] NVRM: A GPU crash dump has been created. If possible, please run
Jun 18 20:11:43 mx kernel: [ 6.301529] [drm] [nvidia-drm] [GPU ID 0x00000200] Loading driver



Yes, it says the GPU has fallen off the bus.

It could be difficult to downgrade the driver version.
Not sure why this occurs; maybe the Linux driver also draws more power than the Windows driver.

Jari Kosonen
Message 58941 - Posted: 18 Jun 2022 | 7:58:34 UTC - in response to Message 58939.

I would think it is overheating related: after startup, while the temperature stays below 92 °C it does not crash, but it crashes quite soon after 93 °C is reached.

Sat Jun 18 15:56:49 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05 Driver Version: 510.73.05 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:02:00.0 Off | N/A |
| N/A 93C P0 N/A / N/A | 180MiB / 4096MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 15623 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 17273 C bin/acemd3 174MiB |
+-----------------------------------------------------------------------------+
Sat Jun 18 15:56:54 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05 Driver Version: 510.73.05 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... ERR! | 00000000:02:00.0 Off | N/A |
|ERR! ERR! ERR! N/A / ERR! | GPU is lost | ERR! ERR! |
| | | ERR! |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

Keith Myers
Message 58942 - Posted: 18 Jun 2022 | 18:20:26 UTC

92 °C is, I believe, the standard throttle temperature for current Nvidia cards.

The question is why the card is getting so hot.

Are the fans on the card not ramping up in speed to accommodate the higher temps under the acemd3 compute load?

Looking at your hosts and the cards listed, I assume you are using a laptop, since I see a mobile MX variant listed. Have you tried one of the laptop cooling pads with built-in fans to assist the laptop's cooling capabilities?

Have you overclocked the card? Have you set the card fan speeds to 100%?

Reduce the core clock and memory clock speeds if possible, though that may be difficult on a laptop.
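One software-side mitigation (my suggestion, assuming a standard BOINC client install) is to watch the GPU temperature and suspend GPU work before the 93 °C cliff, using `boinccmd --set_gpu_mode never`, then resume once the card has cooled. The thresholds below are illustrative; the hysteresis keeps the mode from flapping.

```python
import subprocess


def gpu_mode_for_temp(temp_c, running, high=85, low=75):
    """Hysteresis: stop GPU work at/above `high` C, resume at/below `low` C.
    Returns 'never', 'auto', or None (no change needed)."""
    if running and temp_c >= high:
        return "never"
    if not running and temp_c <= low:
        return "auto"
    return None


def apply_mode(mode):
    """Tell the local BOINC client to change GPU mode.
    boinccmd ships with the BOINC client; --set_gpu_mode takes
    always/auto/never plus a duration in seconds (0 = until changed)."""
    subprocess.run(["boinccmd", "--set_gpu_mode", mode, "0"], check=True)
```

A cron job or loop would feed `gpu_mode_for_temp` with the temperature read from nvidia-smi and call `apply_mode` only when it returns a value.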

ServicEnginIC
Joined: 24 Sep 10
Posts: 520
Credit: 2,275,558,465
RAC: 227
Message 58945 - Posted: 18 Jun 2022 | 22:11:39 UTC - in response to Message 58941.

I would think it is overheating related: after startup, while the temperature stays below 92 °C it does not crash, but it crashes quite soon after 93 °C is reached.

I agree that the problem is most likely caused by overheating.
The GPU detaching from the bus at 93 °C could have two causes:
- A hard-coded GPU self-protection mechanism kicking in.
- An electromechanical problem, such as solder approaching its melting point at a GPU pin and losing good electrical conductivity (very dangerous in the long run!).

IMHO, ACEMD3/ACEMD4 tasks, which are highly optimized to squeeze maximum power from the GPU, should be run (if at all) only on very well cooled laptops.
Keith Myers' wise advice above could be useful.
Python tasks currently demand less GPU power, making them more appropriate for laptops that meet the requirements.
Which apps to run can be selected on the project preferences page.

Additionally, I've covered laptop overheating problems specifically in:
-Message #52937
-Message #57435

Jari Kosonen
Message 58946 - Posted: 19 Jun 2022 | 3:46:31 UTC - in response to Message 58945.
Last modified: 19 Jun 2022 | 3:46:43 UTC

There is probably very little to do if the laptop cooling is not good enough for this high GPU load.
There is one fan on top of the CPU inside this laptop and some type of heat pipe from the GPU to the CPU fan area, but there is no separate fan for the GPU.

Keith Myers
Message 58947 - Posted: 19 Jun 2022 | 17:54:38 UTC - in response to Message 58946.

Deselect the acemd3 tasks. They use all of a GPU's capabilities and will overwhelm a laptop's weak heat-pipe cooling.

The only other option is to try one of the laptop cooling solutions. Definitely elevate the laptop off its table or resting place to get airflow through the intake vents and out the side or back vents; assisted airflow to the intake vents is needed.

Try opening the laptop so the two sides make a V and standing it up vertically on the V. I saw pictures of mining operations using that method on laptops last year; it creates a natural chimney cooling effect.

Or change to the Python-on-GPU tasks if you have at least 32 GB of memory and lots of hard drive space. They use the GPU only occasionally, in quick bursts, so it does not get hot at all.

If you have neither, it's best to move on to other CPU and GPU projects, as your hardware is insufficient to run GPUGrid.

Jari Kosonen
Message 58954 - Posted: 21 Jun 2022 | 11:16:06 UTC - in response to Message 58947.

Based on the discussion below, they also suggest the laptop has a "defective GPU":
https://forums.developer.nvidia.com/t/unable-to-determine-the-device-handle-for-gpu-000000-0-unknown-error/214227/8

Jari Kosonen
Message 58955 - Posted: 21 Jun 2022 | 15:42:35 UTC - in response to Message 58947.

It looks like the GPU is drawing more power than the 90 W charger can provide, and the battery level goes down even while the laptop is plugged in.
The original adapter was 65 W, but even with the 90 W adapter the battery drains slowly.
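The drain can be confirmed from Linux's sysfs battery interface rather than by watching the charge indicator; here is a rough sketch (paths vary by machine, and `BAT0` is an assumption):

```python
from pathlib import Path


def battery_draining(batt_dir="/sys/class/power_supply/BAT0"):
    """True if the battery reports 'Discharging', which on a plugged-in
    laptop means the load exceeds what the adapter can supply.
    Reads the standard sysfs `status` attribute."""
    status = Path(batt_dir, "status").read_text().strip()
    return status == "Discharging"


def drain_watts(batt_dir="/sys/class/power_supply/BAT0"):
    """Approximate discharge power in watts, if the kernel exposes
    `power_now` (microwatts); some batteries expose current_now and
    voltage_now instead, which is not handled in this sketch."""
    uw = int(Path(batt_dir, "power_now").read_text())
    return uw / 1_000_000
```

If `battery_draining()` stays True under acemd3 load with the charger attached, the adapter is undersized for the sustained draw, which matches the 65 W vs 90 W observation above.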

jjch
Joined: 10 Nov 13
Posts: 91
Credit: 15,040,000,871
RAC: 1,015,809
Message 58959 - Posted: 23 Jun 2022 | 0:24:10 UTC

You may still need a higher-power charger. Depending on the laptop, the next step up is probably 130 W.

That being said, I would hate for you to fry your laptop computing for science.
While it's nice to contribute to society, it's not so fun to repair or replace your system.

There are some things you can do to maximize cooling and reduce heat and power draw if you haven't already done them.

Make sure your laptop cooling system is completely clean. No dust bunnies in the fan, heatsink or other vents. Carefully use a vacuum and/or canned air.

Also, make sure you are not blocking any of the ventilation with the surface you are setting it on or other items. It really should be on a cooling pad if you can get one.

It helps if the ambient temperature is reasonable. The cooler the better but it doesn't have to be freezing. If you are sweating your laptop is cooking itself.

If you are so inclined, you may be able to refresh the thermal paste with something better. Be careful: don't do this unless you know how to take your laptop apart and put it back together properly.

Another thing you might try is downclocking the GPU, and maybe the CPU, to reduce heat generation and power draw. If you don't already have one, there are a number of programs available to do this.

I hope a few of these things may be helpful for you.
