Docker, Docker Compose & Nvidia GPUs
Listing GPU processes in a cuda container; Restricting CUDA devices in Docker Compose
Context
Using CUDA containers has become very easy thanks to NVIDIA/nvidia-container-toolkit, which superseded the now deprecated NVIDIA/nvidia-container-runtime, and whose installation is straightforward: Installing the NVIDIA Container Toolkit.
Now, say you have installed everything correctly and are running an nvidia docker container. Is there any behaviour that might differ from the host? Unfortunately, to this date, yes. Though it could be considered a minor issue, nvidia-smi and even parts of nvml break a bit inside an nvidia container.
Secondly, say you want to run an nvidia container as a service through Docker Compose. Is there any way to potentially restrict the GPUs that the service/container will be able to access?
All the tests used here are available in the following repo:
To run all the tests and get the logs in ./logs:
make log-all-tests
The associated Docker images are available on Dockerhub: DH rr5555/gpu_access_test, but can be remade and modified if needed using the Dockerfiles present and the associated commands in the Makefile (which indicates which Dockerfile corresponds to which image tag). These are just nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04 variations with some Python packages installed.
Before delving into the details and the tests, please note that:
For the control experiment, for the runtime: nvidia arg to be accepted:
’’
- Docker compose setup for a host machine

Modify daemon.json file at /etc/docker/daemon.json. If the file does not exist, it should be created. File with no additional config would look like below:

```json
{
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```
’’
nvidia-smi & nvml in an nvidia container
Problem
Running an nvidia container, and from inside the container, executing:
nvidia-smi
will result in the GPUs being displayed, but no processes being listed, as can be verified in the test.
What is the cause and is there any solution?
Search
- No processes display when I using nvidia-smi. #759 => Frequently Asked Questions - Why is nvidia-smi inside the container not listing the running processes?: not accessible anymore
- Is it correct that “nvidia-smi” on Docker does not show “Processes”?:
’’ Yes, you will not be able to see, due to driver not being aware of PID namespace. You can peruse the thread and the work-around using Python in particular, at
How to show processes in container with cmd nvidia-smi? #179 ‘’
’’ A shim driver allows in-docker nvidia-smi showing correct process list without modify anything.
matpool/mpu ‘’
What is a shim?
’’ In computer programming, a shim is a library that transparently intercepts API calls and changes the arguments passed, handles the operation itself or redirects the operation elsewhere.[1][2] Shims can be used to support an old API in a newer environment, or a new API in an older environment. Shims can also be used for running programs on different software platforms than they were developed for.
Shims for older APIs typically come about when the behavior of an API changes, thereby causing compatibility issues for older applications which still rely on the older functionality; in such cases, the older API can still be supported by a thin compatibility layer on top of the newer code. Shims for newer APIs are defined as: “a library that brings a new API to an older environment, using only the means of that environment.”[3] ’’ Shim (computing)
- How to show processes in container with cmd nvidia-smi? #179
’’ Oh sorry, I guess I missed the topic on this one a bit.
Anyway, if using python is an option for you:
```python
import os
import sys
import pynvml as N

MB = 1024 * 1024

def get_usage(device_index, my_pid):
    N.nvmlInit()
    handle = N.nvmlDeviceGetHandleByIndex(device_index)
    usage = [nv_process.usedGpuMemory // MB
             for nv_process in N.nvmlDeviceGetComputeRunningProcesses(handle)
             + N.nvmlDeviceGetGraphicsRunningProcesses(handle)
             if nv_process.pid == my_pid]
    if len(usage) == 1:
        usage = usage[0]
    else:
        raise KeyError("PID not found")
    return usage

if __name__ == "__main__":
    # Cast the arguments: sys.argv entries are strings, but the device index and
    # the PID comparison need integers.
    print(get_usage(int(sys.argv[1]), int(sys.argv[2])))
```
Instead of calling nvidia-smi from your process you could just call this little script with the device-index of your GPU and the processe’s PID as arguments using popen or anything else and reading from stdout. ‘’
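To make that suggestion concrete, here is a minimal sketch of the wrapper approach. It assumes the snippet above has been saved as /code/get_usage.py inside the container (a hypothetical path), with the int casts shown:

```python
# Minimal sketch (assumption: the quoted snippet is saved as /code/get_usage.py
# inside the container, with its arguments cast to int).
import os
import subprocess

# Ask the helper script how much GPU memory (in MB) this process uses on device 0.
# Note: without pid: host, os.getpid() returns a container PID that will not match
# the host PIDs NVML reports (the crux of this article), so this only works on the
# host or in a container started with --pid=host.
result = subprocess.run(
    ["python3", "/code/get_usage.py", "0", str(os.getpid())],
    capture_output=True,
    text=True,
)
print(result.stdout.strip())
```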
’’ FYI: As a workaround I start the docker container with the --pid=host flag which works perfectly fine with e.g. python’s os.getpid(). ‘’
Dirty work-around:
docker run --pid=host
’’ Adding hostPID might not be a good idea for k8s clusters in which the GPU workers are shared among multiple tenants.
The users can potentially kill processes that belong to others if they are root in the container and they can see all the pids on the host.
Still hope that this can be patched at driver level to add PID namespace perception. ‘’
’’ options available to you, I am closing this:
- add hostPID: true to pod spec
- for docker (rather than Kubernetes) run with --privileged or --pid=host. This is useful if you need to run nvidia-smi manually as an admin for troubleshooting.
- set up MIG partitions on a supported card
Some of the information people want to see (like memory and GPU usage) might be better represented in a monitor (Prometheus) where a graph can provide better visualization of multiple nodes in a cluster. I would love to see a way to monitor processes cluster-wide, and not simply for a particular node. (a bit out of scope for this issue)
If there is a specific feature or enhancement to one of the options above, please open a new issue. ‘’
’’ I stumbled across this ticket while looking for a way to get information in the Kubernetes cluster about which pod/container is using which NVIDIA GPU, even if the pod is not running in hostPID mode. I have developed a tool to get this information. In case it will be useful for someone:
https://github.com/dmrub/kube-utils/blob/master/kube-nvidia-get-processes.py
Note that the tool runs a privileged pod on each node with GPU to map the node PID to the container PID.
Here is the sample output:
```
NODE               POD                             NAMESPACE  NODE_PID  PID    GPU  GPU_NAME              PCI_ADDRESS       GPU_MEMORY  CMDLINE
proj-vm-1-node-02  pytorch-nvidia-794d7bb8d-mb7qp  jodo01     27301     24927  0    Tesla V100-PCIE-32GB  00000000:00:07.0  813 MiB     /opt/conda/bin/python -m ipykernel_launcher -f /root/.local/share/jupyter/runtime/kernel-9a103c15-db05-4427-a850-cc79412cb4e8.json
proj-vm-1-node-02  pytorch-nvidia-794d7bb8d-mb7qp  jodo01     7744      25151  0    Tesla V100-PCIE-32GB  00000000:00:07.0  5705 MiB    /opt/conda/bin/python -m ipykernel_launcher -f /root/.local/share/jupyter/runtime/kernel-4049593b-313b-444f-8188-a33875108b10.json
```
’’
’’ I made a kernel module that patches the nvidia driver for correct PID namespace handling: https://github.com/gh2o/nvidia-pidns ‘’
- container PID namespace isolation with NVML #63
’’ We’d like to deploy NVML-based monitoring tools to each task container, providing GPU information for ML engineers to take performance analysis.
However, if the PID namespace of the task container is isolated from the host machine’s, we found that, even deployed within the container, the NVML (func nvmlDeviceGetComputeRunningProcesses) gives the PID(s) on the host machine. That makes the following info processing difficult because only the container PID namespace is visible to users (ML engineers). Is there any solution to overcome this pid namespace isolation? Or does NVML has any plan to extend nvmlDeviceGetComputeRunningProcesses so that it can return pid in the container PID namespace? ’’
’’ Another instance here, where we are using Gitlab CI/CD to launch Docker containers (which have access to the host’s NVML library). These fail to inspect their own process info because of PID namespace mismatches, and there’s no easy way to launch those containers with --pid=host. ’’
’’ You can put pid: "host" in your docker-compose.yml file to make it work. It’s documented here.
pid: "host"
Sets the PID mode to the host PID mode. This turns on sharing between container and the host operating system the PID address space. Containers launched with this flag can access and manipulate other containers in the bare-metal machine’s namespace and vice versa. ’’
’’ These functions should help:
```python
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
1
>>> torch.cuda.current_device()
0
>>> torch.cuda.device(0)
<torch.cuda.device at 0x7efce0b03be0>
>>> torch.cuda.get_device_name(0)
'GeForce GTX 950M'
```
This tells us:
- CUDA is available and can be used by one device.
- Device 0 refers to the GPU GeForce GTX 950M, and it is currently chosen by PyTorch. ‘’
Tests
The tests for that part are run from the repo by:
make nvidia-smi-dummy-load-tests > ./logs/test_nvidia-smi_dummy-load.log
(Or as a part of make log-all-tests)
It basically runs two containers using Docker Compose, showing the:
- Basic behaviour: RR5555/compose-gpu_access_restriction/nvidia-smi/compose_testGPUAccess_dummy_load.yml
- Behaviour when using pid: host (similar to docker run --pid=host): RR5555/compose-gpu_access_restriction/nvidia-smi/compose_testGPUAccess_dummy_load_pid_host.yml
This translates into the following yml:

```yaml
name: gpu_access_test
services:
  test:
    image: rr5555/gpu_access_test:test_both
    runtime: nvidia
    pid: host
    deploy:
      restart_policy:
        condition: on-failure
    volumes:
      - ./code:/code
    entrypoint: >
      bash -c "python3 /code/base_fct.py & sleep 1; ps -aux | grep -i python3; nvidia-smi; sleep 5"
```
Please read the Search section and/or the Take-aways section before considering using pid: host.
The container used here rr5555/gpu_access_test:test_both is simply a cuda12 ubuntu22.04 img with pip3, pytorch and pynvml installed (see the associated Dockerfile).
They both run the following entrypoint:
python3 /code/base_fct.py & sleep 1; ps -aux | grep -i python3; nvidia-smi; sleep 5
Where the main of RR5555/compose-gpu_access_restriction/nvidia-smi/code/base_fct.py simply loads a tensor onto one device and then runs a function to parse the PIDs of running programs obtained using pynvml, and tries to get the associated command from ps aux.
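For reference, here is a rough sketch of what such a script boils down to (an approximation for illustration, not the actual base_fct.py from the repo):

```python
# Rough approximation of the test script (not the actual /code/base_fct.py).
import subprocess

import torch

# Put a small dummy load on one CUDA device so that a compute process shows up.
dummy = torch.ones((1000, 1000), device="cuda:0")

for i in range(torch.cuda.device_count()):
    # list_gpu_processes() uses pynvml under the hood and returns lines such as
    # "GPU:0" followed by "process <pid> uses <x> MB GPU memory".
    report = torch.cuda.list_gpu_processes(i)
    print(repr(report))
    print(report)
    lines = report.splitlines()[1:]
    print(lines)
    # Extract the PIDs reported by NVML and try to find them in the container's ps output.
    pids = [line.split()[1] for line in lines if line.startswith("process")]
    ps_output = subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout
    print({pid: [l for l in ps_output.splitlines() if f" {pid} " in l] for pid in pids})
```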
The output (slightly edited to improve readability) of the test can be found below. It should be noted that the Python script is launched first, then the sanity check ps -aux | grep -i python3 is displayed, followed by nvidia-smi, and lastly the Python script finishes: extracting the PIDs of programs running on the GPUs and trying to find their corresponding commands from ps aux.
Output of the test:
```
Testing nvidia-smi processes:
...
root 1 0.0 0.0 4364 2688 ? Ss 11:19 0:00 bash -c python3 /code/base_fct.py & ps -aux | grep -i python3; nvidia-smi; sleep 5
root 7 0.0 0.0 9916 896 ? D 11:19 0:00 python3 /code/base_fct.py
root 9 0.0 0.0 3472 1344 ? S 11:19 0:00 grep -i python3
Sun Apr 21 11:19:53 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04 Driver Version: 535.171.04 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA TITAN X (Pascal) Off | 00000000:03:00.0 Off | N/A |
| 23% 27C P8 9W / 250W | 29MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA TITAN X (Pascal) Off | 00000000:04:00.0 Off | N/A |
| 23% 24C P8 8W / 250W | 6MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA TITAN X (Pascal) Off | 00000000:81:00.0 Off | N/A |
| 23% 25C P8 10W / 250W | 6MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA TITAN V Off | 00000000:82:00.0 Off | N/A |
| 28% 33C P8 24W / 250W | 5MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
'GPU:0\nno processes are running'
GPU:0
no processes are running
['no processes are running']
no processes are running
{}
'GPU:1\nno processes are running'
GPU:1
no processes are running
['no processes are running']
no processes are running
{}
'GPU:2\nno processes are running'
GPU:2
no processes are running
['no processes are running']
no processes are running
{}
'GPU:3\nprocess 284964 uses 326.000 MB GPU memory'
GPU:3
process 284964 uses 326.000 MB GPU memory
['process 284964 uses 326.000 MB GPU memory']
284964
{'284964': []}
...
Testing nvidia-smi processes with pid host:
...
root 1401 0.0 0.0 42344 19712 ? Ss Apr20 0:00 /usr/bin/python3 /usr/bin/networkd-dispatcher --run-startup-triggers
root 1565 0.0 0.0 119444 23296 ? Ssl Apr20 0:00 /usr/bin/python3 /usr/share/unattended-upgrades/unattended-upgrade-shutdown --wait-for-signal
root 285927 3.0 0.0 4364 2688 ? Ss 11:19 0:00 bash -c python3 /code/base_fct.py & sleep 1; ps -aux | grep -i python3; nvidia-smi; sleep 5
root 285983 102 0.4 2911560 309120 ? R 11:19 0:01 python3 /code/base_fct.py
root 286004 0.0 0.0 3472 1792 ? S 11:20 0:00 grep -i python3
Sun Apr 21 11:20:01 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04 Driver Version: 535.171.04 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA TITAN X (Pascal) Off | 00000000:03:00.0 Off | N/A |
| 23% 27C P8 8W / 250W | 29MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA TITAN X (Pascal) Off | 00000000:04:00.0 Off | N/A |
| 23% 24C P8 9W / 250W | 6MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA TITAN X (Pascal) Off | 00000000:81:00.0 Off | N/A |
| 23% 25C P8 8W / 250W | 6MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA TITAN V Off | 00000000:82:00.0 Off | N/A |
| 28% 33C P2 25W / 250W | 5MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 2313 G /usr/lib/xorg/Xorg 18MiB |
| 0 N/A N/A 3072 G /usr/bin/gnome-shell 8MiB |
| 1 N/A N/A 2313 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 2313 G /usr/lib/xorg/Xorg 4MiB |
| 3 N/A N/A 2313 G /usr/lib/xorg/Xorg 4MiB |
+---------------------------------------------------------------------------------------+
'GPU:0\nno processes are running'
GPU:0
no processes are running
['no processes are running']
no processes are running
{}
'GPU:1\nno processes are running'
GPU:1
no processes are running
['no processes are running']
no processes are running
{}
'GPU:2\nno processes are running'
GPU:2
no processes are running
['no processes are running']
no processes are running
{}
'GPU:3\nprocess 285983 uses 326.000 MB GPU memory'
GPU:3
process 285983 uses 326.000 MB GPU memory
['process 285983 uses 326.000 MB GPU memory']
285983
{'285983': ['root 285983 80.0 0.7 20696168 479892 ? Sl 11:19 0:01 python3 /code/base_fct.py\n']}
```
Please note the presence of a process with pid 285983 in the ps part of "Testing nvidia-smi processes with pid host:", which corresponds to the program python3 /code/base_fct.py. Whereas 284964, in "Testing nvidia-smi processes:", was nowhere to be found in the container, and python3 /code/base_fct.py had pid 7 inside the container but could not be directly identified by NVML.
Take-aways
Without any option modification, you can check GPU utilization, and in particular the processes running on a GPU, using NVML; in the case of Python, through pynvml, which is just a wrapper for NVML. However, the returned PIDs are nowhere to be found inside the container.
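As an illustration, a minimal pynvml query (a sketch, not code from the test repo) looks like this; inside an unmodified container, the PIDs it prints are host-namespace PIDs:

```python
# Minimal pynvml sketch: report utilization and compute processes per GPU.
import pynvml as N

N.nvmlInit()
for i in range(N.nvmlDeviceGetCount()):
    handle = N.nvmlDeviceGetHandleByIndex(i)
    util = N.nvmlDeviceGetUtilizationRates(handle)  # GPU and memory utilization (%)
    print(f"GPU {i}: gpu={util.gpu}% mem={util.memory}%")
    for proc in N.nvmlDeviceGetComputeRunningProcesses(handle):
        # These PIDs live in the host PID namespace: without pid: host they do
        # not correspond to anything visible inside the container.
        print(f"  pid={proc.pid} used_memory={proc.usedGpuMemory}")
N.nvmlShutdown()
```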
One could also use PyTorch, where a lot of functions are available (see PyTorch Docs – torch.cuda):
```python
torch.cuda.is_available()
torch.cuda.device_count()
torch.cuda.current_device()
torch.cuda.device(0)
torch.cuda.get_device_name(0)
torch.cuda.get_device_properties(0)
torch.cuda.memory_usage(0)
torch.cuda.utilization(0)
torch.cuda.list_gpu_processes(0)
```
However, the more advanced function torch.cuda.list_gpu_processes actually requires pynvml (see torch.cuda.list_gpu_processes SOURCE).
From the Search, the issue stems from the separation between the host PID namespace and the container PID namespace. This is actually part of the isolation benefits provided by containerization. Indeed, we can see in the proposed solutions and in the tests that sharing the host PID namespace with the container fixes nvidia-smi, and gives the program on the GPU the same PID inside and outside the container, thus allowing us to find, from within the container, the host PID that NVML reports.
Using pid: host (or --pid=host) could allow the container to interfere with host processes, which is usually not desirable.
More specifically, the root cause would actually be a flaw in the NVIDIA driver when queried from inside the container:
’’ The NVIDIA driver is not aware of the PID namespace and nvidia-smi has no capability to map global pid to virtual pid, thus it shows nothing. ‘’
Some people have apparently provided fixes for that flaw by directly correcting the driver:
- GH matpool/mpu
- GH gh2o/nvidia-pidns
I have neither tried nor yet fully understood the fixes proposed in the links above. matpool/mpu provides a detailed explanation of the process its author was able to reconstruct, thereby making it possible to compare that tentative process with the proposed fix. However, the details of how he arrived at the tentative process are not given, only an overall high-level list of the steps. Tracking the detailed steps for myself will be a mission for another day.
General wisdom: Only run code that you trust. (Though, in practice, some compromises are made, and vetting is often left to others hopefully more competent in the case of open-source. But the recent XZ issue still serves as a cautionary tale.)
or by using an additional pod (in the specific case of k8s):
https://github.com/dmrub/kube-utils/blob/master/kube-nvidia-get-processes.py
Note that the tool runs a privileged pod on each node with GPU to map the node PID to the container PID.
privileged mode is another topic, but suffice it to say that it also comes with security risks and is thus not enabled by default. It is however necessary in some cases/applications. More on that when I post about Docker-in-Docker (DinD).
For now, no official fix has been implemented. I think that matpool’s or gh2o’s fixes could be an answer if trusted and/or verified, but until I or someone else has the time to dive into these potential fixes, for me at least, the problem will remain open.
Restricting CUDA devices in Docker Compose
Problem
The problem is simple: how to restrict access to GPUs for a container/service in Docker Compose? This is mainly to make sure that services will not interfere with GPUs that are assigned to other purposes, i.e. this is about resource management, not about guarding against ill intent.
The typical use case here would be to have a container only having access to one specific GPU out of several available.
Search
- Limit GPU usage in nvidia-docker?
’’ You can try it with nvidia-docker-compose
version: "2" services process1: image: nvidia/cuda devices: - /dev/nvidia0
’’
’’
There are 3 options.
Docker with NVIDIA RUNTIME (version 2.0.x)
According to official documentation
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=2,3
nvidia-docker (version 1.0.x)
based on a popular post
nvidia-docker run .... -e CUDA_VISIBLE_DEVICES=0,1,2
(it works with tensorflow)
programmatically

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0,1,2"
```
’’
’’ Access specific devices
To allow access only to GPU-0 and GPU-3 devices:
```yaml
services:
  test:
    image: tensorflow/tensorflow:latest-gpu
    command: python -c "import tensorflow as tf;tf.test.gpu_device_name()"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0', '3']
              capabilities: [gpu]
```
’’
Tests
The tests for that part are run from the repo by:
make all-nvidia-smi-tests > ./logs/test_nvidia-smi.log
make all-pytorch-tests > ./logs/test_pytorch.log
make all-pynvml-tests > ./logs/test_pynvml.log
(Or as a part of make log-all-tests)
I basically made compose.yml files:
```yaml
version: "0.1"
name: gpu_access_test
services:
  test:
    image: rr5555/gpu_access_test:test_pytorch
    runtime: nvidia
    deploy:
      restart_policy:
        condition: on-failure
    entrypoint: >
      bash -c "printf 'import torch\\nprint(f\"torch.cuda.is_available():{torch.cuda.is_available()}\")\\nprint(f\"torch.cuda.device_count():{torch.cuda.device_count()}\")' | python3"
```
each with a different configuration regarding the cuda devices declaration:
- device_ids:

```yaml
services:
  test:
    image: rr5555/gpu_access_test:test_pytorch
    deploy:
      restart_policy:
        condition: on-failure
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids:
                - "0"
              capabilities: [gpu]
```
- GPU_ID:

```yaml
services:
  test:
    image: rr5555/gpu_access_test:test_pytorch
    deploy:
      restart_policy:
        condition: on-failure
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
    environment:
      GPU_ID: 0
```
- CUDA_VISIBLE_DEVICES:

```yaml
services:
  test:
    image: rr5555/gpu_access_test:test_pytorch
    deploy:
      restart_policy:
        condition: on-failure
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
    environment:
      CUDA_VISIBLE_DEVICES: 0
```
- NVIDIA_VISIBLE_DEVICES:

```yaml
services:
  test:
    image: rr5555/gpu_access_test:test_pytorch
    deploy:
      restart_policy:
        condition: on-failure
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
    environment:
      NVIDIA_VISIBLE_DEVICES: 0
```
and three test versions:
- one using pytorch:

```yaml
entrypoint: >
  bash -c "printf 'import torch\\nprint(f\"torch.cuda.is_available():{torch.cuda.is_available()}\")\\nprint(f\"torch.cuda.device_count():{torch.cuda.device_count()}\")' | python3; NVIDIA_VISIBLE_DEVICES=all printf 'import torch\\nprint(f\"torch.cuda.is_available():{torch.cuda.is_available()}\")\\nprint(f\"torch.cuda.device_count():{torch.cuda.device_count()}\")' | python3"
```
- one using pynvml:

```yaml
entrypoint: >
  bash -c "printf 'from pynvml import *\\nnvmlInit()\\nprint(\"Driver Version:\", nvmlSystemGetDriverVersion())\\ndeviceCount = nvmlDeviceGetCount()\\nfor i in range(deviceCount):\\n\\thandle = nvmlDeviceGetHandleByIndex(i)\\n\\tprint(\"Device\", i, \":\", nvmlDeviceGetName(handle))\\nnvmlShutdown()' | python3; NVIDIA_VISIBLE_DEVICES=all printf 'from pynvml import *\\nnvmlInit()\\nprint(\"Driver Version:\", nvmlSystemGetDriverVersion())\\ndeviceCount = nvmlDeviceGetCount()\\nfor i in range(deviceCount):\\n\\thandle = nvmlDeviceGetHandleByIndex(i)\\n\\tprint(\"Device\", i, \":\", nvmlDeviceGetName(handle))\\nnvmlShutdown()' | python3"
```
- one using nvidia-smi:

```yaml
entrypoint: >
  bash -c "nvidia-smi; echo 'Env. reverse back:'; NVIDIA_VISIBLE_DEVICES=all nvidia-smi"
```
Where we just check which GPUs are detected with each tool/library, and then rerun the test after setting the environment variable back to all inside the container when applicable, to make sure that the previously set variable was taken into account when setting up the service/container, and not merely set as an environment variable inside the container. (The tests do indeed indicate the former: the variable is taken into account when setting up the container, rather than merely ending up as a value in the container's shell.)
- NVIDIA_VISIBLE_DEVICES=all, see Specialized Configurations with Docker
- GPU_IDS=all, see Running Redact Enterprise Onpremise – GPU_IDS
You can find the three docker imgs that I made and used on DockerHub:
- rr5555/gpu_access_test:test_pytorch: basically a cuda12 ubuntu22.04 img with pip3 and pytorch installed (see the associated Dockerfile)
- rr5555/gpu_access_test:test_pynvml: basically a cuda12 ubuntu22.04 img with pip3 and pynvml installed (see the associated Dockerfile)
- rr5555/gpu_access_test:test_nvidia_smi: just a cuda12 ubuntu22.04 img (see the associated Dockerfile)
The edited results of the tests can be found in the following table:
| | pytorch | pynvml | nvidia-smi |
|---|---|---|---|
| Control | torch.cuda.is_available():True<br>torch.cuda.device_count():4 | Driver Version: 535.171.04<br>Device 0 : NVIDIA TITAN X (Pascal)<br>Device 1 : NVIDIA TITAN X (Pascal)<br>Device 2 : NVIDIA TITAN X (Pascal)<br>Device 3 : NVIDIA TITAN V | 0 NVIDIA TITAN X (Pascal)<br>1 NVIDIA TITAN X (Pascal)<br>2 NVIDIA TITAN X (Pascal)<br>3 NVIDIA TITAN V |
| device_ids | torch.cuda.is_available():True<br>torch.cuda.device_count():1 | Driver Version: 535.171.04<br>Device 0 : NVIDIA TITAN X (Pascal) | 0 NVIDIA TITAN X (Pascal) |
| GPU_ID | torch.cuda.is_available():True<br>torch.cuda.device_count():4 | Driver Version: 535.171.04<br>Device 0 : NVIDIA TITAN X (Pascal)<br>Device 1 : NVIDIA TITAN X (Pascal)<br>Device 2 : NVIDIA TITAN X (Pascal)<br>Device 3 : NVIDIA TITAN V | 0 NVIDIA TITAN X (Pascal)<br>1 NVIDIA TITAN X (Pascal)<br>2 NVIDIA TITAN X (Pascal)<br>3 NVIDIA TITAN V |
| CUDA_VISIBLE_DEVICES | torch.cuda.is_available():True<br>torch.cuda.device_count():1 | Driver Version: 535.171.04<br>Device 0 : NVIDIA TITAN X (Pascal)<br>Device 1 : NVIDIA TITAN X (Pascal)<br>Device 2 : NVIDIA TITAN X (Pascal)<br>Device 3 : NVIDIA TITAN V | 0 NVIDIA TITAN X (Pascal)<br>1 NVIDIA TITAN X (Pascal)<br>2 NVIDIA TITAN X (Pascal)<br>3 NVIDIA TITAN V |
| NVIDIA_VISIBLE_DEVICES | torch.cuda.is_available():True<br>torch.cuda.device_count():1 | Driver Version: 535.171.04<br>Device 0 : NVIDIA TITAN X (Pascal) | 0 NVIDIA TITAN X (Pascal) |
(I also tested with GPU_IDS, just in case, but only for nvidia-smi; the results were similar to those of GPU_ID.)
Take-aways
It follows that, to restrict access to GPUs, the most consistent ways are device_ids and NVIDIA_VISIBLE_DEVICES. CUDA_VISIBLE_DEVICES does not have a consistent behaviour across pytorch, pynvml and nvidia-smi.
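This matches how the variables work: CUDA_VISIBLE_DEVICES is read by the CUDA runtime (which PyTorch uses), while NVML, and therefore pynvml and nvidia-smi, enumerates whatever devices the container was actually given. A small sketch (assuming both torch and pynvml are installed, e.g. the test_both image) makes the difference visible even inside a single container:

```python
# Sketch: CUDA_VISIBLE_DEVICES only filters what the CUDA runtime sees,
# not what NVML enumerates.
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # must be set before the first CUDA call

import pynvml
import torch

pynvml.nvmlInit()
print("torch.cuda.device_count():", torch.cuda.device_count())   # filtered by CUDA_VISIBLE_DEVICES
print("pynvml.nvmlDeviceGetCount():", pynvml.nvmlDeviceGetCount())  # all GPUs exposed to the container
pynvml.nvmlShutdown()
```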
Please do not hesitate to leave a comment below or dm me if you have any feedback, whether it is about a mistake I made in the articles or suggestions for improvements.
Other articles: