Docker, Docker Compose & Nvidia GPUs

Listing GPU processes in a cuda container; Restricting CUDA devices in Docker Compose


Context

Using cuda containers has become very easy thanks to NVIDIA/nvidia-container-toolkit, which superseded the now deprecated NVIDIA/nvidia-container-runtime, and whose installation is very simple (see Installing the NVIDIA Container Toolkit).

Now, say you have correctly installed everything and correctly run an nvidia docker container. Is there any behaviour that might differ from running on the host? Well, unfortunately, to this date, yes. Though it could be considered a minor issue, nvidia-smi and even parts of NVML break a bit inside an nvidia container.

Secondly, say you want to run an nvidia container as a service through Docker Compose. Is there any way to potentially restrict the GPUs that the service/container will be able to access?

All the tests used here are available in the following repo:

To run all the tests and get the logs in ./logs:

make log-all-tests

The associated docker images are available on Dockerhub: DH rr5555/gpu_access_test, but can be remade & modified if needed using the Dockerfiles present in the repo and the associated cmds in the Makefile (which indicates which Dockerfile corresponds to which image tag). These are just nvidia/cuda:12.1.0-cudnn8-runtime-ubuntu22.04 variations with some python packages installed.

Before delving into the details and the tests, please note that:

For the control experiment, the Docker daemon must be configured for the runtime: nvidia arg to be accepted:

From Enabling GPUs with NVIDIA Docker Container Runtime

’’

  1. Docker compose setup for a host machine

Modify daemon.json file at /etc/docker/daemon.json. If the file does not exist, it should be created. File with no additional config would look like below:

{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

’’
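Once daemon.json has been modified and the Docker daemon restarted, it can be worth checking that the nvidia runtime was actually registered before running the control experiment. The following is a minimal sketch (not part of the test repo) that shells out to the docker CLI; it assumes your Docker version supports the {{json .Runtimes}} format string of docker info and that you have the rights to talk to the daemon.

# Hedged helper (hypothetical, not in the repo): check that the "nvidia"
# runtime is registered with the Docker daemon after editing daemon.json
# and restarting dockerd. Assumes the `docker` CLI is on PATH.
import json
import subprocess

def nvidia_runtime_registered() -> bool:
    # `docker info --format '{{json .Runtimes}}'` prints the configured runtimes as JSON.
    out = subprocess.run(
        ["docker", "info", "--format", "{{json .Runtimes}}"],
        capture_output=True, text=True, check=True,
    ).stdout
    return "nvidia" in json.loads(out)

if __name__ == "__main__":
    print("nvidia runtime registered:", nvidia_runtime_registered())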

nvidia-smi & nvml in an nvidia container

Problem

Running an nvidia container, and from inside the container, executing:

nvidia-smi

will result in the GPUs being displayed, but no processes being listed, as can be verified in the test.

What is the cause and is there any solution?

Search

  • No processes display when I using nvidia-smi. #759 => its pointer to Frequently Asked Questions - Why is nvidia-smi inside the container not listing the running processes? is not accessible anymore

  • Is it correct that “nvidia-smi” on Docker does not show “Processes”?:
    From Prabindh’s answer

    ’’ Yes, you will not be able to see, due to driver not being aware of PID namespace. You can peruse the thread and the work-around using Python in particular, at

    How to show processes in container with cmd nvidia-smi? #179 ‘’

    From zionfuo’s answer

    ’’ A shim driver allows in-docker nvidia-smi showing correct process list without modify anything.

    matpool/mpu ‘’

    What is a shim?

    From matpool/mpu

    ’’ In computer programming, a shim is a library that transparently intercepts API calls and changes the arguments passed, handles the operation itself or redirects the operation elsewhere.[1][2] Shims can be used to support an old API in a newer environment, or a new API in an older environment. Shims can also be used for running programs on different software platforms than they were developed for.
    Shims for older APIs typically come about when the behavior of an API changes, thereby causing compatibility issues for older applications which still rely on the older functionality; in such cases, the older API can still be supported by a thin compatibility layer on top of the newer code. Shims for newer APIs are defined as: “a library that brings a new API to an older environment, using only the means of that environment.”[3] ‘’ Shim (computing)

  • How to show processes in container with cmd nvidia-smi? #179

    From How to show processes in container with cmd nvidia-smi? #179 – maaft’s answer

    ’’ Oh sorry, I guess I missed the topic on this one a bit.

    Anyway, if using python is an option for you:

    import os
    import sys
    import pynvml as N
    
    MB = 1024 * 1024
    
    def get_usage(device_index, my_pid):
        N.nvmlInit()
    
        handle = N.nvmlDeviceGetHandleByIndex(device_index)
    
        usage = [nv_process.usedGpuMemory // MB for nv_process in
                 N.nvmlDeviceGetComputeRunningProcesses(handle) + N.nvmlDeviceGetGraphicsRunningProcesses(handle) if
                 nv_process.pid == my_pid]
    
        if len(usage) == 1:
            usage = usage[0]
        else:
            raise KeyError("PID not found")
    
        return usage
    
    if __name__ == "__main__":
       print(get_usage(int(sys.argv[1]), int(sys.argv[2])))  # cast CLI args so the device index and PID compare correctly
    

    Instead of calling nvidia-smi from your process you could just call this little script with the device-index of your GPU and the processe’s PID as arguments using popen or anything else and reading from stdout. ‘’

    From maaft’s answer

    ’’ FYI: As a workaround I start the docker container with the --pid=host flag which works perfectly fine with e.g. python’s os.getpid(). ‘’

    Dirty work-around:

    docker run --pid=host
    
    From pandaji’s answer

    ’’ Adding hostPID might not be a good idea for k8s clusters in which the GPU workers are shared among multiple tenants.

    The users can potentially kill processes that belong to others if they are root in the container and they can see all the pids on the host.

    Still hope that this can be patched at driver level to add PID namespace perception. ‘’

    From nvjmayo’s answer

    ’’ options available to you, I am closing this:

    1. add hostPID: true to pod spec
    2. for docker (rather than Kubernetes) run with --privileged or --pid=host. This is useful if you need to run nvidia-smi manually as an admin for troubleshooting.
    3. set up MIG partitions on a supported card

    Some of the information people want to see (like memory and GPU usage) might be better represented in a monitor (Prometheus) where a graph can provide better visualization of multiple nodes in a cluster. I would love to see a way to monitor processes cluster-wide, and not simply for a particular node. (a bit out of scope for this issue)

    If there is a specific feature or enhancement to one of the options above, please open a new issue. ‘’

    From dmrub’s answer

    ’’ I stumbled across this ticket while looking for a way to get information in the Kubernetes cluster about which pod/container is using which NVIDIA GPU, even if the pod is not running in hostPID mode. I have developed a tool to get this information. In case it will be useful for someone:

    https://github.com/dmrub/kube-utils/blob/master/kube-nvidia-get-processes.py

    Note that the tool runs a privileged pod on each node with GPU to map the node PID to the container PID.

    Here is the sample output:

    NODE              POD                            NAMESPACE NODE_PID PID   GPU GPU_NAME             PCI_ADDRESS      GPU_MEMORY CMDLINE                                                                                                                             
    proj-vm-1-node-02 pytorch-nvidia-794d7bb8d-mb7qp jodo01    27301    24927 0   Tesla V100-PCIE-32GB 00000000:00:07.0 813 MiB    /opt/conda/bin/python -m ipykernel_launcher -f /root/.local/share/jupyter/runtime/kernel-9a103c15-db05-4427-a850-cc79412cb4e8.json  
    proj-vm-1-node-02 pytorch-nvidia-794d7bb8d-mb7qp jodo01    7744     25151 0   Tesla V100-PCIE-32GB 00000000:00:07.0 5705 MiB   /opt/conda/bin/python -m ipykernel_launcher -f /root/.local/share/jupyter/runtime/kernel-4049593b-313b-444f-8188-a33875108b10.json  
    

    ’’

    From gh2o’s answer

    ’’ I made a kernel module that patches the nvidia driver for correct PID namespace handling: https://github.com/gh2o/nvidia-pidns ‘’

  • container PID namespace isolation with NVML #63
    From zw0610’s answer

    ’’ We’d like to deploy NVML-based monitoring tools to each task container, providing GPU information for ML engineers to take performance analysis.

    However, if the PID namespace of the task container is isolated from the host machine’s, we found that, even deployed within the container, the NVML (func nvmlDeviceGetComputeRunningProcesses) gives the PID(s) on the host machine. That makes the following info processing difficult because only the container PID namespace is visible to users (ML engineers).

    Is there any solution to overcome this pid namespace isolation? Or does NVML has any plan to extend nvmlDeviceGetComputeRunningProcesses so that it can return pid in the container PID namespace? ‘’

    From maleadt’s answer

    ’’ Another instance here, where we are using Gitlab CI/CD to launch Docker containers (which have access to the host’s NVML library). These fail to inspect their own process info because of PID namespace mismatches, and there’s no easy way to launch those containers with --pid=host. ‘’

From --pid=host To Set through DockerFile – 7_R3X’s answer

’’ You can put pid: "host" in your docker-compose.yml file to make it work. It’s documented here.

pid: "host"

Sets the PID mode to the host PID mode. This turns on sharing between container and the host operating system the PID address space. Containers launched with this flag can access and manipulate other containers in the bare-metal machine’s namespace and vice versa.

’’
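If pid: host (or --pid=host) is used despite the caveats above, a quick sanity check is to verify that the container's own PID now shows up in NVML's process list. Below is a minimal sketch assuming pynvml and pytorch are installed in the container; it is not part of the test repo.

# Minimal sketch (assumes pynvml and torch are installed, and the container runs
# with pid: host / --pid=host): allocate a tensor on GPU 0, then check whether
# our own PID appears in NVML's compute-process list for that device.
import os

import pynvml
import torch

x = torch.ones(1000, 1000, device="cuda:0")  # creates a CUDA context on device 0
torch.cuda.synchronize()

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
pids = [p.pid for p in pynvml.nvmlDeviceGetComputeRunningProcesses(handle)]
pynvml.nvmlShutdown()

# With pid: host this should print True; without it, NVML returns host PIDs
# that do not match the container-side os.getpid().
print(os.getpid() in pids)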

From How do I check if PyTorch is using the GPU? – vvvvv’s answer

’’ These functions should help:

>>> import torch

>>> torch.cuda.is_available()
True

>>> torch.cuda.device_count()
1

>>> torch.cuda.current_device()
0

>>> torch.cuda.device(0)
<torch.cuda.device at 0x7efce0b03be0>

>>> torch.cuda.get_device_name(0)
'GeForce GTX 950M'

This tells us:

  • CUDA is available and can be used by one device.
  • Device 0 refers to the GPU GeForce GTX 950M, and it is currently chosen by PyTorch. ‘’

Tests

The tests for that part are run from the repo by:

make nvidia-smi-dummy-load-tests > ./logs/test_nvidia-smi_dummy-load.log

(Or as a part of make log-all-tests)

It basically runs two containers using Docker Compose, one without and one with pid: host, showing the difference in what nvidia-smi and NVML report.

Please read the Search section and/or the Take-aways section before considering using pid: host.

The container image used here, rr5555/gpu_access_test:test_both, is simply a cuda12 ubuntu22.04 img with pip3, pytorch and pynvml installed (see the associated Dockerfile).

They both run the following entrypoint:

python3 /code/base_fct.py & sleep 1; ps -aux | grep -i python3; nvidia-smi; sleep 5

Where the main of RR5555/compose-gpu_access_restriction/nvidia-smi/code/base_fct.py simply loads a tensor onto one device and then runs a function that parses the PIDs of the programs running on the GPUs (obtained using pynvml) and tries to get the associated cmd from ps aux.
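The actual script is in the repo; the following is only a rough, simplified stand-in for it, to illustrate the idea (its exact output differs from the real base_fct.py):

# Rough, simplified stand-in for base_fct.py (the real script is in the repo):
# put a tensor on one GPU, list the processes reported for each device through
# pynvml (via torch.cuda.list_gpu_processes), and look each PID up in `ps aux`.
import re
import subprocess

import torch

x = torch.ones(4096, 4096, device="cuda:0")  # dummy load on the first visible GPU

ps_output = subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout
for i in range(torch.cuda.device_count()):
    report = torch.cuda.list_gpu_processes(i)  # uses pynvml under the hood
    print(report)
    for pid in re.findall(r"process\s+(\d+)\s+uses", report):
        # Try to map the NVML-reported PID to a command line visible in the container
        matches = [line for line in ps_output.splitlines() if line.split()[1:2] == [pid]]
        print({pid: matches})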

The output (slightly edited to improve readability) of the test can be found below. It should be noted that the python script is launched first, then the sanity check ps -aux | grep -i python3 is displayed, followed by nvidia-smi, and lastly the python script finishes: extracting the PIDs of the programs running on the GPUs and trying to find their corresponding cmd in ps aux.

Output of the test:

Testing nvidia-smi processes:
...
root           1  0.0  0.0   4364  2688 ?        Ss   11:19   0:00 bash -c python3 /code/base_fct.py & ps -aux | grep -i python3; nvidia-smi; sleep 5
root           7  0.0  0.0   9916   896 ?        D    11:19   0:00 python3 /code/base_fct.py
root           9  0.0  0.0   3472  1344 ?        S    11:19   0:00 grep -i python3
Sun Apr 21 11:19:53 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA TITAN X (Pascal)        Off | 00000000:03:00.0 Off |                  N/A |
| 23%   27C    P8               9W / 250W |     29MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA TITAN X (Pascal)        Off | 00000000:04:00.0 Off |                  N/A |
| 23%   24C    P8               8W / 250W |      6MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA TITAN X (Pascal)        Off | 00000000:81:00.0 Off |                  N/A |
| 23%   25C    P8              10W / 250W |      6MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA TITAN V                 Off | 00000000:82:00.0 Off |                  N/A |
| 28%   33C    P8              24W / 250W |      5MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
'GPU:0\nno processes are running'
GPU:0
no processes are running
['no processes are running']
no processes are running
{}
'GPU:1\nno processes are running'
GPU:1
no processes are running
['no processes are running']
no processes are running
{}
'GPU:2\nno processes are running'
GPU:2
no processes are running
['no processes are running']
no processes are running
{}
'GPU:3\nprocess     284964 uses      326.000 MB GPU memory'
GPU:3
process     284964 uses      326.000 MB GPU memory
['process     284964 uses      326.000 MB GPU memory']
284964
{'284964': []}
...
Testing nvidia-smi processes with pid host:
...
root        1401  0.0  0.0  42344 19712 ?        Ss   Apr20   0:00 /usr/bin/python3 /usr/bin/networkd-dispatcher --run-startup-triggers
root        1565  0.0  0.0 119444 23296 ?        Ssl  Apr20   0:00 /usr/bin/python3 /usr/share/unattended-upgrades/unattended-upgrade-shutdown --wait-for-signal
root      285927  3.0  0.0   4364  2688 ?        Ss   11:19   0:00 bash -c python3 /code/base_fct.py & sleep 1; ps -aux | grep -i python3; nvidia-smi; sleep 5
root      285983  102  0.4 2911560 309120 ?      R    11:19   0:01 python3 /code/base_fct.py
root      286004  0.0  0.0   3472  1792 ?        S    11:20   0:00 grep -i python3
Sun Apr 21 11:20:01 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA TITAN X (Pascal)        Off | 00000000:03:00.0 Off |                  N/A |
| 23%   27C    P8               8W / 250W |     29MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA TITAN X (Pascal)        Off | 00000000:04:00.0 Off |                  N/A |
| 23%   24C    P8               9W / 250W |      6MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA TITAN X (Pascal)        Off | 00000000:81:00.0 Off |                  N/A |
| 23%   25C    P8               8W / 250W |      6MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA TITAN V                 Off | 00000000:82:00.0 Off |                  N/A |
| 28%   33C    P2              25W / 250W |      5MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2313      G   /usr/lib/xorg/Xorg                           18MiB |
|    0   N/A  N/A      3072      G   /usr/bin/gnome-shell                          8MiB |
|    1   N/A  N/A      2313      G   /usr/lib/xorg/Xorg                            4MiB |
|    2   N/A  N/A      2313      G   /usr/lib/xorg/Xorg                            4MiB |
|    3   N/A  N/A      2313      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+
'GPU:0\nno processes are running'
GPU:0
no processes are running
['no processes are running']
no processes are running
{}
'GPU:1\nno processes are running'
GPU:1
no processes are running
['no processes are running']
no processes are running
{}
'GPU:2\nno processes are running'
GPU:2
no processes are running
['no processes are running']
no processes are running
{}
'GPU:3\nprocess     285983 uses      326.000 MB GPU memory'
GPU:3
process     285983 uses      326.000 MB GPU memory
['process     285983 uses      326.000 MB GPU memory']
285983
{'285983': ['root      285983 80.0  0.7 20696168 479892 ?     Sl   11:19   0:01 python3 /code/base_fct.py\n']}

Please note the presence of a process with pid 285983 in the ps part of Testing nvidia-smi processes with pid host:, which corresponds to the program python3 /code/base_fct.py. Whereas 284964 in Testing nvidia-smi processes: was nowhere to be found in the container, and python3 /code/base_fct.py had pid 7 inside the container but could not be directly identified through NVML.

Take-aways

Without any option modification, you can check GPU utilization, and in particular the processes running on a GPU, using NVML (in the case of python, pynvml, which is just a wrapper around NVML). However, the returned PIDs are nowhere to be found inside the container.

One could also use PyTorch, where a lot of functions are available (see PyTorch Docs – torch.cuda):

torch.cuda.is_available()
torch.cuda.device_count()
torch.cuda.current_device()
torch.cuda.device(0)
torch.cuda.get_device_name(0)
torch.cuda.get_device_properties(0)
torch.cuda.memory_usage(0)
torch.cuda.utilization(0)
torch.cuda.list_gpu_processes(0)

However, the more advanced function torch.cuda.list_gpu_processes actually requires pynvml (see torch.cuda.list_gpu_processes SOURCE).
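For reference, here is roughly the information that function pulls out of NVML when pynvml is available, queried directly through pynvml (a minimal sketch for device 0, not the PyTorch source itself):

# Minimal sketch: query NVML directly through pynvml for the compute processes
# on device 0, which is essentially what torch.cuda.list_gpu_processes reports.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
for p in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
    mem_mb = (p.usedGpuMemory or 0) / (1024 * 1024)  # usedGpuMemory may be unavailable
    print(f"process {p.pid} uses {mem_mb:.3f} MB GPU memory")
pynvml.nvmlShutdown()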

From the Search, the issue stems from the separation between the host PID namespace and the container PID namespace. This is actually part of the isolation benefits provided by containerization. Indeed, we can see in the proposed solutions and in the tests that sharing the host PID namespace with the container fixes nvidia-smi and gives the program running on the GPU a PID shared with the host, thus allowing us to find, inside the container, the host PID that NVML returns.

Using pid: host (or --pid=host) could allow the container to interfere with host processes which is usually not desirable.

More specifically, it comes down to a flaw in the Nvidia driver when queried from inside the container:

From GH matpool/mpu

’’ The NVIDIA driver is not aware of the PID namespace and nvidia-smi has no capability to map global pid to virtual pid, thus it shows nothing. ‘’

Some people have apparently provided fixes for that flaw by directly patching the driver (matpool/mpu's shim and gh2o's nvidia-pidns kernel module, both linked above).

I have neither tried nor yet understood the fixes proposed in the links above. matpool/mpu provides a detailed explanation of the process that its author was able to reconstruct, thereby making it possible to compare that tentative process with the proposed fix. However, the details of how he arrived at the tentative process are not given, only an overall high-level list of the steps. Tracking the detailed steps for myself will be a mission for another day.

General wisdom: Only run code that you trust. (Though, in practice, some compromises are made, and vetting is often left to others hopefully more competent in the case of open-source. But the recent XZ issue still serves as a cautionary tale.)

or by using an additional pod (in the specific case of k8s):

From dmrub’s answer

https://github.com/dmrub/kube-utils/blob/master/kube-nvidia-get-processes.py

Note that the tool runs a privileged pod on each node with GPU to map the node PID to the container PID.

privileged mode is another topic, but suffice to say that it also comes with security risks and is thus not enabled by default. It is however necessary in some cases/applications. More on that when I post about Docker-in-Docker (DinD).

For now, no official fix has been implemented. I think that matpool's or gh2o's fixes could be an answer if trusted and/or verified. But until I, or someone else, have the time to dive into these potential fixes, the problem will remain open, at least for me.

CUDA devices in Docker Compose

Problem

The problem is simple: how to restrict access to GPUs for a container/service in Docker Compose? This is mainly to make sure that services will not interfere with GPUs that are assigned to other purposes, that is, considering resource management, and not ill intent.

The typical use case here would be to have a container only having access to one specific GPU out of several available.

Search

From Turn on GPU access with Docker Compose

’’ Access specific devices

To allow access only to GPU-0 and GPU-3 devices:

services:
  test:
    image: tensorflow/tensorflow:latest-gpu
    command: python -c "import tensorflow as tf;tf.test.gpu_device_name()"
    deploy:
      resources:
        reservations:
          devices:
          - driver: nvidia
            device_ids: ['0', '3']
            capabilities: [gpu]

’’

Tests

The tests for that part are run from the repo by:

make all-nvidia-smi-tests > ./logs/test_nvidia-smi.log
make all-pytorch-tests > ./logs/test_pytorch.log
make all-pynvml-tests > ./logs/test_pynvml.log

(Or as a part of make log-all-tests)

I basically made compose.yml files:

version: "0.1"
name: gpu_access_test


services:

  test:
    image: rr5555/gpu_access_test:test_pytorch
    runtime: nvidia
    deploy:
      restart_policy:
        condition: on-failure

    entrypoint: >
      bash -c "printf 'import torch\\nprint(f\"torch.cuda.is_available():{torch.cuda.is_available()}\")\\nprint(f\"torch.cuda.device_count():{torch.cuda.device_count()}\")' | python3"

each with a different configuration regarding the cuda devices declaration:

  • device_ids:
      services:
    
        test:
          image: rr5555/gpu_access_test:test_pytorch
          deploy:
            restart_policy:
              condition: on-failure
            resources:
              reservations:
                devices:
                - driver: nvidia
                  device_ids:
                    - "0"
                  capabilities: [gpu]
    
  • GPU_ID:
      services:
    
        test:
          image: rr5555/gpu_access_test:test_pytorch
          deploy:
            restart_policy:
              condition: on-failure
            resources:
              reservations:
                devices:
                - driver: nvidia
                  capabilities: [gpu]
    
    
          environment:
            GPU_ID: 0
    
  • CUDA_VISIBLE_DEVICES:
      services:
    
        test:
          image: rr5555/gpu_access_test:test_pytorch
          deploy:
            restart_policy:
              condition: on-failure
            resources:
              reservations:
                devices:
                - driver: nvidia
                  capabilities: [gpu]
    
    
          environment:
            CUDA_VISIBLE_DEVICES: 0
    
  • NVIDIA_VISIBLE_DEVICES:
      services:
    
        test:
          image: rr5555/gpu_access_test:test_pytorch
          deploy:
            restart_policy:
              condition: on-failure
            resources:
              reservations:
                devices:
                - driver: nvidia
                  capabilities: [gpu]
    
    
          environment:
            NVIDIA_VISIBLE_DEVICES: 0
    

and three test versions:

  • one using pytorch:
    entrypoint: >
        bash -c "printf 'import torch\\nprint(f\"torch.cuda.is_available():{torch.cuda.is_available()}\")\\nprint(f\"torch.cuda.device_count():{torch.cuda.device_count()}\")' | python3; NVIDIA_VISIBLE_DEVICES=all printf 'import torch\\nprint(f\"torch.cuda.is_available():{torch.cuda.is_available()}\")\\nprint(f\"torch.cuda.device_count():{torch.cuda.device_count()}\")' | python3"
    
  • one using pynvml:
    entrypoint: >
        bash -c "printf 'from pynvml import *\\nnvmlInit()\\nprint(\"Driver Version:\", nvmlSystemGetDriverVersion())\\ndeviceCount = nvmlDeviceGetCount()\\nfor i in range(deviceCount):\\n\\thandle = nvmlDeviceGetHandleByIndex(i)\\n\\tprint(\"Device\", i, \":\", nvmlDeviceGetName(handle))\\nnvmlShutdown()' | python3; NVIDIA_VISIBLE_DEVICES=all printf 'from pynvml import *\\nnvmlInit()\\nprint(\"Driver Version:\", nvmlSystemGetDriverVersion())\\ndeviceCount = nvmlDeviceGetCount()\\nfor i in range(deviceCount):\\n\\thandle = nvmlDeviceGetHandleByIndex(i)\\n\\tprint(\"Device\", i, \":\", nvmlDeviceGetName(handle))\\nnvmlShutdown()' | python3"
    
  • one using nvidia-smi:
    entrypoint: >
        bash -c "nvidia-smi; echo 'Env. reverse back:'; NVIDIA_VISIBLE_DEVICES=all nvidia-smi"
    

Where we just check which GPUs are detected with each tool/library, and then, when applicable, rerun the test after setting the environment variable back to all inside the container, to make sure that the previously set variable was taken into account when setting up the service/container, and not merely exported inside the container. (The tests do indeed indicate the former: the variable is taken into account to set up the container, and not merely present in the shell of the container.)

NVIDIA_VISIBLE_DEVICES=all, see Specialized Configurations with Docker

GPU_IDS=all, see Running Redact Enterprise Onpremise – GPU_IDS
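Since the escaped printf one-liners are not easy to read, here is what the pynvml entrypoint above unrolls to once piped into python3 (the pytorch and nvidia-smi variants are analogous):

# The pynvml test entrypoint, written out as a plain script: print the driver
# version and the name of every device visible to the container.
from pynvml import *

nvmlInit()
print("Driver Version:", nvmlSystemGetDriverVersion())
deviceCount = nvmlDeviceGetCount()
for i in range(deviceCount):
    handle = nvmlDeviceGetHandleByIndex(i)
    print("Device", i, ":", nvmlDeviceGetName(handle))
nvmlShutdown()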

You can find the three docker imgs that I made and used on DockerHub:

  • rr5555/gpu_access_test:test_pytorch: basically a cuda12 ubuntu22.04 img with pip3 and pytorch installed (see the associated Dockerfile)
  • rr5555/gpu_access_test:test_pynvml: basically a cuda12 ubuntu22.04 img with pip3 and pynvml installed (see the associated Dockerfile)
  • rr5555/gpu_access_test:test_nvidia_smi: just a cuda12 ubuntu22.04 img (see the associated Dockerfile)

The edited results of the tests can be found below, one entry per configuration (giving the pytorch, pynvml and nvidia-smi outputs):

  • Control:
      pytorch:    torch.cuda.is_available():True, torch.cuda.device_count():4
      pynvml:     Driver Version: 535.171.04; Device 0 : NVIDIA TITAN X (Pascal); Device 1 : NVIDIA TITAN X (Pascal); Device 2 : NVIDIA TITAN X (Pascal); Device 3 : NVIDIA TITAN V
      nvidia-smi: 0 NVIDIA TITAN X (Pascal); 1 NVIDIA TITAN X (Pascal); 2 NVIDIA TITAN X (Pascal); 3 NVIDIA TITAN V

  • device_ids:
      pytorch:    torch.cuda.is_available():True, torch.cuda.device_count():1
      pynvml:     Driver Version: 535.171.04; Device 0 : NVIDIA TITAN X (Pascal)
      nvidia-smi: 0 NVIDIA TITAN X (Pascal)

  • GPU_ID:
      pytorch:    torch.cuda.is_available():True, torch.cuda.device_count():4
      pynvml:     Driver Version: 535.171.04; Device 0 : NVIDIA TITAN X (Pascal); Device 1 : NVIDIA TITAN X (Pascal); Device 2 : NVIDIA TITAN X (Pascal); Device 3 : NVIDIA TITAN V
      nvidia-smi: 0 NVIDIA TITAN X (Pascal); 1 NVIDIA TITAN X (Pascal); 2 NVIDIA TITAN X (Pascal); 3 NVIDIA TITAN V

  • CUDA_VISIBLE_DEVICES:
      pytorch:    torch.cuda.is_available():True, torch.cuda.device_count():1
      pynvml:     Driver Version: 535.171.04; Device 0 : NVIDIA TITAN X (Pascal); Device 1 : NVIDIA TITAN X (Pascal); Device 2 : NVIDIA TITAN X (Pascal); Device 3 : NVIDIA TITAN V
      nvidia-smi: 0 NVIDIA TITAN X (Pascal); 1 NVIDIA TITAN X (Pascal); 2 NVIDIA TITAN X (Pascal); 3 NVIDIA TITAN V

  • NVIDIA_VISIBLE_DEVICES:
      pytorch:    torch.cuda.is_available():True, torch.cuda.device_count():1
      pynvml:     Driver Version: 535.171.04; Device 0 : NVIDIA TITAN X (Pascal)
      nvidia-smi: 0 NVIDIA TITAN X (Pascal)

(I also tested with GPU_IDS, just in case, but only for nvidia-smi; the results were similar to those of GPU_ID.)

Take-aways

It follows that, to restrict access to GPUs, the most consistent ways are device_ids and NVIDIA_VISIBLE_DEVICES.
CUDA_VISIBLE_DEVICES does not have a consistent behaviour across pytorch, pynvml and nvidia-smi.
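This asymmetry matches how the variables work: CUDA_VISIBLE_DEVICES is interpreted by the CUDA runtime (which is what pytorch goes through), whereas NVML, and therefore pynvml and nvidia-smi, enumerates whatever devices the container was actually given. A small hedged sketch of that check, to be run inside a container that has access to all GPUs (this is not one of the repo's tests):

# Hedged sketch (not one of the repo's tests), to run inside a container that
# was given all GPUs: CUDA_VISIBLE_DEVICES restricts what the CUDA runtime
# (and hence pytorch) sees, but NVML/pynvml still enumerates every device.
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # must be set before CUDA is initialised

import pynvml
import torch

print("torch.cuda.device_count():", torch.cuda.device_count())  # expected: 1

pynvml.nvmlInit()
print("pynvml nvmlDeviceGetCount():", pynvml.nvmlDeviceGetCount())  # expected: all GPUs given to the container
pynvml.nvmlShutdown()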




Please do not hesitate to leave a comment below or DM me if you have any feedback, whether it is about a mistake I made in the article or suggestions for improvements.


    Other articles:

  • Self-hosted ClearML Server behind a firewall
  • Reverse-tunneling to bypass a firewall with no fixed IP address