Using Nvidia GPUs in Podman containers in Fedora 37
Okay, why not Docker though?
It's been a while since Fedora moved away from Docker to Podman. I'm not going into the nitty-gritty details of Docker vs. Podman, but if you really want to know more you can read this article.
Source
This article has been compiled from https://ask.fedoraproject.org/t/how-to-run-tensorflow-gpu-in-podman/8486/9.
Getting started - Nvidia Driver
Make sure to have the nonfree RPM Fusion Nvidia driver installed before you start, because nvidia-container-toolkit
will need that driver. (No, the open source driver won't do.)
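Not sure which driver you're currently on? One quick way to tell is to look at the loaded kernel modules:
# if this prints nvidia, the proprietary driver is loaded;
# if it prints nouveau, you're still on the open source one
lsmod | grep -E '^(nvidia|nouveau)'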
If you don't have it already, then run the following commands to install it.
sudo rpm -Uvh http://download1.rpmfusion.org/free/fedora/rpmfusion-free-release-$(rpm -E %fedora).noarch.rpm
sudo rpm -Uvh http://download1.rpmfusion.org/nonfree/fedora/rpmfusion-nonfree-release-$(rpm -E %fedora).noarch.rpm
# install the driver
# as per : https://rpmfusion.org/Howto/NVIDIA
sudo dnf update -y
sudo dnf install -y akmod-nvidia
# CUDA support (this also provides nvidia-smi), needed by the container toolkit
sudo dnf install -y xorg-x11-drv-nvidia-cuda
Wait for a while (akmods needs to build the kernel modules for the driver), then type reboot
in the terminal and hit enter.
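If you'd rather not guess how long to wait, modinfo (this check is straight from the RPM Fusion howto) will report a version as soon as the module has been built:
# prints the driver version (e.g. 520.56.06) once the module is built;
# an error means akmods hasn't finished yet
modinfo -F version nvidia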
Install Podman
Podman should be installed by default but there's no harm in checking!
podman --version
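In the unlikely case it's missing, it's a one-liner away:
sudo dnf install -y podman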
Check the CUDA version on host
Running nvidia-smi should suffice.
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.56.06    Driver Version: 520.56.06    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:2D:00.0  On |                  N/A |
|  0%   39C    P8    27W / 420W |   1035MiB / 24576MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      7451      G   /usr/libexec/Xorg                 380MiB |
|    0   N/A  N/A      7651      G   /usr/bin/gnome-shell              205MiB |
|    0   N/A  N/A      7958      G   ...mail/                            4MiB |
|    0   N/A  N/A      8904    C+G   ...933855740524873253,131072      155MiB |
|    0   N/A  N/A     15775      G   ...e/Steam/ubuntu12_32/steam      101MiB |
|    0   N/A  N/A     16072      G   ...ef_log.txt --shared-files      183MiB |
+-----------------------------------------------------------------------------+
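As a side note, if you'd rather grab the driver version in a script than eyeball the table, nvidia-smi has a query mode:
# machine-readable driver version, handy for scripting
nvidia-smi --query-gpu=driver_version --format=csv,noheader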
Install the nvidia-container-toolkit
Nvidia officially provides container toolkit releases for RHEL but not for Fedora. Since Fedora is closely related to RHEL, the RHEL packages work on Fedora as well, so use the repository for the latest RHEL release (currently 9.x).
distribution=rhel9.0
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
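You can confirm the repo was picked up before installing anything:
# the nvidia-container-toolkit repo should show up in the list
sudo dnf repolist | grep -i nvidia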
Now you can install it.
sudo dnf clean expire-cache \
&& sudo dnf install -y nvidia-container-toolkit
Configuration tweaks to run rootless containers
By default, the container toolkit requires that you run GPU containers as root. That isn't ideal and can open up security issues in many cases; running them as a regular user is the better alternative. The sed below flips no-cgroups to true in /etc/nvidia-container-runtime/config.toml, so the runtime hook doesn't try to manage cgroups, which rootless containers aren't allowed to do.
sudo sed -i 's/^#no-cgroups = false/no-cgroups = true/;' /etc/nvidia-container-runtime/config.toml
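You can verify the change took effect:
# should now print: no-cgroups = true
grep no-cgroups /etc/nvidia-container-runtime/config.toml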
Does it work?
There are a few ways to check. We can run some GPU-specific code inside a container, or just call nvidia-smi from one. Either way, we need an image. You can find CUDA images on Docker Hub at hub.docker.com/r/nvidia/cuda. Visit the tags
section and pick the one that matches the CUDA version on your host OS (check the nvidia-smi
output from earlier).
Let's try with nvidia-smi.
podman run --rm --security-opt=label=disable \
--hooks-dir=/usr/share/containers/oci/hooks.d/ \
nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
When asked to select a registry, pick the docker.io one. If you see output similar to the one below after running the container, your installation was successful. If you get an error at this point instead, it's most likely an issue with the driver (especially if you're still on the open source GPU driver) or a CUDA version mismatch.
Now why do we need that --security-opt=label=disable flag? Because SELinux will block your container from accessing the GPU otherwise.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.56.06    Driver Version: 520.56.06    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:2D:00.0  On |                  N/A |
|  0%   38C    P8    26W / 420W |    993MiB / 24576MiB |      6%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
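If you want to go beyond nvidia-smi and run some actual GPU code, the TensorFlow GPU image (the workload from the forum thread this article is based on) is an easy test. Note that the image is several gigabytes, and you may want to pin a tag matching your CUDA version instead of latest-gpu:
podman run --rm --security-opt=label=disable \
    --hooks-dir=/usr/share/containers/oci/hooks.d/ \
    docker.io/tensorflow/tensorflow:latest-gpu \
    python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"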
Now you can try defining a Dockerfile (or Containerfile, as Podman calls it) with the code you want to run on your GPU, and it should work fine. You can also try the sample workload container image from Nvidia.
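Here's a minimal sketch of that. The image tag and the gpu-test name are just examples; match the tag to your host's CUDA version and swap the nvidia-smi placeholder for your real workload:
# write a tiny Containerfile based on the cuda image
cat > Containerfile <<'EOF'
FROM docker.io/nvidia/cuda:11.8.0-base-ubuntu22.04
# replace this placeholder with your actual GPU workload
CMD ["nvidia-smi"]
EOF

# build it and run it with the same flags as before
podman build -t gpu-test .
podman run --rm --security-opt=label=disable \
    --hooks-dir=/usr/share/containers/oci/hooks.d/ \
    gpu-test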
Done!
That's kinda it!