Using Nvidia GPUs in Podman containers in Fedora 34

Okay why not docker though?

It's been a while since Fedora moved away from Docker to Podman. I'm not going into the nitty-gritty details of Docker vs Podman here, but if you really want to know more you can read this article.

Source

This article has been compiled from https://ask.fedoraproject.org/t/how-to-run-tensorflow-gpu-in-podman/8486/9.

Getting started - Nvidia Driver

Make sure you have the non-free RPM Fusion Nvidia driver installed before you start, because nvidia-container-runtime needs that driver. (No, the open source nouveau driver won't do.)
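
If you're not sure which driver is currently bound to the GPU, a quick check looks like this:

# look for "nvidia" (good) vs "nouveau" (open source) in the "Kernel driver in use" line
lspci -k | grep -EA3 'VGA|3D'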

If you don't have it already, then run the following commands to install it.

sudo rpm -Uvh http://download1.rpmfusion.org/free/fedora/rpmfusion-free-release-$(rpm -E %fedora).noarch.rpm

sudo rpm -Uvh http://download1.rpmfusion.org/nonfree/fedora/rpmfusion-nonfree-release-$(rpm -E %fedora).noarch.rpm

# install the driver
# as per : https://rpmfusion.org/Howto/NVIDIA

sudo dnf update -y
sudo dnf install akmod-nvidia
sudo dnf install xorg-x11-drv-nvidia-cuda

Wait for a while (akmods needs to build the kernel module for the driver), then type reboot in the terminal and hit Enter.
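
Before rebooting, you can confirm that the module has actually been built; if the following prints a version number, akmods is done:

# prints the Nvidia driver version once the kernel module is built
modinfo -F version nvidia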

Install Podman

Podman should be installed by default but there's no harm in checking!

podman --version
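
If it turns out Podman isn't installed, it's one dnf command away:

sudo dnf install -y podman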

Check the CUDA version on host

Running nvidia-smi should suffice.

 nvidia-smi

Sun Sep  5 00:56:51 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 470.63.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:2D:00.0  On |                  N/A |
|  0%   41C    P8    23W / 420W |    795MiB / 24265MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2254      G   /usr/libexec/Xorg                 549MiB |
|    0   N/A  N/A      2380      G   /usr/bin/gnome-shell              133MiB |
|    0   N/A  N/A     10495      G   /usr/lib64/firefox/firefox         16MiB |
|    0   N/A  N/A     14440      G   ...AAAAAAAAA= --shared-files        9MiB |
|    0   N/A  N/A     14647      G   ...AAAAAAAAA= --shared-files       81MiB |
+-----------------------------------------------------------------------------+

The CUDA version supported by the driver on your host (11.4 here) should match the CUDA version of the image you run inside the container.

Install the nvidia-container-runtime

Nvidia officially provides container runtime releases for RHEL but not for Fedora. However, since Fedora is closely compatible with RHEL, the RHEL packages work here too, so just use the RHEL release. First, check which is the latest version of RHEL that has a release for the container runtime here. At the time of writing this article the latest was rhel8.3.

 distribution=rhel8.3

 curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.repo | \
  sudo tee /etc/yum.repos.d/nvidia-container-runtime.repo

Now you can install it with yum.

 sudo yum install nvidia-container-runtime
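
To sanity-check the install, confirm the package landed and (assuming libnvidia-container-tools was pulled in as a dependency) ask it about your GPU:

 rpm -q nvidia-container-runtime

 # should print driver and CUDA info for your GPU
 nvidia-container-cli info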

Configuration tweaks

Open /etc/nvidia-container-runtime/config.toml as superuser with the text editor of your choice and set no-cgroups to true. This stops nvidia-container-cli from trying to manage device cgroups, which doesn't work in rootless containers.
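
If you prefer a one-liner, something like the following sed command should do it, assuming the setting is still in its default commented-out form (#no-cgroups = false):

 sudo sed -i 's/^#\?no-cgroups = false/no-cgroups = true/' /etc/nvidia-container-runtime/config.toml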

Does it work?

There are a few ways to check. We can run some GPU-specific code inside, or just call nvidia-smi inside a Podman container. Either way, we need an image. You can find CUDA images on Docker Hub: hub.docker.com/r/nvidia/cuda. Visit the tags section and pick the one that matches the CUDA version on your host OS (check the nvidia-smi output from earlier).

I'm selecting the tag 11.4.1-cudnn8-runtime-ubuntu18.04 since it matches the CUDA version on my host OS.

Let's try with nvidia-smi.

 podman run -it --rm --security-opt=label=disable nvidia/cuda:11.4.1-cudnn8-runtime-ubuntu18.04 nvidia-smi

When asked to select a registry, select the docker.io one. If you see output similar to the one below after running the container, your installation was successful. However, if you get an error at this point, it's most likely an issue with the driver (especially if you're still on the open source GPU drivers) or a CUDA version mismatch.
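
If you'd rather not be prompted for a registry at all, you can fully qualify the image name instead:

 podman run -it --rm --security-opt=label=disable docker.io/nvidia/cuda:11.4.1-cudnn8-runtime-ubuntu18.04 nvidia-smi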

Now why do we need that --security-opt=label=disable flag? Because SELinux will block your container from accessing the GPU otherwise.

 ✔ docker.io/nvidia/cuda:11.4.1-cudnn8-runtime-ubuntu18.04
Trying to pull docker.io/nvidia/cuda:11.4.1-cudnn8-runtime-ubuntu18.04...
Getting image source signatures
Copying blob 9a97b918599a done  
Copying blob 892b7ccad582 done  
Copying blob a7b1030fa5b0 done  
Copying blob 221371fb3b92 done  
Copying blob feac53061382 done  
Copying blob b20626029a8b done  
Copying blob 9594904f3bfd done  
Copying blob eb6f6fd0b83c done  
Copying config f8a6367a31 done  
Writing manifest to image destination
Storing signatures
Sun Sep  5 00:38:18 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 470.63.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:2D:00.0  On |                  N/A |
|  0%   45C    P8    33W / 420W |   1117MiB / 24265MiB |     17%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

How about some PyTorch?

We'll need our own container image for that, so let's write one. Wait, what code should we run inside? We just need to check if CUDA works, right? We can use the following snippet:

 import torch

 print(torch.cuda.is_available())

Or, fine-tune BERT for some text classification. (I'm not writing the code for that in this post for obvious reasons. You'll find it in this GitHub repo.) Either way, here's the Dockerfile:

# start from the same CUDA image we tested above
FROM nvidia/cuda:11.4.1-cudnn8-runtime-ubuntu18.04

WORKDIR /home/cuda_test
COPY . .

# build tools and Python
RUN apt update && apt install build-essential -y
RUN apt install -y python3
RUN apt install -y python3-pip

# PyTorch wheel built for CUDA 11.1 (cu111), plus the libraries the training script uses
RUN pip3 install torch==1.9.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
RUN pip3 install transformers
RUN pip3 install numpy
RUN pip3 install scikit-learn
RUN pip3 install pytorch-lightning

CMD [ "python3", "main.py" ]

Build the image (use any name you like):

 podman build -t cuda-test-ubuntu1804 .

 # check the built image list
 podman images
 localhost/cuda-test-ubuntu1804  latest                             bf01827801a0  2 minutes ago   11.2 GB
docker.io/nvidia/cuda           11.4.1-cudnn8-runtime-ubuntu18.04  f8a6367a313e  4 weeks ago     3.58 GB

Now let's run the simple snippet from above:

 podman run -it --rm --security-opt=label=disable localhost/cuda-test-ubuntu1804 \
 python3 -c "import torch; print(torch.cuda.is_available())"
 True
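
If you want a bit more than a bare True, the same pattern works with other standard PyTorch calls, for example printing the device name:

 podman run -it --rm --security-opt=label=disable localhost/cuda-test-ubuntu1804 \
 python3 -c "import torch; print(torch.cuda.get_device_name(0))"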

And for the BERT fine-tuning code:

 podman run --rm --security-opt=label=disable localhost/cuda-test-ubuntu1804
Downloading, sit tight!
>> Downloading review_polarity.tar.gz 100.06734377108491%%
Successfully downloaded review_polarity.tar.gz 3127238 bytes

Unzipping ...
Deleting downloaded zip file
prepare_corpus: 100%|##########| 1000/1000 [00:00<00:00, 2818752.69it/s]
prepare_corpus: 100%|##########| 1000/1000 [00:00<00:00, 2543543.97it/s]
Downloading: 100%|##########| 29.0/29.0 [00:00<00:00, 35.0kB/s]
Downloading: 100%|##########| 570/570 [00:00<00:00, 647kB/s]
Downloading: 100%|##########| 213k/213k [00:00<00:00, 531kB/s] 
Downloading: 100%|##########| 436k/436k [00:00<00:00, 1.50MB/s]
Downloading: 100%|##########| 436M/436M [00:03<00:00, 113MB/s]  
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name    | Type      | Params
--------------------------------------
0 | bert    | BertModel | 108 M 
1 | linear  | Linear    | 769   
2 | sigmoid | Sigmoid   | 0     
3 | loss_fn | BCELoss   | 0     
--------------------------------------
108 M     Trainable params
0         Non-trainable params
108 M     Total params
433.244   Total estimated model params size (MB)
/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/data_loading.py:106: UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 32 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  f"The dataloader, {name}, does not have many workers which may be a bottleneck."
/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/data_loading.py:106: UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 32 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  f"The dataloader, {name}, does not have many workers which may be a bottleneck."
Epoch 1: 100%|##########| 100/100 [00:09<00:00, 10.23it/s, loss=0.524, v_num=0, val_loss=0.537]LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]                                                          

/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/data_loading.py:106: UserWarning: The dataloader, test dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 32 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  f"The dataloader, {name}, does not have many workers which may be a bottleneck."
Testing:  92%|#########2| 23/25 [00:01<00:00, 20.64it/s]--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{'f1_score::batch0': 0.7142857313156128,
 'f1_score::batch1': 0.5333333611488342,
 'f1_score::batch10': 0.4000000059604645,
 'f1_score::batch11': 0.625,
 'f1_score::batch12': 0.2857142984867096,
 'f1_score::batch13': 0.4000000059604645,
 'f1_score::batch14': 0.7058823704719543,
 'f1_score::batch15': 0.7142857313156128,
 'f1_score::batch16': 0.7142857313156128,
 'f1_score::batch17': 0.5,
 'f1_score::batch18': 0.75,
 'f1_score::batch19': 0.7272728085517883,
 'f1_score::batch2': 0.6666666865348816,
 'f1_score::batch20': 0.6666666865348816,
 'f1_score::batch21': 0.7692307829856873,
 'f1_score::batch22': 0.75,
 'f1_score::batch23': 0.7368420958518982,
 'f1_score::batch24': 0.6153846383094788,
 'f1_score::batch3': 0.8999999165534973,
 'f1_score::batch4': 0.8421052694320679,
 'f1_score::batch5': 0.7058823704719543,
 'f1_score::batch6': 0.6666666865348816,
 'f1_score::batch7': 0.7142857313156128,
 'f1_score::batch8': 0.6666666865348816,
 'f1_score::batch9': 0.7272728085517883}
--------------------------------------------------------------------------------
Testing: 100%|##########| 25/25 [00:01<00:00, 20.60it/s]

Done!

That's kinda it. I don't know any cheesy lines to finish articles so that'll be it I guess.

 