I don't have any serious processing to do, but a few days ago I decided to check how it could be done if and when the need arises. It can't hurt to have the ability, I thought. Since I already have a Kubernetes cluster, it would be silly to write custom code for distributing the jobs, and GPUs are better suited than CPUs to the heavy lifting I have in mind. A Kubernetes operator sounds like the way to go for the actual interaction with the hardware driver, and sure enough, a quick Google search shows that Nvidia has a GPU operator hosted on GitHub. The idea was too obvious for me to be the first to think along those lines…
Since I am running Ubuntu's Kubernetes distribution, MicroK8s, I also had a look at what they offer. They provide an add-on that bundles the operator and pre-configures it to fit MicroK8s out of the box. That sounds like the way to go: a simple “microk8s enable gpu” is suggested. Unfortunately, that did not work for me despite a number of attempts with various parameters. Maybe it works for others, but in my situation, where the driver is already installed on the nodes that have GPUs and I want to use that host driver, I had no luck even when specifying the latest driver version and forcing the host driver. So, back to square one, and I decided to try my luck with Nvidia's GPU operator “directly”. The MicroK8s add-on installs to the namespace “gpu-operator-resources” by default, so a simple “microk8s disable gpu”, followed by deleting all resources in that namespace to avoid conflicts (“microk8s kubectl delete namespace gpu-operator-resources”), puts us back at a reasonable starting position.
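As a rough sketch, the cleanup described above could be scripted like this. The `run` helper and the `DRY_RUN` switch are illustrative additions, not part of MicroK8s; they default to only printing the commands, since deleting a namespace is destructive. The `nvidia-smi` check assumes the host driver's tooling is on the PATH.

```shell
#!/usr/bin/env bash
# Sketch: reset the MicroK8s GPU add-on before installing the operator by hand.
# DRY_RUN and run() are hypothetical helpers added for illustration.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"   # only print the command
  else
    "$@"          # actually execute it
  fi
}

cleanup_gpu_addon() {
  # Disable the add-on, then remove whatever it left in its default namespace.
  run microk8s disable gpu
  run microk8s kubectl delete namespace gpu-operator-resources --ignore-not-found
  # Confirm that the pre-installed host driver still responds.
  run nvidia-smi
}

cleanup_gpu_addon
```

Running it as-is prints the three commands; setting `DRY_RUN=0` executes them for real.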
In the Nvidia documentation there is a section about the containerd settings to use with MicroK8s, so that the paths match what MicroK8s expects. By also specifying “driver.enabled=false”, which skips deploying the Nvidia driver as a container and uses the pre-installed host driver instead, we have a winner:
microk8s helm install gpu-operator -n gpu-operator --create-namespace \
nvidia/gpu-operator --set driver.enabled=false \
--set toolkit.env[0].name=CONTAINERD_CONFIG \
--set toolkit.env[0].value=/var/snap/microk8s/current/args/containerd-template.toml \
--set toolkit.env[1].name=CONTAINERD_SOCKET \
--set toolkit.env[1].value=/var/snap/microk8s/common/run/containerd.sock \
--set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
--set toolkit.env[2].value=nvidia \
--set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
--set-string toolkit.env[3].value=true
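The flags above can also be kept in a values file, which is easier to read and to put under version control. The following is a sketch under assumptions: the file name `gpu-operator-values.yaml` is arbitrary, the `nvidia` repository alias is assumed to point at Nvidia's chart repository at https://helm.ngc.nvidia.com/nvidia, and `upgrade --install` is used in place of `install` only so the command can be re-run on an existing release.

```shell
#!/usr/bin/env bash
# Sketch: the same settings as the --set flags, written as a Helm values file.
# The file name is an assumption; pick whatever suits your setup.
VALUES_FILE="${VALUES_FILE:-gpu-operator-values.yaml}"

cat > "$VALUES_FILE" <<'EOF'
driver:
  enabled: false            # use the driver already installed on the host
toolkit:
  env:
    - name: CONTAINERD_CONFIG
      value: /var/snap/microk8s/current/args/containerd-template.toml
    - name: CONTAINERD_SOCKET
      value: /var/snap/microk8s/common/run/containerd.sock
    - name: CONTAINERD_RUNTIME_CLASS
      value: nvidia
    - name: CONTAINERD_SET_AS_DEFAULT
      value: "true"         # quoted so it stays a string, like --set-string
EOF

if command -v microk8s >/dev/null 2>&1; then
  # Make sure the chart repository is known before installing from it.
  microk8s helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
  microk8s helm repo update
  microk8s helm upgrade --install gpu-operator nvidia/gpu-operator \
    -n gpu-operator --create-namespace -f "$VALUES_FILE"
else
  echo "microk8s not found; values written to $VALUES_FILE"
fi
```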
At least with that, the resources in the gpu-operator namespace are healthy, the deployment passes the validation test (“microk8s kubectl logs -n gpu-operator -l app=nvidia-operator-validator -c nvidia-operator-validator”), and it can run the CUDA sample application “cuda-vector-add”.
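For reference, a smoke test along those lines might look like the sketch below. The manifest is an assumption: the file name, the pod name, and the image `k8s.gcr.io/cuda-vector-add:v0.1` are placeholders for whichever vector-add sample you use, and that image may have moved registries. The `kubectl wait --for=jsonpath` form needs a reasonably recent kubectl (1.23 or later, as far as I know).

```shell
#!/usr/bin/env bash
# Sketch: request one GPU and run a vector-add sample in a pod.
# File name, pod name and image are assumptions; adjust to your setup.
MANIFEST="${MANIFEST:-cuda-vector-add.yaml}"

cat > "$MANIFEST" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1   # GPU resource exposed through the operator
EOF

if command -v microk8s >/dev/null 2>&1; then
  microk8s kubectl apply -f "$MANIFEST"
  # Wait for the pod to finish, then show its output.
  microk8s kubectl wait --for=jsonpath='{.status.phase}'=Succeeded \
    pod/cuda-vector-add --timeout=120s
  microk8s kubectl logs cuda-vector-add
else
  echo "microk8s not found; manifest written to $MANIFEST"
fi
```

If the pod stays in Pending, “microk8s kubectl describe pod cuda-vector-add” usually shows whether the scheduler could find a node advertising the “nvidia.com/gpu” resource.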
Now I just have to figure out what to do with it… Re-encoding movies, forecasting the local weather based on measurements on the balcony and open weather data, beating the gambling firms or the hedge funds. The opportunities are endless for the naïve developer. 🙂