Hello everyone,
Today the nvidia driver on my server stopped working out of nowhere. It was working yesterday, and I didn’t change anything yesterday or today.
My Plex container stopped working because there was a problem with the nvidia card I use for transcoding. It’s a GTX 1650.
I tried running nvidia-smi and it said:
Failed to initialize NVML: Driver/library version mismatch
After that I tried upgrading my system, because it had been months since my last upgrade and I thought it might help. It didn’t. I also tried rebooting, because some sources said that solves the issue, but it persisted.
It’s driver reinstall time. I purged the driver with
apt purge nvidia*
then installed the driver with
ubuntu-drivers install --gpgpu nvidia:525-server
After a reboot, nvidia-smi gives this error:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
lsmod | grep nvidia shows nothing and /proc/driver/nvidia/version doesn’t exist. I tried starting nvidia-persistenced with systemctl but it gives this error:
Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 113 has read and write permissions for those files.
/dev/nvidia* doesn’t exist.
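For anyone landing here with the same symptoms, the three checks above can be bundled into one read-only script. This is only a sketch: the function names report and nvidia_state are made up for illustration, and it reads /proc/modules directly (the same data lsmod prints) so it changes nothing on the system.

```shell
#!/bin/sh
# Sketch: report which parts of the NVIDIA driver stack are present.
# Read-only; nothing on the system is modified.
# report and nvidia_state are hypothetical helper names.

report() {
    # $1 = label, $2 = "yes" or "no"
    printf '%-40s %s\n' "$1" "$2"
}

nvidia_state() {
    # Is the kernel module loaded? /proc/modules is what lsmod formats.
    if grep -q '^nvidia ' /proc/modules 2>/dev/null; then
        report "nvidia kernel module loaded" yes
    else
        report "nvidia kernel module loaded" no
    fi

    # This file only exists while the nvidia kernel module is loaded.
    if [ -r /proc/driver/nvidia/version ]; then
        report "/proc/driver/nvidia/version present" yes
    else
        report "/proc/driver/nvidia/version present" no
    fi

    # Device files that nvidia-persistenced complains about when missing.
    if ls /dev/nvidia* >/dev/null 2>&1; then
        report "/dev/nvidia* device files present" yes
    else
        report "/dev/nvidia* device files present" no
    fi
}

nvidia_state
```

Three "no" lines would match the situation described in this post: the hardware may be fine, but no nvidia kernel module is loaded.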
I’m very noobish when it comes to nvidia and linux. It was a pain to set up initially, and I was hoping it would never go wrong. But here I am, unfortunately. I don’t really know which logs I should show you or which commands I should run to troubleshoot, so every tip is appreciated, and I will provide logs and anything else if needed.
System info:
- Ubuntu Server 22.04
- kernel: 5.15.0-76-generic
- theoretically installed nvidia driver: nvidia-driver-525-server
Solution
I was using the ubuntu-drivers utility to install the driver, but it turns out it’s not that great. After installing with the manual method from https://help.ubuntu.com/community/NvidiaDriversInstallation using the command
apt install linux-modules-nvidia-${DRIVER_BRANCH}${SERVER}-${LINUX_FLAVOUR}
it’s working again.
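To make the variables in that command concrete, here is a sketch of how they might be filled in for the system described above. The values 525, -server and generic are assumptions inferred from the system info in the post, not something the thread states outright, and modules_pkg and flavour_of are hypothetical helper names. The script only prints the command, it does not install anything.

```shell
#!/bin/sh
# Sketch: build the kernel-modules package name used in the manual method.
# All three values are assumptions inferred from the post's system info.
DRIVER_BRANCH=525        # driver series
SERVER=-server           # "-server" for the server driver, empty otherwise
LINUX_FLAVOUR=generic    # kernel flavour, the last part of `uname -r`

flavour_of() {
    # Hypothetical helper: strip everything up to the last "-" of a
    # kernel release string, e.g. 5.15.0-76-generic gives generic.
    printf '%s\n' "${1##*-}"
}

modules_pkg() {
    # Hypothetical helper: $1 = branch, $2 = server suffix, $3 = flavour.
    printf 'linux-modules-nvidia-%s%s-%s\n' "$1" "$2" "$3"
}

pkg=$(modules_pkg "$DRIVER_BRANCH" "$SERVER" "$LINUX_FLAVOUR")

# Print the command instead of running it, so this is safe to try.
echo "sudo apt install $pkg"
```

With the assumed values this prints sudo apt install linux-modules-nvidia-525-server-generic. On another machine, flavour_of "$(uname -r)" is one way to guess the flavour, but check the result against the package names apt actually offers.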
Does it even show up in lspci? To rule out your OS, boot a live system and see if it’s recognized there. A quick thing to check would be that your GPU is actually powered on (fully seated in the PCIe slot and has the necessary power).
Shows up in lspci. Booting a live OS would be a little bit tricky because it’s in a wall-mounted rack, but I will try that if nothing else works. Thank you.
So it sees the hardware, but the kernel module isn’t being loaded. I’d guess if you tried to load it with modprobe, it would complain about some version mismatch.
So, I’d do the uninstall and reinstall processes on this page: https://help.ubuntu.com/community/NvidiaDriversInstallation
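Before reinstalling, a quick way to test that guess is to ask modinfo whether an nvidia module exists for the running kernel and which kernel it was built for. This is a rough sketch, assuming modinfo is installed; vermagic_matches is a made-up helper name. It does not load anything itself.

```shell
#!/bin/sh
# Sketch: check whether an nvidia module exists for the running kernel
# and whether it was built for that kernel. Nothing is loaded or changed.

vermagic_matches() {
    # Hypothetical helper.
    # $1 = vermagic string from modinfo, $2 = kernel release from uname -r.
    # The first word of vermagic is the kernel release the module targets.
    built_for=${1%% *}
    [ "$built_for" = "$2" ]
}

running=$(uname -r)
if vermagic=$(modinfo -F vermagic nvidia 2>/dev/null) && [ -n "$vermagic" ]; then
    if vermagic_matches "$vermagic" "$running"; then
        echo "nvidia module matches running kernel $running"
    else
        echo "nvidia module built for ${vermagic%% *}, but running $running"
    fi
else
    echo "no nvidia module found for kernel $running"
fi

# To see the actual load error, run these by hand as root:
#   modprobe nvidia
#   dmesg | grep -i -E 'nvidia|nvrm'
```

If it reports that no module was found, the userspace driver may be installed without the matching kernel modules package, which would explain why nvidia-smi cannot talk to the driver.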
I was using the ubuntu-drivers utility that this page mentions too, but it turns out it doesn’t work very well. I have now installed with the manual method from this page using
apt install linux-modules-nvidia-${DRIVER_BRANCH}${SERVER}-${LINUX_FLAVOUR}
and it’s working. Thank you for the suggestion!
Had a similar issue. Downloading the GPU’s exact driver from nvidia, installing it, and restarting worked.