0 background
Recently, the GPU server in our lab kept malfunctioning, and the graphics card driver had to be reinstalled. Most online tutorials are outdated, and many of them are hard to follow.
Installation methods the author tried, and their results:
Download the driver from the official website - installation failed
Use the cuda toolkit installer to install the driver and cuda in one go - installation failed
Therefore, the author adopted the method described in this article.
This tutorial records the steps that worked for me. They should work in an identical system environment. For other environments, treat them as a reference only; success is not guaranteed.
The article aims to be concise and readable, with commands that can be copied and run directly to reproduce the result.
Environment of this document:
Ubuntu Server 20.04. Other systems are for reference only!
Note that this article applies only to Ubuntu Server without a graphical interface. Graphical environments were not considered or verified! Readers who rely on a graphical desktop should proceed with caution!
1 install nvidia driver
1.1 check whether gcc is installed
gcc -v
If it is not installed, run the following command to install build-essential, a set of development packages that includes gcc:
sudo apt-get install build-essential
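The check and the install can be combined so that nothing is installed when gcc is already present. This is only a minimal sketch: have_cmd and ensure_gcc are helper names made up for this example, and the apt-get update step is my own addition.

```shell
# have_cmd: succeeds if the given command exists on PATH (hypothetical helper).
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# ensure_gcc: install build-essential only when gcc is missing.
ensure_gcc() {
    if have_cmd gcc; then
        echo "gcc already installed"
    else
        # Assumption: refreshing the package index first is acceptable here.
        sudo apt-get update && sudo apt-get install -y build-essential
    fi
}

# Usage:
#   ensure_gcc
```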
1.2 disable the nouveau driver
Edit the /etc/modprobe.d/blacklist-nouveau.conf file and add the following:
blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off
Disable kernel mode setting for nouveau:
echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/nouveau-kms.conf
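If you prefer not to open an editor, the blacklist file can also be generated by a small script. The sketch below writes the five lines to whatever path it is given; write_nouveau_blacklist is a made-up helper name, and staging the file in /tmp before copying it into place is just one possible approach.

```shell
# write_nouveau_blacklist: write the nouveau blacklist entries to the file
# given as the first argument (hypothetical helper for this sketch).
write_nouveau_blacklist() {
    cat > "$1" <<'EOF'
blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off
EOF
}

# Usage (stage the file first, then copy it into place as root):
#   write_nouveau_blacklist /tmp/blacklist-nouveau.conf
#   sudo install -m 644 /tmp/blacklist-nouveau.conf /etc/modprobe.d/blacklist-nouveau.conf
```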
When complete, regenerate the initramfs and restart:
sudo update-initramfs -u
sudo reboot
After the restart, execute: lsmod | grep nouveau. If nothing is printed, nouveau has been disabled successfully. Otherwise, repeat subsection 1.2.
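For use in a script, the same check can return a yes/no answer instead of relying on empty output. A minimal sketch, assuming the usual lsmod layout with the module name in the first column; nouveau_loaded is a name made up for this example.

```shell
# nouveau_loaded: read lsmod-style text on stdin and succeed (exit 0)
# only if a module named exactly "nouveau" is listed in the first column.
nouveau_loaded() {
    awk '$1 == "nouveau" { found = 1 } END { exit !found }'
}

# Usage:
#   if lsmod | nouveau_loaded; then
#       echo "nouveau is still loaded, repeat subsection 1.2"
#   else
#       echo "nouveau is disabled"
#   fi
```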
1.3 install the driver
Use the command ubuntu-drivers devices to list the available drivers. If the command does not exist, install it first (it is provided by the ubuntu-drivers-common package).
The output is as follows (it varies with the hardware configuration. An error is also reported here, but it does not affect the result):
ERROR:root:could not open aplay -l
Traceback (most recent call last):
  File "/usr/share/ubuntu-drivers-common/detect/sl-modem.py", line 35, in detect
    aplay = subprocess.Popen(
  File "/usr/lib/python3.8/subprocess.py", line 854, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib/python3.8/subprocess.py", line 1702, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'aplay'
== /sys/devices/pci0000:17/0000:17:00.0/0000:18:00.0 ==
modalias : pci:v000010DEd00002204sv000010DEsd00001454bc03sc00i00
vendor   : NVIDIA Corporation
driver   : nvidia-driver-470 - distro non-free recommended
driver   : nvidia-driver-460 - distro non-free
driver   : nvidia-driver-470-server - distro non-free
driver   : nvidia-driver-495 - distro non-free
driver   : nvidia-driver-460-server - distro non-free
driver   : xserver-xorg-video-nouveau - distro free builtin
Look through the driver lines above and find the one marked recommended, here nvidia-driver-470. Since this machine runs Ubuntu Server, I finally chose nvidia-driver-470-server.
Execute the command to install the driver: sudo apt install nvidia-driver-470-server
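Picking the recommended entry out of the listing can also be scripted. The sketch below assumes the output format shown above, where the recommended line ends with the word "recommended"; recommended_driver is a made-up name. It only prints the package name; choosing the -server variant is still a manual decision, and that variant should actually appear in the list.

```shell
# recommended_driver: read `ubuntu-drivers devices` output on stdin and
# print the package name from the first "driver" line marked recommended.
recommended_driver() {
    awk '$1 == "driver" && $NF == "recommended" { print $3; exit }'
}

# Usage (stderr is discarded to hide the harmless aplay traceback):
#   ubuntu-drivers devices 2>/dev/null | recommended_driver
```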
After the installation is completed, execute nvidia-smi. If the gpu monitoring table is printed, the driver installation is successful! The monitoring information shows CUDA Version 11.4, which is the highest cuda version this driver supports, so we install that version of the cuda toolkit in the next section.
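To read the cuda version without looking at the table by eye, the header of the nvidia-smi output can be parsed. This is a rough sketch that assumes the header contains the text "CUDA Version: X.Y", as it does with the 470 driver; cuda_version_from_smi is a name made up for this example.

```shell
# cuda_version_from_smi: read nvidia-smi's default output on stdin and
# print the version number that follows "CUDA Version:" in the header.
cuda_version_from_smi() {
    sed -n 's/.*CUDA Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Usage:
#   nvidia-smi | cuda_version_from_smi
```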
2 install cuda
Find the corresponding version at https://developer.nvidia.com/cuda-toolkit-archive. Here we use version 11.4 and the runfile (local) installer.
Directly enter the following command:
wget https://developer.download.nvidia.com/compute/cuda/11.4.0/local_installers/cuda_11.4.0_470.42.01_linux.run
sudo sh cuda_11.4.0_470.42.01_linux.run
Note that when the installer warns that a driver is already installed, just continue. On the component selection screen, be sure to clear the X in front of the driver entry, because we have already installed the driver!!!
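As an alternative to clearing the driver entry by hand, the runfile can be run non-interactively with its --silent and --toolkit options, which install only the toolkit. I did not use this route myself, so treat the following as an untested sketch; install_toolkit_only is a made-up helper name.

```shell
# install_toolkit_only: run the cuda runfile without the interactive menu,
# installing the toolkit but not the bundled driver.
install_toolkit_only() {
    runfile=$1
    if [ ! -f "$runfile" ]; then
        echo "runfile not found: $runfile" >&2
        return 1
    fi
    sudo sh "$runfile" --silent --toolkit
}

# Usage:
#   install_toolkit_only cuda_11.4.0_470.42.01_linux.run
```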
Restart after installation and enter nvcc -V. If the version information is displayed, the installation is successful!
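If nvcc is reported as not found, one possible cause is that the toolkit's bin directory is not on PATH. The sketch below assumes the runfile used its default install prefix, /usr/local/cuda-11.4; if you chose another location, pass it as the argument. add_cuda_to_path is a name made up for this example.

```shell
# add_cuda_to_path: prepend the cuda bin and lib64 directories to PATH and
# LD_LIBRARY_PATH, skipping entries that are already present.
# Assumption: the default prefix /usr/local/cuda-11.4 unless one is given.
add_cuda_to_path() {
    cuda_home=${1:-/usr/local/cuda-11.4}
    case ":$PATH:" in
        *":$cuda_home/bin:"*) ;;
        *) PATH="$cuda_home/bin:$PATH" ;;
    esac
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$cuda_home/lib64:"*) ;;
        *) LD_LIBRARY_PATH="$cuda_home/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
    esac
    export PATH LD_LIBRARY_PATH
}

# Usage (for example at the end of ~/.bashrc):
#   add_cuda_to_path
#   nvcc -V
```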
3 fix the server becoming unreachable over ssh after a while
After installing the driver, I found that the server could no longer be reached over ssh after running for a while, and I had to restart it. Later, after reading the server log, I found that the server was set to suspend automatically.
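One way to look for this in the logs is to filter the journal for sleep-related messages. This is a sketch rather than the exact search I ran: the keyword list is a guess, find_sleep_events is a made-up name, and reading the previous boot with journalctl -b -1 only works if the journal is stored persistently.

```shell
# find_sleep_events: read log lines on stdin and print those that mention
# suspend, hibernate, or sleep (case-insensitive). Prints nothing if none match.
find_sleep_events() {
    grep -iE 'suspend|hibernat|sleep' || true
}

# Usage (log of the previous boot):
#   journalctl -b -1 --no-pager | find_sleep_events
```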
Enter the command: systemctl status sleep.target
Output information:
● sleep.target - Sleep
     Loaded: loaded (/lib/systemd/system/sleep.target; static; vendor preset: enabled)
     Active: inactive (dead)
       Docs: man:systemd.special(7)
The Loaded line shows loaded rather than masked, which indicates that automatic sleep is still enabled.
To disable it, enter the command: sudo systemctl mask sleep.target suspend.target hibernate.target hybrid-sleep.target
Enter the command: systemctl status sleep.target
Output information:
● sleep.target
     Loaded: masked (Reason: Unit sleep.target is masked.)
     Active: inactive (dead)
The unit is now masked, so sleep has been disabled successfully!
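All four targets can be checked in one pass instead of reading the status output one unit at a time. A minimal sketch that relies on systemctl is-enabled printing "masked" for a masked unit; unmasked_sleep_targets is a name made up for this example.

```shell
# unmasked_sleep_targets: print every sleep-related target that is not masked,
# together with its reported state. No output means all four are masked.
unmasked_sleep_targets() {
    for unit in sleep.target suspend.target hibernate.target hybrid-sleep.target; do
        # is-enabled exits non-zero for masked units, so ignore its status.
        state=$(systemctl is-enabled "$unit" 2>/dev/null || true)
        [ "$state" = "masked" ] || printf '%s (%s)\n' "$unit" "${state:-unknown}"
    done
}

# Usage:
#   unmasked_sleep_targets
```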
One more reminder:
Note that this article applies only to Ubuntu Server without a graphical interface. Graphical environments were not considered or verified! Readers who rely on a graphical desktop should proceed with caution!
If you rely on a graphical interface, keep in mind that this article has not considered or verified that case, and it is uncertain whether problems will occur. Please use it together with other tutorials!