[2022 new tutorial] How to install the NVIDIA driver and CUDA on Ubuntu Server 20.04, and fix the server becoming unreachable over ssh after a period of time

Posted by yddib on Thu, 11 Nov 2021 02:31:35 +0100

0 background

Recently the GPU server in our lab kept failing, and the graphics card driver had to be reinstalled. Most online tutorials are outdated, and many of them are hard to follow.
Installation methods I tried, and their results:
Downloading the driver from the official website: installation failed
Installing the driver and CUDA in one go from the CUDA toolkit installer: installation failed
I therefore settled on the method described in this article.
This tutorial records the procedure that worked for me. It should work in the same system environment. For other environments it is a reference only, and success is not guaranteed.
The aim is to be concise, reproducible and readable, with commands that can be copied and run directly.
Environment of this document:
Ubuntu Server 20.04. For other systems, treat this article as a reference only!
Note that this article applies only to Ubuntu Server without a graphical interface. It has not been designed for or verified on systems with a graphical interface, so readers who rely on one should proceed with caution!

1 install nvidia driver

1.1 check whether gcc is installed

gcc -v

If it is not installed, run the following command, which installs gcc together with other common development tools:

sudo apt-get install build-essential
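
The check and the install can be combined so that the step is safe to repeat. The following is only a sketch: the helper names have_cmd and ensure_gcc are made up for illustration and are not part of any package.

```shell
# Succeeds when the named command is on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Installs build-essential only when gcc is missing.
ensure_gcc() {
    if have_cmd gcc; then
        echo "gcc is already installed"
    else
        sudo apt-get update && sudo apt-get install -y build-essential
    fi
}
```

Calling ensure_gcc on the server either reports that gcc is present or installs the same package as the command above.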

1.2 disable the nouveau driver

Edit the /etc/modprobe.d/blacklist-nouveau.conf file and add the following:

blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off

Disable kernel mode setting for nouveau:

echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/nouveau-kms.conf
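
If you prefer not to open an editor, the blacklist file can also be written from the command line. A minimal sketch, assuming the same five entries listed above. The function name nouveau_blacklist is only illustrative.

```shell
# Prints the blacklist entries from this subsection, one per line.
nouveau_blacklist() {
    cat <<'EOF'
blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off
EOF
}

# On the server, write the file in one step (overwrites any existing copy):
# nouveau_blacklist | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
```

The tee line is left as a comment so that defining the function changes nothing until you decide to run it.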

When complete, regenerate the initramfs and restart:

sudo update-initramfs -u
sudo reboot

After the restart, run lsmod | grep nouveau. If it prints nothing, nouveau has been disabled successfully. Otherwise, repeat subsection 1.2.
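
For scripted setups, the same check can be expressed as an exit status instead of reading the screen. This is a sketch only, and the function name nouveau_absent is an assumption of this example.

```shell
# Reads lsmod output on stdin and succeeds only if nouveau is not listed.
nouveau_absent() {
    ! grep -qw nouveau
}

# On the server:
# lsmod | nouveau_absent && echo "nouveau is disabled"
```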

1.3 install the driver

Run ubuntu-drivers devices to list the available drivers. If the command does not exist, install it with sudo apt install ubuntu-drivers-common.
The output is as follows (it differs from machine to machine depending on the hardware. The aplay error at the top appeared on my server as well and does not affect the result):

ERROR:root:could not open aplay -l
Traceback (most recent call last):
  File "/usr/share/ubuntu-drivers-common/detect/sl-modem.py", line 35, in detect
    aplay = subprocess.Popen(
  File "/usr/lib/python3.8/subprocess.py", line 854, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib/python3.8/subprocess.py", line 1702, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'aplay'
== /sys/devices/pci0000:17/0000:17:00.0/0000:18:00.0 ==
modalias : pci:v000010DEd00002204sv000010DEsd00001454bc03sc00i00
vendor   : NVIDIA Corporation
driver   : nvidia-driver-470 - distro non-free recommended
driver   : nvidia-driver-460 - distro non-free
driver   : nvidia-driver-470-server - distro non-free
driver   : nvidia-driver-495 - distro non-free
driver   : nvidia-driver-460-server - distro non-free
driver   : xserver-xorg-video-nouveau - distro free builtin

In the output above, find the driver lines and then the one marked recommended, here nvidia-driver-470. Since this is Ubuntu Server, I chose the server variant, nvidia-driver-470-server.
Install the driver with: sudo apt install nvidia-driver-470-server
After the installation completes, run nvidia-smi. If it prints the GPU monitoring table, the driver was installed successfully! The table reports CUDA version 11.4, which is the highest CUDA version this driver supports, so that is the toolkit version installed in the next section.
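
Picking the recommended driver out of the listing can also be done with a small filter. This is a sketch that assumes the line layout shown in the output above. The function name recommended_driver is made up for this example.

```shell
# Reads `ubuntu-drivers devices` output on stdin and prints the first
# package whose line is marked "recommended".
recommended_driver() {
    awk '!found && $1 == "driver" && /recommended/ { print $3; found = 1 }'
}

# On the server:
# ubuntu-drivers devices | recommended_driver
```

With the listing above this prints nvidia-driver-470. Whether a matching -server package exists still has to be checked in the same listing before installing it.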

2 install cuda

Find the matching version at https://developer.nvidia.com/cuda-toolkit-archive. Here we use version 11.4 and the runfile installer.
Run the following commands:

wget https://developer.download.nvidia.com/compute/cuda/11.4.0/local_installers/cuda_11.4.0_470.42.01_linux.run
sudo sh cuda_11.4.0_470.42.01_linux.run

Note: when the installer warns that a driver is already installed, choose to continue. On the component selection screen, be sure to clear the X in front of the driver entry, because the driver is already installed!!!
Restart after the installation and run nvcc -V. If it prints the version information, the installation succeeded!
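
If nvcc is reported as not found, the toolkit is probably installed but not yet on PATH. The sketch below assumes the runfile used its default prefix, /usr/local/cuda-11.4. Adjust CUDA_HOME if you chose another location in the installer.

```shell
# Assumption: default install prefix of the CUDA 11.4 runfile.
CUDA_HOME=/usr/local/cuda-11.4

# Make nvcc and the CUDA libraries visible to the current shell.
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Optional, for unattended installs: the runfile also accepts flags that
# install only the toolkit without the interactive menu, for example:
# sudo sh cuda_11.4.0_470.42.01_linux.run --silent --toolkit
```

Appending the CUDA_HOME line and the two export lines to ~/.bashrc makes the setting persist across logins.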

3 solve the problem that the server cannot be reached over ssh after a period of time

After installing the driver, I found that the server became unreachable over ssh after a period of time and had to be restarted. The server log showed that the system had been suspending itself automatically.
Run: systemctl status sleep.target
Output:

● sleep.target - Sleep
Loaded: loaded (/lib/systemd/system/sleep.target; static; vendor preset: enabled)
Active: inactive (dead)
Docs: man:systemd.special(7)

The unit is shown as loaded rather than masked, which means the system is still allowed to go to sleep.
To disable it, run: sudo systemctl mask sleep.target suspend.target hibernate.target hybrid-sleep.target
Then run again: systemctl status sleep.target
Output:

● sleep.target
Loaded: masked (Reason: Unit sleep.target is masked.)
Active: inactive (dead)

The unit is now reported as masked: sleep has been disabled successfully!
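
To confirm all four targets without reading each status screen, the Loaded line can be tested in a loop. This is a sketch under the assumption that the status output has the layout shown above. The function name is_masked is only illustrative.

```shell
# Reads `systemctl status <unit>` output on stdin and succeeds if the
# unit is reported as masked.
is_masked() {
    grep -q 'Loaded: masked'
}

# On the server:
# for t in sleep.target suspend.target hibernate.target hybrid-sleep.target; do
#     systemctl status "$t" | is_masked && echo "$t is masked"
# done
```

Any target that does not print a line in that loop is still unmasked and needs the mask command again.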
A final reminder:
Note that this article applies only to Ubuntu Server without a graphical interface. It has not been designed for or verified on systems with a graphical interface, so readers who rely on one should proceed with caution!
If you rely on a graphical interface, keep in mind that this procedure has not been checked against that setup and may cause problems. Please use it together with other tutorials!

4 references

https://blog.csdn.net/qq_34387533/article/details/116011839
https://www.cnblogs.com/pprp/p/9430836.html
https://zhuanlan.zhihu.com/p/393152883
