I'm running an AWS EC2 g2.2xlarge instance with Ubuntu 14.04 LTS. I'd like to observe GPU utilization while training my TensorFlow model, but I get an error when I try to run 'nvidia-smi'.
ubuntu@ip-10-0-1-213:/etc/alternatives$ cd /usr/lib/nvidia-375/bin
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ ls
nvidia-bug-report.sh     nvidia-debugdump     nvidia-xconfig
nvidia-cuda-mps-control  nvidia-persistenced
nvidia-cuda-mps-server   nvidia-smi
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ ./nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ dpkg -l | grep nvidia
ii  nvidia-346             352.63-0ubuntu0.14.04.1  amd64  Transitional package for nvidia-346
ii  nvidia-346-dev         346.46-0ubuntu1          amd64  NVIDIA binary Xorg driver development files
ii  nvidia-346-uvm         346.96-0ubuntu0.0.1      amd64  Transitional package for nvidia-346
ii  nvidia-352             375.26-0ubuntu1          amd64  Transitional package for nvidia-375
ii  nvidia-375             375.39-0ubuntu0.14.04.1  amd64  NVIDIA binary driver - version 375.39
ii  nvidia-375-dev         375.39-0ubuntu0.14.04.1  amd64  NVIDIA binary Xorg driver development files
ii  nvidia-modprobe        375.26-0ubuntu1          amd64  Load the NVIDIA kernel driver and create device files
ii  nvidia-opencl-icd-346  352.63-0ubuntu0.14.04.1  amd64  Transitional package for nvidia-opencl-icd-352
ii  nvidia-opencl-icd-352  375.26-0ubuntu1          amd64  Transitional package for nvidia-opencl-icd-375
ii  nvidia-opencl-icd-375  375.39-0ubuntu0.14.04.1  amd64  NVIDIA OpenCL ICD
ii  nvidia-prime           0.6.2.1                  amd64  Tools to enable NVIDIA's Prime
ii  nvidia-settings        375.26-0ubuntu1          amd64  Tool for configuring the NVIDIA graphics driver

ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ lspci | grep -i nvidia
00:03.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)

$ inxi -G
Graphics:  Card-1: Cirrus Logic GD 5446
           Card-2: NVIDIA GK104GL [GRID K520]
           X.org: 1.15.1 driver: N/A tty size: 80x24 Advanced Data: N/A out of X

$ lspci -k | grep -A 2 -E "(VGA|3D)"
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
        Subsystem: XenSource, Inc. Device 0001
        Kernel driver in use: cirrus
00:03.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)
        Subsystem: NVIDIA Corporation Device 1014
00:1f.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)
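In the `lspci -k` output above, the Cirrus device reports `Kernel driver in use: cirrus`, while the GRID K520 entry has no such line, which suggests that no kernel module is bound to the GPU. A minimal sketch for spotting this automatically follows; the function name `devices_without_driver` is made up for illustration, and it assumes only the `lspci -k` layout shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `lspci -k` output on stdin and print every
# VGA/3D device entry that has no "Kernel driver in use" line.
devices_without_driver() {
  awk '
    # A line that does not start with whitespace begins a new device entry.
    /^[^ \t]/ {
      if (dev != "" && !bound) print dev
      dev = ""; bound = 0
      if ($0 ~ /VGA compatible controller|3D controller/) dev = $0
    }
    /Kernel driver in use:/ { bound = 1 }
    END { if (dev != "" && !bound) print dev }
  '
}

# Example (needs pciutils installed):
#   lspci -k | devices_without_driver
```

An empty result would mean every graphics device has a driver bound; any line printed names a device that the kernel is not driving.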
I installed CUDA 7 and cuDNN by following these instructions:
$ sudo apt-get -q2 update
$ sudo apt-get upgrade
$ sudo reboot
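=======================================================================
After rebooting, update the initramfs by running '$ sudo update-initramfs -u'.
Now edit the /etc/modprobe.d/blacklist.conf file to blacklist nouveau. Open the file in an editor and insert the following lines at the end of the file.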
blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off
Save and exit the file.
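If you prefer to script this step instead of editing the file by hand, one possible approach is to append each line only when it is not already present, so that re-running the setup does not duplicate entries. This is a sketch only; `ensure_line` is a hypothetical helper, not part of the original instructions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append a line to a file only if it is not already there.
ensure_line() {
  local file="$1" line="$2"
  # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
  grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Example (run as root; path taken from the instructions above):
#   conf=/etc/modprobe.d/blacklist.conf
#   ensure_line "$conf" "blacklist nouveau"
#   ensure_line "$conf" "blacklist lbm-nouveau"
#   ensure_line "$conf" "options nouveau modeset=0"
#   ensure_line "$conf" "alias nouveau off"
#   ensure_line "$conf" "alias lbm-nouveau off"
```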
Now install the build-essential tools, update the initramfs, and reboot as follows:
$ sudo apt-get install linux-{headers,image,image-extra}-$(uname -r) build-essential
$ sudo update-initramfs -u
$ sudo reboot
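The packages above are pinned with `$(uname -r)` because the driver installer builds a kernel module against the headers of the kernel that is actually running. A quick sanity check before running the installer might look like the sketch below. It assumes the usual Ubuntu layout, where `/lib/modules/<release>/build` points at the installed headers; `headers_present` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if kernel headers for a release look installed.
#   $1 = kernel release (defaults to the running kernel)
#   $2 = filesystem root prefix (defaults to empty, i.e. "/"; handy for testing)
headers_present() {
  local release="${1:-$(uname -r)}" root="${2:-}"
  # On Ubuntu this path is normally a symlink into /usr/src/linux-headers-*.
  [ -d "$root/lib/modules/$release/build" ]
}

# Example:
#   headers_present || echo "headers for $(uname -r) are missing"
```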
========================================================================
After rebooting, run the following commands to install Nvidia.
$ sudo wget http://developer.download.nvidia.com/compute/cuda/7_0/Prod/local_installers/cuda_7.0.28_linux.run
$ sudo chmod 700 ./cuda_7.0.28_linux.run
$ sudo ./cuda_7.0.28_linux.run
$ sudo update-initramfs -u
$ sudo reboot
========================================================================
Now that the system is up, run the following commands to verify the installation.
$ sudo modprobe nvidia
$ sudo nvidia-smi -q | head
You should see output like 'nvidia.png'.
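Since `nvidia-smi` can only talk to the driver once the kernel module is loaded, it may help to confirm that `modprobe nvidia` actually took effect. A small sketch that inspects `lsmod` output follows; `module_loaded` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `lsmod` output on stdin and succeed if the
# named module appears in the first column.
module_loaded() {
  awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

# Example:
#   lsmod | module_loaded nvidia && echo "nvidia module is loaded"
```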
Now run the following commands.
$ cd ~/NVIDIA_CUDA-7.0_Samples/1_Utilities/deviceQuery
$ make
$ ./deviceQuery
However, 'nvidia-smi' still shows no GPU activity while TensorFlow is training the model:
ubuntu@ip-10-0-1-48:~$ ipython
Python 2.7.11 |Anaconda custom (64-bit)| (default, Dec 6 2015, 18:08:32)
Type "copyright", "credits" or "license" for more information.

IPython 4.1.2 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: import tensorflow as tf
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.7.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.7.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.7.5 locally

ubuntu@ip-10-0-1-48:~$ nvidia-smi
Thu Mar 30 05:45:26 2017
+------------------------------------------------------+
| NVIDIA-SMI 346.46     Driver Version: 346.46         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
| N/A   35C    P0    38W / 125W |     10MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
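A single `nvidia-smi` snapshot can easily miss activity, and in the session above TensorFlow had only been imported when the snapshot was taken, which may be why no process is listed (this is a guess; the log does not show a session being created). To watch utilization over time, one option is to sample `nvidia-smi` once per second with its `--query-gpu` mode and summarize the samples. The sketch below assumes the driver's `nvidia-smi` supports `--query-gpu`; `peak_usage` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "utilization, memory" CSV samples on stdin
# (as printed by --format=csv,noheader,nounits) and report the peaks.
peak_usage() {
  awk -F', *' '
    {
      if ($1 + 0 > util) util = $1 + 0   # GPU utilization in percent
      if ($2 + 0 > mem)  mem  = $2 + 0   # memory in use, MiB
    }
    END { printf "peak_util=%d%% peak_mem=%dMiB\n", util, mem }
  '
}

# Example: sample once per second for 30 seconds while training runs in
# another terminal (timeout comes from GNU coreutils):
#   timeout 30 nvidia-smi \
#     --query-gpu=utilization.gpu,memory.used \
#     --format=csv,noheader,nounits -l 1 | peak_usage
```

If the peaks stay near the idle values shown in the table above during training, the model is probably not running on the GPU.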
On my ASUS laptop with a GTX 950m and Ubuntu 18.04, I solved "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver" by disabling Secure Boot Control in the BIOS.
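With Secure Boot enabled, the kernel may refuse to load unsigned modules, which would include a self-built NVIDIA module. Before rebooting into the BIOS, you can check the current state with `mokutil --sb-state` (assuming the `mokutil` package is installed). The wrapper below is a hypothetical sketch that reduces its output to a single word.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `mokutil --sb-state` output on stdin and print
# "enabled", "disabled", or "unknown".
secure_boot_state() {
  local out
  out="$(cat)"
  case "$out" in
    *"SecureBoot enabled"*)  echo enabled ;;
    *"SecureBoot disabled"*) echo disabled ;;
    *)                       echo unknown ;;   # e.g. non-EFI systems
  esac
}

# Example:
#   mokutil --sb-state 2>&1 | secure_boot_state
```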
I got the same error on Ubuntu 16.04 (Linux 4.14 kernel) in Google Compute Engine with a K80 GPU. I upgraded the kernel to 4.14 and the problem was solved. Here is how I upgraded my Linux kernel from 4.13 to 4.14:
Step 1: Check the existing kernel of your Ubuntu Linux:
uname -a

Step 2: Ubuntu maintains a website for all the kernel versions that have been released. At the time of this writing, the latest stable release of the Ubuntu kernel is 4.15. If you go to this link: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/, you will see several links for download.

Step 3: Download the appropriate files based on the type of OS you have. For 64 bit, I would download the following deb files:
wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500_4.15.0-041500.201802011154_all.deb
wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb
wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-image-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb

Step 4: Install all the downloaded deb files:
sudo dpkg -i *.deb

Step 5: Reboot your machine and check if the kernel has been updated by:
uname -a
You should see that your kernel has been upgraded, and hopefully nvidia-smi will now work.
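Reading the version out of `uname -a` by eye works, but in a provisioning script you may want an explicit comparison. The sketch below uses `sort -V` (version sort, available in GNU coreutils) because a plain string comparison would rank 4.4 above 4.15. `kernel_at_least` is a hypothetical helper, and the 4.15.0 threshold is taken from the packages downloaded in Step 3.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if kernel release $1 is at least version $2.
# Pass $2 with the same number of components as the release, e.g. 4.15.0.
kernel_at_least() {
  local have="${1%%-*}" want="$2"   # strip the "-041500-generic" style suffix
  # If $want sorts first (or ties) under version ordering, then have >= want.
  [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)" = "$want" ]
}

# Example:
#   kernel_at_least "$(uname -r)" 4.15.0 && echo "kernel is 4.15 or newer"
```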