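I have tried a range of different TensorFlow examples. They all work fine on the CPU, but produce the same error when I try to run them on the GPU. A small example is: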
    import tensorflow as tf

    # Creates a graph.
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)

    # Creates a session with log_device_placement set to True.
    sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

    # Runs the op.
    print(sess.run(c))
The error is always the same, CUDA_ERROR_OUT_OF_MEMORY:
    I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcublas.so.7.0 locally
    I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcudnn.so.6.5 locally
    I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcufft.so.7.0 locally
    I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcuda.so locally
    I tensorflow/stream_executor/dso_loader.cc:101] successfully opened CUDA library libcurand.so.7.0 locally
    I tensorflow/core/common_runtime/local_device.cc:40] Local device intra op parallelism threads: 24
    I tensorflow/core/common_runtime/gpu/gpu_init.cc:103] Found device 0 with properties:
    name: Tesla K80
    major: 3 minor: 7 memoryClockRate (GHz) 0.8235
    pciBusID 0000:0a:00.0
    Total memory: 11.25GiB
    Free memory: 105.73MiB
    I tensorflow/core/common_runtime/gpu/gpu_init.cc:103] Found device 1 with properties:
    name: Tesla K80
    major: 3 minor: 7 memoryClockRate (GHz) 0.8235
    pciBusID 0000:0b:00.0
    Total memory: 11.25GiB
    Free memory: 133.48MiB
    I tensorflow/core/common_runtime/gpu/gpu_init.cc:127] DMA: 0 1
    I tensorflow/core/common_runtime/gpu/gpu_init.cc:137] 0: Y Y
    I tensorflow/core/common_runtime/gpu/gpu_init.cc:137] 1: Y Y
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:702] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:0a:00.0)
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:702] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K80, pci bus id: 0000:0b:00.0)
    I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:42] Allocating 105.48MiB bytes.
    E tensorflow/stream_executor/cuda/cuda_driver.cc:932] failed to allocate 105.48M (110608384 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
    F tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:47] Check failed: gpu_mem != nullptr Could not allocate GPU device memory for device 0. Tried to allocate 105.48MiB
    Aborted (core dumped)
I suspect the problem lies in my configuration rather than in the memory usage of this tiny example. Does anyone have an idea?
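Edit:
I have found that the cause may be as simple as someone else running a job on the same GPU, which would explain the small amount of free memory. In that case: sorry for taking up your time...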
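There appear to be two issues here:
1. By default, TensorFlow allocates a large fraction (95%) of the available GPU memory (on each GPU device) when you create a tf.Session. It uses a heuristic that reserves 200MB of GPU memory for "system" uses, but it does not set this amount aside if the amount of free memory is smaller than that.
2. It looks like you have very little free GPU memory on either of your GPU devices (105.73MiB and 133.48MiB). This means that TensorFlow will attempt to allocate memory that should probably be reserved for the system, and so the allocation fails.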
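Is it possible that another TensorFlow process (or some other GPU-hungry code) is running while you attempt to run this program? For example, a Python interpreter with an open session, even if it is not using the GPU, will attempt to allocate almost the entire GPU memory.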
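One way to check for this before creating a session is to ask the NVIDIA driver how much memory is actually free on each device. The sketch below is only an illustration and assumes that the `nvidia-smi` command-line tool is installed and on the PATH. The helper names (`parse_gpu_memory`, `query_gpu_memory`, `gpus_below_threshold`) are made up for this example, and the 200 MiB default threshold is simply borrowed from the "system" reservation mentioned above.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse rows of 'index, memory.free, memory.total' (values in MiB)."""
    gpus = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, free_mib, total_mib = [field.strip() for field in line.split(",")]
        gpus.append({
            "index": int(index),
            "free_mib": int(free_mib),
            "total_mib": int(total_mib),
        })
    return gpus


def query_gpu_memory():
    """Ask nvidia-smi for per-GPU memory; return [] if the tool is unavailable."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=index,memory.free,memory.total",
        "--format=csv,noheader,nounits",
    ]
    try:
        output = subprocess.check_output(cmd)
    except (OSError, subprocess.CalledProcessError):
        return []
    return parse_gpu_memory(output.decode("utf-8"))


def gpus_below_threshold(gpus, min_free_mib=200):
    """Return the indices of GPUs with less free memory than min_free_mib.

    The 200 MiB default is an assumption taken from the reservation
    described in the answer, not a value read from TensorFlow itself.
    """
    return [gpu["index"] for gpu in gpus if gpu["free_mib"] < min_free_mib]
```

Calling `gpus_below_threshold(query_gpu_memory())` before `tf.Session(...)` would list the devices that are probably already occupied by another process. Running plain `nvidia-smi` in a shell shows the same information along with the processes that hold the memory.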
Currently, the only way to restrict the amount of GPU memory that TensorFlow uses is the following configuration option (from this question):
    # Assume that you have 12GB of GPU memory and want to allocate ~4GB:
    gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
    sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
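The 0.333 in that snippet is simply the desired amount divided by the total (4 GB out of 12 GB). As a small sketch, a helper like the one below could compute the fraction for other sizes. The name `memory_fraction` is hypothetical and not part of TensorFlow, and the code only does the arithmetic.

```python
def memory_fraction(desired_gib, total_gib):
    """Return desired_gib / total_gib, the share of GPU memory to request.

    Hypothetical helper for illustration; it validates its inputs so the
    result always lies in the interval (0, 1].
    """
    if total_gib <= 0:
        raise ValueError("total_gib must be positive")
    if not 0 < desired_gib <= total_gib:
        raise ValueError("desired_gib must be in (0, total_gib]")
    return desired_gib / float(total_gib)


# Same scenario as the snippet above: ~4GB out of 12GB.
fraction = memory_fraction(4.0, 12.0)

# The result would then be passed to the option shown in the answer:
#   gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=fraction)
#   sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
```

Note that the fraction is relative to the total memory of the device, so it only helps if that much memory is actually free when the session is created.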