Why can't I run a tensorflow session on CPU while one GPU device's memory is fully allocated?

From the tensorflow website (https://www.tensorflow.org/guide/using_gpu) I found the following code, which manually specifies use of the CPU instead of a GPU:

import tensorflow as tf

# Creates a graph.
with tf.device('/cpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(c))

I tried running this on my machine (which has 4 GPUs) and got the following error:

2018-11-05 10:02:30.636733: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1392] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:18:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-11-05 10:02:30.863280: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1392] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:3b:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-11-05 10:02:31.117729: E tensorflow/core/common_runtime/direct_session.cc:158] Internal: failed initializing StreamExecutor for CUDA device ordinal 2: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY; total memory reported: 11721506816
Traceback (most recent call last):
  File "./tf_test.py", line 10, in <module>
    sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
  File ".../anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1566, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File ".../anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 636, in __init__
    self._session = tf_session.TF_NewSession(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.

It seems that when I create the session, tensorflow tries to initialize a StreamExecutor on every device. Unfortunately, my colleague is using one of the GPUs right now. I would hope that his full use of one GPU would not prevent me from using another device (whether GPU or CPU), but that does not seem to be the case.

Does anyone know a workaround for this? Perhaps something to add to the config? Is this something that could be fixed in tensorflow?

FYI, here is the output of "gpustat -upc":

<my_hostname>  Mon Nov  5 10:19:47 2018
[0] GeForce GTX 1080 Ti | 36'C,   0 % |    10 / 11178 MB |
[1] GeForce GTX 1080 Ti | 41'C,   0 % |    10 / 11178 MB |
[2] GeForce GTX 1080 Ti | 38'C,   0 % | 11097 / 11178 MB | <my_colleague>:python2/148901(11087M)
[3] GeForce GTX 1080 Ti | 37'C,   0 % |    10 / 11178 MB |

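1 Answer

OK, so with the help of my colleague, I have a workable solution. The key is, in fact, a modification to the config. Specifically, something like this:

config.gpu_options.visible_device_list = '0'

This ensures that tensorflow sees only GPU 0.

In fact, I was able to run the following: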
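Rather than typing the list by hand, it could in principle be derived from the gpustat output shown in the question. The following is only a minimal sketch: the helper name `free_gpu_list`, the `max_used_mb` parameter, and the 100 MB threshold are hypothetical choices for illustration, and the regular expression assumes rows formatted exactly like the `gpustat -upc` sample above.

```python
import re

# Matches gpustat-style rows such as:
# [2] GeForce GTX 1080 Ti | 38'C,   0 % | 11097 / 11178 MB | user:python2/148901(11087M)
_ROW = re.compile(r"^\[(\d+)\].*?\|\s*(\d+)\s*/\s*(\d+)\s*MB\s*\|")


def free_gpu_list(gpustat_text, max_used_mb=100):
    """Return a comma-separated list of GPUs using at most max_used_mb of memory."""
    free = []
    for line in gpustat_text.splitlines():
        match = _ROW.match(line.strip())
        if match is None:
            continue  # header line, or anything else that is not a GPU row
        index, used_mb = int(match.group(1)), int(match.group(2))
        if used_mb <= max_used_mb:
            free.append(str(index))
    return ",".join(free)


# The GPU rows from the question, copied verbatim.
SAMPLE = """\
[0] GeForce GTX 1080 Ti | 36'C,   0 % |    10 / 11178 MB |
[1] GeForce GTX 1080 Ti | 41'C,   0 % |    10 / 11178 MB |
[2] GeForce GTX 1080 Ti | 38'C,   0 % | 11097 / 11178 MB | <my_colleague>:python2/148901(11087M)
[3] GeForce GTX 1080 Ti | 37'C,   0 % |    10 / 11178 MB |
"""

print(free_gpu_list(SAMPLE))  # prints 0,1,3
```

The returned string has the same form as the value assigned to `config.gpu_options.visible_device_list`. Note that a GPU that looks free when gpustat runs may be taken by the time the session is created, so this only narrows the window rather than closing it.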

#!/usr/bin/env python

import tensorflow as tf

with tf.device('/gpu:2'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)
# Creates a session with log_device_placement set to True.
config = tf.ConfigProto(log_device_placement=True)
config.gpu_options.visible_device_list = '0,1,3'
sess = tf.Session(config=config)
# Runs the op.
print(sess.run(c))

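Note that this code actually specifies running on GPU 2 (which, you may remember, is the one that is full). This is an important point: the GPUs are renumbered according to visible_device_list, so in the code above, '/gpu:2' refers to the 3rd GPU in the list ('0,1,3'), which is actually physical GPU 3. This can bite you if you try this: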

#!/usr/bin/env python

import tensorflow as tf

with tf.device('/gpu:1'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)
# Creates a session with log_device_placement set to True.
config = tf.ConfigProto(log_device_placement=True)
config.gpu_options.visible_device_list = '1'
sess = tf.Session(config=config)
# Runs the op.
print(sess.run(c))

The problem is that it looks for the 2nd GPU in the list, but the visible list contains only one GPU. The error you get is as follows:

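InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'a': Operation was explicitly assigned to /device:GPU:1 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0 ]. Make sure the device specification refers to a valid device.
     [[Node: a = Const[dtype=DT_FLOAT, value=Tensor<...>, _device="/device:GPU:1"]()]]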
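The renumbering can be modeled in a few lines of plain Python. This is only an illustration of the mapping as described above, not tensorflow's own code; `physical_gpu` is a hypothetical helper name, and it needs no tensorflow install to run.

```python
def physical_gpu(visible_device_list, logical_index):
    """Map the N in '/gpu:N' to the physical GPU id it refers to."""
    physical_ids = [int(part) for part in visible_device_list.split(",") if part.strip()]
    if not 0 <= logical_index < len(physical_ids):
        raise ValueError(
            "/gpu:%d is not available; only %d GPU(s) visible: %s"
            % (logical_index, len(physical_ids), physical_ids)
        )
    return physical_ids[logical_index]


print(physical_gpu("0,1,3", 2))  # prints 3
print(physical_gpu("1", 0))      # prints 1
```

Calling `physical_gpu("1", 1)` raises a `ValueError`, which mirrors the failing script: with a single visible GPU, the only valid logical device is '/gpu:0'.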

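It seems odd to me that I have to specify a GPU list when I want to run on the CPU. I tried using an empty list and it failed, so if all 4 GPUs were in use, I would have no workaround. Does anyone else have a better idea?

Answered 2024-05-02T23:52:54+08:00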
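One thing that may be worth trying for the CPU-only case is the CUDA environment variable `CUDA_VISIBLE_DEVICES`, which controls which GPUs a process can see at all; setting it to an empty string hides all of them. The sketch below is an assumption on my part and has not been verified on the machine described here. The function names `hide_all_gpus` and `run_matmul_on_cpu` are hypothetical, and the session code uses the same TensorFlow 1.x API as the question.

```python
import os


def hide_all_gpus():
    """Hide every CUDA device from this process; call before importing tensorflow."""
    os.environ["CUDA_VISIBLE_DEVICES"] = ""


def run_matmul_on_cpu():
    """Run the matmul from the question on the CPU (TensorFlow 1.x API)."""
    hide_all_gpus()
    import tensorflow as tf  # imported only after the variable is set

    with tf.device('/cpu:0'):
        a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
        b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
        c = tf.matmul(a, b)
    # Creates a session with log_device_placement set to True.
    with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
        return sess.run(c)
```

The same effect can be had from the shell by setting the variable for a single command, for example `CUDA_VISIBLE_DEVICES="" ./tf_test.py`. If this works, no GPU should be initialized when the session is created, so a fully occupied GPU should not matter.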
