I have one PC with 2 GPUs and want to train 2 independent CNNs, one on each GPU. I build the graph for a GPU as follows:
with tf.device('/gpu:%d' % self.single_gpu):
    self._create_placeholders()
    self._build_conv_net()
    self._create_cost()
    self._creat_optimizer()
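The training loop is not under tf.device().
I start the first CNN training process, for example on GPU 1. Then I start a second CNN training process on GPU 0. I always get a CUDA_ERROR_OUT_OF_MEMORY error, and the second training process cannot start.
Is it possible to run 2 independent training tasks, assigned to 2 GPUs, on the same PC? If so, what am I missing?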
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 164.06M (172032000 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
W tensorflow/core/common_runtime/bfc_allocator.cc:274] ******* ____ ****************** _______________________________________________________________________
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 384.00MiB. See logs for memory state.
Traceback (most recent call last):
File "/home/hl/anaconda3/envs/dl-conda-py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1022, in _do_call
return fn(*args)
File "/home/hl/anaconda3/envs/dl-conda-py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1004, in _run_fn
status, run_metadata)
File "/home/hl/anaconda3/envs/dl-conda-py36/lib/python3.6/contextlib.py", line 89, in __exit__
next(self.gen)
File "/home/hl/anaconda3/envs/dl-conda-py36/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: Dst tensor is not initialized.
[[Node: _recv_inputs/input_placeholder_0/_7 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:2", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_3__recv_inputs/input_placeholder_0", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:2"]()]]
[[Node: Mean/_15 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:2", send_device_incarnation=1, tensor_name="edge_414_Mean", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "mg_model_nvidia_gpu.py", line 491, in <module>
main()
File "mg_model_nvidia_gpu.py", line 482, in main
nvidia_cnn.train(data_generator, train_data, val_data)
File "mg_model_nvidia_gpu.py", line 307, in train
self.keep_prob: self.train_config.keep_prob})
File "/home/hl/anaconda3/envs/dl-conda-py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 767, in run
run_metadata_ptr)
File "/home/hl/anaconda3/envs/dl-conda-py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 965, in _run
feed_dict_string, options, run_metadata)
File "/home/hl/anaconda3/envs/dl-conda-py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1015, in _do_run
target_list, options, run_metadata)
File "/home/hl/anaconda3/envs/dl-conda-py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1035, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Dst tensor is not initialized.
[[Node: _recv_inputs/input_placeholder_0/_7 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:2", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_3__recv_inputs/input_placeholder_0", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:2"]()]]
[[Node: Mean/_15 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:2", send_device_incarnation=1, tensor_name="edge_414_Mean", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
1 Answer
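By default, TensorFlow pre-allocates nearly all of the memory on every GPU device it has access to, not only on the one named in tf.device(). The first process therefore also claims the memory of the other card, and no memory is left for the second process.
You can control this allocation through config.gpu_options, or you can give each of your two processes a different card by setting os.environ["CUDA_VISIBLE_DEVICES"].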
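A minimal sketch of how the two options above could be combined, assuming the TensorFlow 1.x API that appears in the traceback (tf.ConfigProto, tf.Session). The helper names pin_process_to_gpu and make_session are hypothetical and not part of TensorFlow or of the question's code; allow_growth and per_process_gpu_memory_fraction are fields of config.gpu_options.

```python
import os


def pin_process_to_gpu(gpu_index):
    """Hypothetical helper: expose a single physical GPU to this process.

    The variable is read when CUDA is initialized, so this should run
    before TensorFlow is imported in the process.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_index)


def make_session(memory_fraction=None):
    """Hypothetical helper: build a TF 1.x session with limited GPU memory.

    With memory_fraction=None the session grows its allocation on demand;
    otherwise it is capped at the given fraction of each visible GPU.
    """
    # Imported here, after CUDA_VISIBLE_DEVICES has been set.
    import tensorflow as tf

    config = tf.ConfigProto()
    if memory_fraction is None:
        config.gpu_options.allow_growth = True
    else:
        config.gpu_options.per_process_gpu_memory_fraction = memory_fraction
    return tf.Session(config=config)
```

One training process would call pin_process_to_gpu(1) and the other pin_process_to_gpu(0), each followed by make_session(). Note that inside each process the single visible card is renumbered, so the device string passed to tf.device() would be '/gpu:0' in both.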