
Keras multi-GPU example gives ResourceExhaustedError


So I am trying to use multiple GPUs with Keras. When I run the example program from training_utils.py (it is given as a comment in the training_utils.py code), I end up with a ResourceExhaustedError. nvidia-smi tells me that only one of the four GPUs is working. Using a single GPU works fine for other programs. (A quick check of which devices TensorFlow itself sees is sketched after the version list.)

  • TensorFlow 1.3.0

  • Keras 2.0.8

  • Ubuntu 16.04

  • CUDA 8.0 / cuDNN 6.0
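Since nvidia-smi shows only one busy GPU, a quick sanity check (a minimal sketch, separate from the failing script) is to ask TensorFlow itself which devices it can see; with four GPUs available, four /gpu:N entries should be listed:

# Minimal device-visibility check for TensorFlow 1.x: all four GPUs
# should appear as /gpu:0 ... /gpu:3 in the printed list.
from tensorflow.python.client import device_lib
print([d.name for d in device_lib.list_local_devices()])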

Question: Does anyone have any idea what is going on here?

Console output:

(......)

2017-10-26 14:39:02.086838: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ****************************************************************************************************x
2017-10-26 14:39:02.086857: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[128,55,55,256]
Traceback (most recent call last):
  File "test.py", line 27, in <module>
    parallel_model.fit(x, y, epochs=20, batch_size=256)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/training.py", line 1631, in fit
    validation_steps=validation_steps)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/training.py", line 1213, in _fit_loop
    outs = f(ins_batch)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 2331, in __call__
    **self.session_kwargs)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1124, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
    options, run_metadata)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[128,55,55,256]
	 [[Node: replica_1/xception/block3_sepconv2/separable_conv2d = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:1"](replica_1/xception/block3_sepconv2/separable_conv2d/depthwise, block3_sepconv2/pointwise_kernel/read/_2103)]]
	 [[Node: training/RMSprop/gradients/replica_0/xception/block10_sepconv2/separable_conv2d_grad/Conv2DBackpropFilter/_4511 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_25380_training/RMSprop/gradients/replica_0/xception/block10_sepconv2/separable_conv2d_grad/Conv2DBackpropFilter", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]]

Caused by op u'replica_1/xception/block3_sepconv2/separable_conv2d', defined at:
  File "test.py", line 19, in <module>
    parallel_model = multi_gpu_model(model, gpus=2)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/utils/training_utils.py", line 143, in multi_gpu_model
    outputs = model(inputs)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/topology.py", line 603, in __call__
    output = self.call(inputs, **kwargs)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/topology.py", line 2061, in call
    output_tensors, _, _ = self.run_internal_graph(inputs, masks)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/topology.py", line 2212, in run_internal_graph
    output_tensors = _to_list(layer.call(computed_tensor, **kwargs))
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/layers/convolutional.py", line 1221, in call
    dilation_rate=self.dilation_rate)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 3279, in separable_conv2d
    data_format=tf_data_format)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/nn_impl.py", line 497, in separable_conv2d
    name=name)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 397, in conv2d
    data_format=data_format, name=name)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[128,55,55,256]
	 [[Node: replica_1/xception/block3_sepconv2/separable_conv2d = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:1"](replica_1/xception/block3_sepconv2/separable_conv2d/depthwise, block3_sepconv2/pointwise_kernel/read/_2103)]]
	 [[Node: training/RMSprop/gradients/replica_0/xception/block10_sepconv2/separable_conv2d_grad/Conv2DBackpropFilter/_4511 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_25380_training/RMSprop/gradients/replica_0/xception/block10_sepconv2/separable_conv2d_grad/Conv2DBackpropFilter", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]]

EDIT (Added example code):

import tensorflow as tf
from keras.applications import Xception
from keras.utils import multi_gpu_model
import numpy as np

num_samples = 1000
height = 224
width = 224
num_classes = 100

with tf.device('/cpu:0'):
    # Instantiate the base model on the CPU so its weights live in host memory.
    model = Xception(weights=None,
                     input_shape=(height, width, 3),
                     classes=num_classes)

# Replicate the model on 4 GPUs; each replica gets a slice of every batch.
parallel_model = multi_gpu_model(model, gpus=4)
parallel_model.compile(loss='categorical_crossentropy',
                       optimizer='rmsprop')

# Generate dummy data.
x = np.random.random((num_samples, height, width, 3))
y = np.random.random((num_samples, num_classes))

# This `fit` call is distributed across 4 GPUs, each processing 32 samples.
parallel_model.fit(x, y, epochs=20, batch_size=128)

1 Answer


    When you run into an OOM/ResourceExhaustedError on a GPU, I believe changing (reducing) the batch size is the right option to try first.

    Different GPUs may need different batch sizes, depending on how much GPU memory they have.

    I recently faced a similar problem and did a lot of tweaking for different kinds of experiments.

    Here is the link to the question (some tricks are included too).

    However, while reducing the batch size, you may find that your training gets slower. A minimal sketch of the batch-size fix is given below.
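    As a sketch of that suggestion (reusing parallel_model, x and y from the example code above; per_gpu_batch is a hypothetical starting value, not from the original answer): multi_gpu_model splits every batch evenly across its replicas, so shrinking the global batch size shrinks the per-GPU activations that triggered the OOM.

    # Hypothetical fix: shrink the global batch so each of the 4 replicas
    # receives a smaller sub-batch (multi_gpu_model splits batches across GPUs).
    gpus = 4
    per_gpu_batch = 16  # assumed starting point; tune against your GPU memory
    parallel_model.fit(x, y, epochs=20, batch_size=per_gpu_batch * gpus)

    If a global batch of 64 still runs out of memory, halve per_gpu_batch again; the cost is the slower training mentioned above.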
