
OOM error when running the resnet model with TensorFlow


I am running the resnet model from https://github.com/tensorflow/models/blob/master/resnet/resnet_main.py on an EC2 g2 (NVIDIA GRID K520) instance and I am seeing OOM errors. I have tried various combinations: removing the code that places ops on the GPU, prefixing the command with CUDA_VISIBLE_DEVICES='0', and reducing batch_size to 64. I still cannot start training. Can you help me?
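
Roughly what I tried, as a minimal sketch (not my exact launch script; the batch size ultimately lives in the HParams that resnet_main.py builds):

    # Minimal sketch of the two mitigations mentioned above, not my exact script.
    # CUDA_VISIBLE_DEVICES must be set before TensorFlow touches the GPU, i.e.
    # before the first `import tensorflow` (or exported in the shell before
    # launching resnet_main.py).
    import os
    os.environ['CUDA_VISIBLE_DEVICES'] = '0'   # expose only one of the GRID K520 GPUs

    import tensorflow as tf

    # batch_size is one of the HParams fields constructed in resnet_main.py;
    # 64 is the reduced value mentioned above.
    BATCH_SIZE = 64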

W tensorflow/core/common_runtime/bfc_allocator.cc:270] **********************x***************************************************************xx
W tensorflow/core/common_runtime/bfc_allocator.cc:271] Ran out of memory trying to allocate 196.00MiB.  See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:936] Resource exhausted: OOM when allocating tensor with shape[64,16,224,224]
E tensorflow/core/client/tensor_c_api.cc:485] OOM when allocating tensor with shape[64,16,224,224]
     [[Node: unit_1_2/sub1/conv1/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](unit_1_2/residual_only_activation/leaky_relu, unit_1_2/sub1/conv1/DW/read)]]
     [[Node: train_step/update/_1561 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_10115_train_step/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]]
Traceback (most recent call last):
  File "./resnet_main.py", line 203, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
    sys.exit(main(sys.argv))
  File "./resnet_main.py", line 197, in main
    train(hps)
  File "./resnet_main.py", line 82, in train
    feed_dict={model.lrn_rate: lrn_rate})
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 382, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 655, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 723, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 743, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.ResourceExhaustedError: OOM when allocating tensor with shape[64,16,224,224]
     [[Node: unit_1_2/sub1/conv1/Conv2D = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:0"](unit_1_2/residual_only_activation/leaky_relu, unit_1_2/sub1/conv1/DW/read)]]
     [[Node: train_step/update/_1561 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_10115_train_step/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]]
Caused by op u'unit_1_2/sub1/conv1/Conv2D', defined at:
  File "./resnet_main.py", line 203, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
    sys.exit(main(sys.argv))
  File "./resnet_main.py", line 197, in main
    train(hps)
  File "./resnet_main.py", line 64, in train
    model.build_graph()
  File "/home/ubuntu/indu/tf-benchmark/resnet/resnet_model.py", line 59, in build_graph
    self._build_model()
  File "/home/ubuntu/indu/tf-benchmark/resnet/resnet_model.py", line 94, in _build_model
    x = res_func(x, filters[1], filters[1], self._stride_arr(1), False)
  File "/home/ubuntu/indu/tf-benchmark/resnet/resnet_model.py", line 208, in _residual
    x = self._conv('conv1', x, 3, in_filter, out_filter, stride)
  File "/home/ubuntu/indu/tf-benchmark/resnet/resnet_model.py", line 279, in _conv
    return tf.nn.conv2d(x, kernel, strides, padding='SAME')
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 394, in conv2d
    data_format=data_format, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 703, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2310, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1232, in __init__
    self._traceback = _extract_stack()
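
(For context: the 196.00MiB that the allocator could not find is exactly the size of the tensor named in the OOM message, assuming float32 as indicated by T=DT_FLOAT.)

    # Quick arithmetic check that the failed 196.00MiB allocation matches the
    # reported tensor shape [64, 16, 224, 224] at 4 bytes per float32 element.
    num_bytes = 64 * 16 * 224 * 224 * 4
    print(num_bytes / (1024.0 ** 2))   # -> 196.0 (MiB)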

1 Answer


    The NVIDIA GRID K520 has 8GB of memory (link). I have successfully trained ResNet models on an NVIDIA GPU with 12GB of memory. As the error shows, TensorFlow tried to fit all of the network's weights into GPU memory and failed. I believe you have a few options:

    • Train on the CPU only, as mentioned in the comments, assuming your machine has more than 8GB of CPU memory (a minimal sketch follows this list). This is not recommended.

    • Train a different network with fewer parameters. Several networks have been published since ResNet, e.g. Inception-v4 and Inception-ResNet, with fewer parameters and higher accuracy. This option costs you nothing!

    • Buy a GPU with more memory. The easiest option, if you have the money.

    • Buy a second GPU with the same amount of memory and train the lower half of the network as one part and the upper half as the other. The difficulty of communicating between GPUs makes this option less than ideal.
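
    If you go with the first option, here is a rough CPU-only sketch. The tiny conv graph is just a stand-in for model.build_graph() from resnet_model.py, and tf.initialize_all_variables matches the older TensorFlow in your traceback (newer releases use tf.global_variables_initializer):

        # Rough sketch of CPU-only training (option 1). Hiding the GPU keeps
        # TensorFlow from placing anything on the K520; tf.device makes the
        # intent explicit. Expect this to be very slow for a full ResNet.
        import os
        os.environ['CUDA_VISIBLE_DEVICES'] = ''    # hide all GPUs from TensorFlow

        import numpy as np
        import tensorflow as tf

        with tf.device('/cpu:0'):
            x = tf.placeholder(tf.float32, [64, 32, 32, 3])                  # stand-in input batch
            w = tf.Variable(tf.truncated_normal([3, 3, 3, 16], stddev=0.1))  # stand-in conv kernel
            y = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME')

        with tf.Session() as sess:
            sess.run(tf.initialize_all_variables())
            out = sess.run(y, feed_dict={x: np.zeros((64, 32, 32, 3), np.float32)})
            print(out.shape)   # (64, 32, 32, 16) -- everything ran on the CPU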

    I hope this helps you and anyone else who runs into similar memory problems.
