I have a question about distributed TensorFlow. To understand its behavior, I wrote the code at the following link (https://gist.github.com/wsjeon/12bded1e3c4f81c775622f72e74c007b). I have two questions.

  • The code above sometimes fails and sometimes works. When it fails, I get the following error message:
Traceback (most recent call last):
  File "main.py", line 39, in <module>
    _, step = sess.run([assign_op, global_step])
  File "/home/wsjeon/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 767, in run
    run_metadata_ptr)
  File "/home/wsjeon/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 965, in _run
    feed_dict_string, options, run_metadata)
  File "/home/wsjeon/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1015, in _do_run
    target_list, options, run_metadata)
  File "/home/wsjeon/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1035, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: /job:worker/replica:0/task:0/gpu:0 unknown device.
     [[Node: local/add_S3 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:0/cpu:0", send_device="/job:worker/replica:0/task:0/gpu:0", send_device_incarnation=4764918041242746699, tensor_name="edge_7_local/add", tensor_type=DT_INT32, _device="/job:ps/replica:0/task:0/cpu:0"]()]]

I can't figure out why this happens.

  • I use two "workers" to update a global counter. However, I found that some of the numbers are duplicated. How can I fix this?
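
To make the second question concrete, here is a minimal plain-Python sketch (not TensorFlow) of how I think duplicates could arise. It assumes the cause is that `sess.run([assign_op, global_step])` reads `global_step` separately from the increment, so another worker can increment in between. `SharedCounter`, `separate_read`, and `returned_value` are hypothetical names made up for this illustration, and the interleaving is forced by hand rather than produced by real workers.

```python
class SharedCounter:
    """Hypothetical stand-in for a variable hosted on the parameter server."""

    def __init__(self):
        self.value = 0

    def increment(self):
        # Increment and return the updated value as one operation.
        self.value += 1
        return self.value

    def read(self):
        # A separate read; other workers may have incremented in between.
        return self.value


def separate_read(counter, num_workers=2):
    # Forced worst-case interleaving: every worker increments
    # before any worker reads the counter back.
    for _ in range(num_workers):
        counter.increment()
    return [counter.read() for _ in range(num_workers)]


def returned_value(counter, num_workers=2):
    # Each worker keeps the value returned by its own increment.
    return [counter.increment() for _ in range(num_workers)]


steps_separate = separate_read(SharedCounter())
steps_returned = returned_value(SharedCounter())
print(steps_separate)  # [2, 2]: both workers see the same step
print(steps_returned)  # [1, 2]: each worker sees a distinct step
```

If this assumption is right, the TensorFlow analogue would be fetching only the tensor returned by the increment, e.g. `step = sess.run(assign_op)` with `assign_op` created by `tf.assign_add`, possibly with `use_locking=True`. I have not confirmed that this removes all duplicates in my setup.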