I set up distributed TensorFlow with one ps and two workers, following the example at https://www.tensorflow.org/versions/r0.10/how_tos/distributed/. The machine I am testing on has only one CPU with 8 cores.
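For context, the three processes would presumably be launched along the following lines; the script name comes from the stack trace below and the flags and ports from the training code, so the exact invocation is an assumption:

# Presumed launch commands, one process per role (adjust to the actual invocation)
python server_client_updated.py --job_name=ps --task_index=0      # ps on localhost:2222
python server_client_updated.py --job_name=worker --task_index=0  # worker on localhost:2223
python server_client_updated.py --job_name=worker --task_index=1  # worker on localhost:2224

With this setup, the worker fails with the following error: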

InvalidArgumentError (see above for traceback): Cannot assign a device to node 'save/RestoreV2_14': Could not satisfy explicit device specification '/job:ps/task:0/device:CPU:0' because no devices matching that specification are registered in this process; available devices: /job:local/replica:0/task:0/cpu:0, /job:local/replica:0/task:1/cpu:0, /job:worker/replica:0/task:1/cpu:0
 [[Node: save/RestoreV2_14 = RestoreV2[dtypes=[DT_INT32], _device="/job:ps/task:0/device:CPU:0"](save/Const, save/RestoreV2_14/tensor_names, save/RestoreV2_14/shape_and_slices)]]

I am already passing server.target as an argument to sv.prepare_or_wait_for_session:

sess = sv.prepare_or_wait_for_session(server.target)

What could be causing this?

My training code is:

import argparse
import tensorflow as tf

parser = argparse.ArgumentParser(description='tensorflow')
parser.add_argument('--job_name', dest='job_name')
parser.add_argument('--task_index', dest='task_index', default=0)
args = parser.parse_args()

ps_hosts = ['localhost:2222']
worker_hosts = ['localhost:2223', 'localhost:2224']
job_name = args.job_name
task_index = int(args.task_index)

# Create a cluster from the parameter server and worker hosts.
cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})

# Create and start a server for the local task.
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)
if job_name == "ps":
    server.join()

elif job_name == "worker":
    with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % task_index,
        cluster=cluster)):
        total_input_features = len(train_x[0])
        x = tf.placeholder('float', [None, total_input_features])
        y = tf.placeholder('float')
        global_step = tf.Variable(0, name="global_step", trainable=False)
        is_chief = (task_index == 0)
        prediction = neural_network_model(x, total_input_features, n_nodes_hl1,
                                          first_layer_activation,
                                          n_nodes_hl2,
                                          second_layer_activation)
        total_loss = tf.reduce_mean(tf.square(prediction - y))
        optimizer = tf.train.AdamOptimizer()
        train_op = optimizer.minimize(total_loss, global_step=global_step)

        init_op = tf.initialize_all_variables()

        sv = tf.train.Supervisor(
            is_chief=is_chief,
            logdir="/tmp/train_logs",
            init_op=init_op,
            global_step=global_step)

        print '******** ALL CREATED ********'


        # The supervisor takes care of session initialization, restoring from
        # a checkpoint, and closing when done or an error occurs.

        with sv.managed_session(server.target) as sess:

            # Loop until the supervisor shuts down or 1000000 steps have completed.
            step = 0
            while not sv.should_stop() and step < 1000000:
                # Run a training step asynchronously.
                # See `tf.train.SyncReplicasOptimizer` for additional details on how to
                # perform *synchronous* training.

                train_feed = {x: train_x, y: train_y}
                _, step = sess.run([train_op, global_step], feed_dict=train_feed)
                if step % 100 == 0:
                    print "Done step %d" % step

        sv.stop()

The full stack trace is:

Traceback (most recent call last):
  File "..../PycharmProjects/SparkProject/server_client_updated.py", line 162, in <module>
    with sv.managed_session(server.target) as sess:
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "..../tensorflow/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 973, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "..../tensorflow/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 801, in stop
    stop_grace_period_secs=self._stop_grace_secs)
  File "..../tensorflow/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 386, in join
    six.reraise(*self._exc_info_to_raise)
  File "..../tensorflow/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 962, in managed_session
    start_standard_services=start_standard_services)
  File "..../tensorflow/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 726, in prepare_or_wait_for_session
    max_wait_secs=max_wait_secs)
  File "..../tensorflow/lib/python2.7/site-packages/tensorflow/python/training/session_manager.py", line 384, in wait_for_session
    sess)
  File "..../tensorflow/lib/python2.7/site-packages/tensorflow/python/training/session_manager.py", line 467, in _try_run_local_init_op
    sess.run(self._local_init_op)
  File "..../tensorflow/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 767, in run
    run_metadata_ptr)
  File "/Users/jabermo/tensorflow/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 965, in _run
    feed_dict_string, options, run_metadata)
  File "..../tensorflow/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1015, in _do_run
    target_list, options, run_metadata)
  File "..../tensorflow/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1035, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device to node 'save/RestoreV2_14': Could not satisfy explicit device specification '/job:ps/task:0/device:CPU:0' because no devices matching that specification are registered in this process; available devices: /job:local/replica:0/task:0/cpu:0, /job:local/replica:0/task:1/cpu:0, /job:worker/replica:0/task:1/cpu:0
 [[Node: save/RestoreV2_14 = RestoreV2[dtypes=[DT_INT32], _device="/job:ps/task:0/device:CPU:0"](save/Const, save/RestoreV2_14/tensor_names, save/RestoreV2_14/shape_and_slices)]]

Caused by op u'save/RestoreV2_14', defined at:
  File "..../PycharmProjects/SparkProject/server_client_updated.py", line 127, in <module>
    global_step=global_step)
  File "..../tensorflow/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 313, in __init__
    self._init_saver(saver=saver)
  File "..../tensorflow/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 459, in _init_saver
    saver = saver_mod.Saver()
  File "..../tensorflow/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1051, in __init__
    self.build()
  File "..../tensorflow/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1081, in build
    restore_sequentially=self._restore_sequentially)
  File "..../tensorflow/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 675, in build
    restore_sequentially, reshape)
  File "..../tensorflow/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 402, in _AddRestoreOps
    tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
  File "..../tensorflow/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 242, in restore_op
    [spec.tensor.dtype])[0])
  File "..../tensorflow/lib/python2.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 668, in restore_v2
    dtypes=dtypes, name=name)
  File "..../tensorflow/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
    op_def=op_def)
  File "..../tensorflow/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2395, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "..../tensorflow/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1264, in __init__
    self._traceback = extract_stack()

InvalidArgumentError (see above for traceback): Cannot assign a device to node 'save/RestoreV2_14': Could not satisfy explicit device specification '/job:ps/task:0/device:CPU:0' because no devices matching that specification are registered in this process; available devices: /job:local/replica:0/task:0/cpu:0, /job:local/replica:0/task:1/cpu:0, /job:worker/replica:0/task:1/cpu:0
 [[Node: save/RestoreV2_14 = RestoreV2[dtypes=[DT_INT32], _device="/job:ps/task:0/device:CPU:0"](save/Const, save/RestoreV2_14/tensor_names, save/RestoreV2_14/shape_and_slices)]]