I am using two Ubuntu servers to run distributed TensorFlow. TensorFlow 0.8.0 is installed on each server.

I first start the ps server on server1:

```
ubuntu@i-mxdcqm20:/data1T5/org_models/inception$ sudo bazel-bin/inception/imagenet_distributed_train \
--job_name='ps' \
--task_id=0 \
--ps_hosts='43.254.55.221:2222' \
--worker_hosts='61.160.41.85:2222'
```

The log shows:

    INFO:tensorflow:PS hosts are: ['43.254.55.221:2222']
    INFO:tensorflow:Worker hosts are: ['61.160.41.85:2222']
    I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:206] Initialize HostPortsGrpcChannelCache for job ps -> {localhost:2222}
    I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:206] Initialize HostPortsGrpcChannelCache for job worker -> {61.160.41.85:2222}
    I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:202] Started server with target: grpc://localhost:2222

When I run `sudo netstat -tunlp`, the server is indeed listening on port 2222:

    tcp6 0 0 :::2222 :::* LISTEN 3525/python

But when I start the worker on server2, it still reports that it cannot connect:

    E0722 10:35:01.142377237 4045 tcp_client_posix.c:191] failed to connect to 'ipv4:43.254.55.221:2222': timeout occurred
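To separate a network problem from a TensorFlow problem, one option is a plain TCP connect test run from server2 against the ps address. The following is only a minimal sketch using Python's standard `socket` module; the helper name `can_connect` and the 5-second default timeout are my own assumptions and are not part of the inception code.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds.

    Hypothetical helper for diagnosis only; it does not speak gRPC,
    it only checks that the TCP handshake completes.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable hosts.
        return False
```

Calling `can_connect('43.254.55.221', 2222)` on server2 would exercise the same address that appears in the error above. If it returns `False` while the port is listening on server1, the cause is more likely somewhere on the network path between the two machines (for example a firewall rule) than in TensorFlow itself, although this test alone cannot confirm that.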

I am running the code as described in the README here, and I have not changed any of the code.