我和Spark有问题 . 我有一个带有2个节点的Spark独立集群,

  • master: 121.*.*.22(hostname is iZ28i1niuigZ)

  • worker: 123.*.*.125(hostname is VM-120-50-ubuntu) .

我编辑了 slaves 文件并添加了 123.*.*.125 .

WebUI上没有工作人员信息:

WebUI image of spark master

执行启动脚本时,我看到以下消息:

spark@iZ28i1niuigZ:~/spark-2.0.1-bin-hadoop2.7$ sh sbin/start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /home/spark/spark-2.0.1-bin-hadoop2.7/logs/spark-spark-org.apache.spark.deploy.master.Master-1-iZ28i1niuigZ.out
123.*.*.125: starting org.apache.spark.deploy.worker.Worker, logging to /home/spark/spark-2.0.1-bin-hadoop2.7/logs/spark-spark-org.apache.spark.deploy.worker.Worker-1-VM-120-50-ubuntu.out

spark-env.sh 文件内容为:

export SPARK_MASTER_IP=121.*.*.22
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_CORES=1
export SPARK_WORDER_INSTANCES=1
export SPARK_WORKER_MEMORY=1g
export JAVA_HOME=/home/spark/jdk1.8.0_101

在worker上我可以看到以下输出:

Spark Command: /home/spark/jdk1.8.0_101/bin/java -cp /home/spark/spark-2.0.1-bin-hadoop2.7/conf/:/home/spark/spark-2.0.1-bin-hadoop2.7/jars/* -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://iZ28i1niuigZ:7077 
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/11/30 20:04:56 INFO Worker: Started daemon with process name: 28287@VM-120-50-ubuntu
16/11/30 20:04:56 INFO SignalUtils: Registered signal handler for TERM
16/11/30 20:04:56 INFO SignalUtils: Registered signal handler for HUP
16/11/30 20:04:56 INFO SignalUtils: Registered signal handler for INT
16/11/30 20:04:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/11/30 20:04:56 INFO SecurityManager: Changing view acls to: spark
16/11/30 20:04:56 INFO SecurityManager: Changing modify acls to: spark
16/11/30 20:04:56 INFO SecurityManager: Changing view acls groups to:
16/11/30 20:04:56 INFO SecurityManager: Changing modify acls groups to:
16/11/30 20:04:56 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(spark); groups with view permissions: Set(); users  with modify permissions: Set(spark); groups with modify permissions: Set()
16/11/30 20:04:57 INFO Utils: Successfully started service 'sparkWorker' on port 41544.
16/11/30 20:04:57 INFO Worker: Starting Spark worker 10.141.120.50:41544 with 1 cores, 1024.0 MB RAM
16/11/30 20:04:57 INFO Worker: Running Spark version 2.0.1
16/11/30 20:04:57 INFO Worker: Spark home: /home/spark/spark-2.0.1-bin-hadoop2.7
16/11/30 20:04:57 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
16/11/30 20:04:57 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://10.141.120.50:8081
16/11/30 20:04:57 INFO Worker: Connecting to master iZ28i1niuigZ:7077...
16/11/30 20:04:58 WARN Worker: Failed to connect to master iZ28i1niuigZ:7077
org.apache.spark.SparkException: Exception thrown in awaitResult
        at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
        at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
        at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
        at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
        at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
        at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
        at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
        at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:88)
        at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:96)
        at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:216)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.jav    a:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to connect to iZ28i1niuigZ/121.*.*.22:7077
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
        at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
        at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:191)
        at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)
        ... 4 more
Caused by: java.net.ConnectException: Connection refused: iZ28i1niuigZ/121.*.*.22:7077
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
        ... 1 more
16/11/30 20:05:08 INFO Worker: Retrying connection to master (attempt # 1)
16/11/30 20:05:08 INFO Worker: Connecting to master iZ28i1niuigZ:7077...
16/11/30 20:05:08 WARN Worker: Failed to connect to master iZ28i1niuigZ:7077
org.apache.spark.SparkException: Exception thrown in awaitResult
        at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
        at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout    .scala:75)
        at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
        at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
        at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
        at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
        at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
        at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:88)
        at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:96)
        at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:216)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to connect to iZ28i1niuigZ/121.*.*.22:7077
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
        at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
        at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:191)
        at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)
        ... 4 more
Caused by: java.net.ConnectException: Connection refused: iZ28i1niuigZ/121.*.*.22:7077
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
        ... 1 more

主节点上的 /etc/hosts 如下所示:

127.0.0.1 localhost
127.0.1.1       localhost.localdomain   localhost
# The following lines are desirable for IPv6 capable hosts
::1     localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
10.251.33.226 iZ28i1niuigZ
123.*.*.125 VM-120-50-ubuntu

/etc/hosts 工作节点包含以下配置:

10.141.120.50 VM-120-50-ubuntu
127.0.0.1  localhost  localhost.localdomain
121.*.*.22 iZ28i1niuigZ

我无法理解为什么 Worker 无法连接到主人?

================================================== ======================更新:

我不能 telnet 123.*.*.125 7077 ,而我可以 telnet 123.*.*.125

执行命令时: iptables -L -n ,我看到以下消息:

Chain INPUT (policy ACCEPT)
target prot opt source destination
Chain FORWARD (policy ACCEPT)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination