I have a DSE cluster with 3 machines: 1, 2 and 3.

When I submit an application to the master, if I understand correctly, this is what happens:

  • The master receives the application and allocates resources

  • The driver starts running on the worker it was allocated to

  • The driver launches executors on the other nodes of the cluster to share the workload

So we have this configuration in the cluster:

  • 1 is the master and has worker 1

  • 2 is a slave and has worker 2

  • 3 is a slave and has worker 3

When Spark picks worker 1 (the master) for the driver, everything runs fine. But when Spark decides to assign the driver to worker 2 (slave) or worker 3 (slave), it tries to bind the master's IP and fails every time:

INFO  16:20:45  Changing view acls to: cassandra
INFO  16:20:45  Changing modify acls to: cassandra
INFO  16:20:45  SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(cassandra); users with modify permissions: Set(cassandra)
INFO  16:20:45  Slf4jLogger started
ERROR 16:20:46  failed to bind to /10.1.1.1:0, shutting down Netty transport
WARN  16:20:46  Service 'Driver' could not bind on port 0. Attempting port 1.
INFO  16:20:46  Slf4jLogger started
ERROR 16:20:46  failed to bind to /10.1.1.1:0, shutting down Netty transport
WARN  16:20:46  Service 'Driver' could not bind on port 0. Attempting port 1.

The configuration of each node is straightforward:

export SPARK_LOCAL_IP="10.1.1.1"      # or .2 or .3
export SPARK_PUBLIC_DNS="xx.xx.xx.xx"
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=7080
export SPARK_DRIVER_HOST="10.1.1.1"   # or .2 or .3
export SPARK_WORKER_INSTANCES=1
export SPARK_DRIVER_MEMORY="10G"
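To be concrete about the "or .2 or .3" comments above, this is how the two per-node values differ, shown here for node 2 (the other variables are identical on all three nodes):

```
# spark-env.sh on node 2 (10.1.1.2); node 1 uses .1, node 3 uses .3
export SPARK_LOCAL_IP="10.1.1.2"
export SPARK_DRIVER_HOST="10.1.1.2"
```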

I tried setting spark.driver.port in spark-defaults.conf, but it had no effect.
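For reference, the attempt looked like this (the port number here is just an example value, not significant in itself):

```
# spark-defaults.conf — attempted override; had no effect
spark.driver.port    20002
```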

This is the submit call:

/usr/bin/dse spark-submit --properties-file production.conf --master spark://10.1.1.1:7077 --deploy-mode cluster --class "com.company.SignalIO" aggregation.jar 2015-6-1-00:00:00 2015-6-2-00:00:00 signal_table

Any ideas?