
Spark memory/worker problems, and what is the correct Spark configuration?


I have 6 nodes in total in my Spark cluster. 5 nodes have 4 cores and 32 GB of RAM each, and one node (node 4) has 8 cores and 32 GB of RAM.

So altogether I have 6 nodes: 28 cores and 192 GB of RAM. (I want to use half of the memory, but all of the cores.)

I plan to run 5 Spark applications on the cluster.

My spark-defaults.conf is as follows:

spark.master                     spark://***:7077
spark.eventLog.enabled           false
spark.driver.memory              2g
worker_max_heapsize              2g
spark.kryoserializer.buffer.max.mb      128
spark.shuffle.file.buffer.kb    1024
spark.cores.max                 4
spark.dynamicAllocation.enabled true

I want to use at most 16 GB on each node and run 4 worker instances on each machine, by setting the configuration below. So I expect (4 instances * 6 nodes =) 24 workers in my cluster, together using up to 28 cores (all of them) and 96 GB of memory.

My spark-env.sh is as follows:

export SPARK_WORKER_MEMORY=16g
export SPARK_WORKER_INSTANCES=4
SPARK_LOCAL_DIRS=/app/spark/spark-1.6.1-bin-hadoop2.6/local
SPARK_WORKER_DIR=/app/spark/spark-1.6.1-bin-hadoop2.6/work

But my Spark cluster has started, and the Spark UI shows the following running workers:

Worker Id (Address)     State   Cores       Memory
worker-node4-address    ALIVE   8 (1 Used)  16.0 GB (0.0 GB Used)
worker-node4-address    ALIVE   8 (1 Used)  16.0 GB (0.0 GB Used)
worker-node4-address    ALIVE   8 (1 Used)  16.0 GB (0.0 GB Used)
worker-node4-address    ALIVE   8 (0 Used)  16.0 GB (0.0 B Used)
worker-node4-address    ALIVE   8 (1 Used)  16.0 GB (0.0 GB Used)
worker-node1-address    ALIVE   4 (0 Used)  16.0 GB (0.0 B Used)
worker-node1-address    ALIVE   4 (0 Used)  16.0 GB (0.0 B Used)
worker-node1-address    ALIVE   4 (0 Used)  16.0 GB (0.0 B Used)
worker-node1-address    ALIVE   4 (0 Used)  16.0 GB (0.0 B Used)

worker-node2-address    ALIVE   4 (0 Used)  16.0 GB (0.0 B Used)
worker-node2-address    ALIVE   4 (0 Used)  16.0 GB (0.0 B Used)
worker-node2-address    ALIVE   4 (0 Used)  16.0 GB (0.0 B Used)
worker-node2-address    ALIVE   4 (0 Used)  16.0 GB (0.0 B Used)

worker-node3-address    ALIVE   4 (0 Used)  16.0 GB (0.0 B Used)
worker-node3-address    ALIVE   4 (0 Used)  16.0 GB (0.0 B Used)
worker-node3-address    ALIVE   4 (0 Used)  16.0 GB (0.0 B Used)
worker-node3-address    ALIVE   4 (0 Used)  16.0 GB (0.0 B Used)

worker-node5-address    ALIVE   4 (0 Used)  16.0 GB (0.0 B Used)
worker-node5-address    ALIVE   4 (0 Used)  16.0 GB (0.0 B Used)
worker-node5-address    ALIVE   4 (0 Used)  16.0 GB (0.0 B Used)
worker-node5-address    ALIVE   4 (0 Used)  16.0 GB (0.0 B Used)

worker-node6-address    ALIVE   4 (0 Used)  16.0 GB (0.0 B Used)
worker-node6-address    ALIVE   4 (3 Used)  16.0 GB (0.0 GB Used)
worker-node6-address    ALIVE   4 (0 Used)  16.0 GB (0.0 B Used)
worker-node6-address    ALIVE   4 (0 Used)  16.0 GB (0.0 B Used)

But the master UI shows (when no application is running): Alive Workers: 25; Cores in use: 120 Total, 0 Used; Memory in use: 400.0 GB Total, 0.0 GB Used; Status: ALIVE.

Why are there 25 workers when I expected 24 (4 per node)? There is 1 extra on node 4, the one with 8 cores.

Why does it show Memory in use: 400.0 GB Total, when I allocated at most 16 GB on each node?

And why does the UI show 120 cores, when there are 28 cores in my cluster?

Can you tell me what Spark configuration my system should use?

How many cores and how much executor memory should I specify when I submit a Spark job?

What is the spark.cores.max parameter? Is it per node or for the whole cluster?

I ran 3 applications with the spark-submit options --executor-memory 2G --total-executor-cores 4. At least one of my applications gave the following error and failed:

Exception in thread "main" java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:714)
        at scala.concurrent.forkjoin.ForkJoinPool.tryAddWorker(ForkJoinPool.java:1672)
        at scala.concurrent.forkjoin.ForkJoinPool.signalWork(ForkJoinPool.java:1966)
        at scala.concurrent.forkjoin.ForkJoinPool.fullExternalPush(ForkJoinPool.java:1905)
        at scala.concurrent.forkjoin.ForkJoinPool.externalPush(ForkJoinPool.java:1834)
        at scala.concurrent.forkjoin.ForkJoinPool.execute(ForkJoinPool.java:2955)
        at scala.concurrent.impl.ExecutionContextImpl.execute(ExecutionContextImpl.scala:120)
        at scala.concurrent.impl.Future$.apply(Future.scala:31)
        at scala.concurrent.Future$.apply(Future.scala:485)
        at org.apache.spark.deploy.rest.RestSubmissionClient.readResponse(RestSubmissionClient.scala:232)
        at org.apache.spark.deploy.rest.RestSubmissionClient.org$apache$spark$deploy$rest$RestSubmissionClient$$postJson(RestSubmissionClient.scala:222)
        at org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$createSubmission$3.apply(RestSubmissionClient.scala:87)
        at org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$createSubmission$3.apply(RestSubmissionClient.scala:83)
        at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
        at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
        at org.apache.spark.deploy.rest.RestSubmissionClient.createSubmission(RestSubmissionClient.scala:83)
        at org.apache.spark.deploy.rest.RestSubmissionClient$.run(RestSubmissionClient.scala:411)
        at org.apache.spark.deploy.rest.RestSubmissionClient$.main(RestSubmissionClient.scala:424)
        at org.apache.spark.deploy.rest.RestSubmissionClient.main(RestSubmissionClient.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:195)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

1 Answer


    As far as I know, you should start only one worker per node:

    http://spark.apache.org/docs/latest/hardware-provisioning.html

    Multiple workers per node only make sense if you have more than 200 GB of RAM per node, and you don't. Could you set this in spark-env.sh on the nodes that have only 4 cores:

    export SPARK_EXECUTOR_CORES=4
    export SPARK_EXECUTOR_MEMORY=16GB
    export SPARK_MASTER_HOST=<Your Master-Ip here>
    

    And this on the node that has 8 cores:

    export SPARK_EXECUTOR_CORES=8
    export SPARK_EXECUTOR_MEMORY=16GB
    export SPARK_MASTER_HOST=<Your Master-Ip here>
    

    And this in spark-defaults.conf on the master node:

    spark.driver.memory              2g
    

    I think you should try this and comment out the other configurations for testing. Is that what you want? Your cluster should then use 96 GB and 28 cores in total. You can start your applications without --executor-memory 2G --total-executor-cores 4. But a java.lang.OutOfMemoryError can occur even without a wrong configuration; it also happens when you collect too much data to the driver, for example by calling collect() on a large RDD, which pulls every partition into the driver's heap.
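
    If you do want to cap each application explicitly, a minimal sketch of a submit command for the one-worker-per-node layout above could look like the following. The jar name and the numbers are assumptions for illustration: with 5 concurrent apps sharing a 16 GB worker on each node, roughly 3 GB per executor (5 apps x 3 GB <= 16 GB) and about 28 / 5 = 5 cores per app stay within budget.

    # Hypothetical submit for one of the 5 apps; your-app.jar is a placeholder.
    # 5 apps x 3 GB executors fit inside each worker's 16 GB, and
    # 5 apps x 5 cores roughly covers the cluster's 28 cores.
    spark-submit \
      --master spark://<your-master-ip>:7077 \
      --executor-memory 3g \
      --total-executor-cores 5 \
      your-app.jar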

    And yes, in your current configuration each worker gets 16 GB of RAM, so 25 workers * 16 GB = 400 GB in total.
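
    SPARK_WORKER_MEMORY applies per worker instance, not per node, which is also why your original setup oversubscribes memory: 4 instances x 16 GB = 64 GB per node. If you ever wanted to keep 4 instances per node and still use only 16 GB of it (an alternative to the single-worker setup above, not what I would recommend), a sketch would be:

    # Hypothetical alternative: 4 worker instances per node at 4 GB each,
    # so a node offers 4 x 4 GB = 16 GB instead of 4 x 16 GB = 64 GB.
    export SPARK_WORKER_INSTANCES=4
    export SPARK_WORKER_MEMORY=4g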
