Below is the setup being tested:

Job: The Spark SQL job is written in Scala and runs on 1 TB of TPC-DS benchmark data stored in Parquet format with Snappy compression, with Hive tables created on top of it.

Cluster manager: Kubernetes

Spark SQL configuration:

Set 1:

spark.executor.heartbeatInterval 20s
spark.executor.cores 4
spark.driver.cores 4
spark.driver.memory 15g
spark.executor.memory 15g
spark.cores.max 220
spark.rpc.numRetries 5
spark.rpc.retry.wait 5
spark.network.timeout 1800
spark.sql.broadcastTimeout 1200
spark.sql.crossJoin.enabled true
spark.sql.starJoinOptimization true
spark.eventLog.enabled true
spark.eventLog.dir hdfs://namenodeHA/tmp/spark-history
spark.sql.codegen true
spark.kubernetes.allocation.batch.size 30

Set 2:

spark.executor.heartbeatInterval 20s
spark.executor.cores 4
spark.driver.cores 4
spark.driver.memory 11g
spark.driver.memoryOverhead 4g
spark.executor.memory 11g
spark.executor.memoryOverhead 4g
spark.cores.max 220
spark.rpc.numRetries 5
spark.rpc.retry.wait 5
spark.network.timeout 1800
spark.sql.broadcastTimeout 1200
spark.sql.crossJoin.enabled true
spark.sql.starJoinOptimization true
spark.eventLog.enabled true
spark.eventLog.dir hdfs://namenodeHA/tmp/spark-history
spark.sql.codegen true
spark.kubernetes.allocation.batch.size 30

The Kryo serializer is used, with spark.kryoserializer.buffer.mb set to 64 MB. 50 executors are spawned via the spark.executor.instances=50 submit parameter.
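As a side note, in Spark 2.x the `spark.kryoserializer.buffer.mb` key has been superseded by `spark.kryoserializer.buffer`, which takes a size with a unit. A minimal serializer section of `spark-defaults.conf` under that assumption might look like:

```properties
# Kryo serializer settings (assumes Spark 2.x key names; the older
# spark.kryoserializer.buffer.mb key is deprecated in favor of
# spark.kryoserializer.buffer)
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer      64m
spark.executor.instances         50
```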

Issue observed:

The Spark SQL job terminates abruptly; the driver and executor pods are killed suddenly and at random.

Several different stack traces were found across different runs:

Stack trace 1:

"2018-05-10 06:31:28 ERROR ContextCleaner:91 - Error cleaning broadcast 136
org.apache.spark.SparkException: Exception thrown in awaitResult:
        at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)"

Attachment: StackTrace1.txt

Stack trace 2:

"org.apache.spark.shuffle.FetchFailedException: Failed to connect to /192.178.1.105:38039^M
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)^M
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)"

Attachment: StackTrace2.txt

Stack trace 3:

"18/05/10 11:21:17 WARN KubernetesTaskSetManager: Lost task 3.0 in stage 48.0 (TID 16486, 192.178.1.35, executor 41): FetchFailed(null, shuffleId=29, mapId=-1, reduceId=3, message=^M
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 29^M
        at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:697)^M
        at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:693)"

Attachment: StackTrace3.txt

Stack trace 4:

"ERROR KubernetesTaskSchedulerImpl: Lost executor 11 on 192.178.1.123: Executor lost for unknown reasons."

This keeps repeating until the executor is lost entirely, without any stack trace at all.

Also, we see "18/05/11 07:23:23 INFO DAGScheduler: failed: Set()" - what does this mean? Does it mean nothing is wrong, or does an empty failed set mean there are no failures?
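For reference, "Set()" in that log line appears to just be Scala's default toString of an empty Set; the DAGScheduler seems to be printing its collection of failed stages, so an empty set would mean no stages are currently marked as failed. A minimal sketch reproducing the formatting:

```scala
// Sketch: "Set()" is Scala's toString for an empty immutable Set.
// If the DAGScheduler's failed-stage set is empty, the log line reads
// "failed: Set()", i.e. no failed stages at that moment.
object FailedSetDemo {
  def main(args: Array[String]): Unit = {
    val failedStages = Set.empty[Int]       // no failed stage ids
    val logLine = s"failed: $failedStages"  // mimics the log format
    println(logLine)                        // prints "failed: Set()"
  }
}
```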

Observations made and changes tried:

  • Monitored memory and CPU utilization across the executors; none of them hit their limits.

  • Based on some reading and suggestions, spark.network.timeout was increased from 600 to 1800, but it did not help.

  • Also, the driver and executor memory overhead was left at its default in Set 1 of the configuration, which is 0.1 * 15g = 1.5 GB. It was then explicitly increased to 4 GB while reducing the driver and executor memory from 15 GB to 11 GB, as shown in Set 2. This did not yield any improvement; the same failures were observed.
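The sizing in the two sets can be sketched as follows, assuming Spark 2.x's default overhead formula of max(384 MB, 0.10 × heap); note that under this assumption Set 2 actually requests slightly less per pod than Set 1:

```scala
// Sketch of per-pod memory sizing in the two config sets, assuming
// Spark's default overhead formula: max(384 MB, 0.10 * heap).
object MemorySizing {
  def defaultOverheadMb(heapMb: Int): Int =
    math.max(384, (heapMb * 0.10).toInt)

  def main(args: Array[String]): Unit = {
    // Set 1: 15g heap, overhead left at default (0.10 * 15g = 1.5g)
    val set1Heap = 15 * 1024
    println(s"Set 1 pod request: ${set1Heap + defaultOverheadMb(set1Heap)} MB") // 16896 MB

    // Set 2: 11g heap + 4g explicit overhead = 15g total per pod
    val set2Heap     = 11 * 1024
    val set2Overhead = 4 * 1024
    println(s"Set 2 pod request: ${set2Heap + set2Overhead} MB") // 15360 MB
  }
}
```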

Spark SQL is used to run the queries; sample lines of code:

val qresult = spark.sql(q)
qresult.show()

No manual repartitioning is done in the code.