Here is the scenario being tested:
Job: The Spark SQL job is written in Scala and runs against the 1 TB TPC-DS benchmark dataset, stored in Parquet format with Snappy compression, with Hive tables created on top of it.
Cluster manager: Kubernetes
Spark SQL configuration:
Set 1:
spark.executor.heartbeatInterval 20s
spark.executor.cores 4
spark.driver.cores 4
spark.driver.memory 15g
spark.executor.memory 15g
spark.cores.max 220
spark.rpc.numRetries 5
spark.rpc.retry.wait 5
spark.network.timeout 1800
spark.sql.broadcastTimeout 1200
spark.sql.crossJoin.enabled true
spark.sql.starJoinOptimization true
spark.eventLog.enabled true
spark.eventLog.dir hdfs://namenodeHA/tmp/spark-history
spark.sql.codegen true
spark.kubernetes.allocation.batch.size 30
Set 2:
spark.executor.heartbeatInterval 20s
spark.executor.cores 4
spark.driver.cores 4
spark.driver.memory 11g
spark.driver.memoryOverhead 4g
spark.executor.memory 11g
spark.executor.memoryOverhead 4g
spark.cores.max 220
spark.rpc.numRetries 5
spark.rpc.retry.wait 5
spark.network.timeout 1800
spark.sql.broadcastTimeout 1200
spark.sql.crossJoin.enabled true
spark.sql.starJoinOptimization true
spark.eventLog.enabled true
spark.eventLog.dir hdfs://namenodeHA/tmp/spark-history
spark.sql.codegen true
spark.kubernetes.allocation.batch.size 30
The KryoSerializer is used, with spark.kryoserializer.buffer.mb
set to 64 MB. 50 executors are spawned via the submit parameter spark.executor.instances=50.
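As a side note, spark.kryoserializer.buffer.mb has been deprecated since Spark 1.4 in favor of spark.kryoserializer.buffer. A hedged sketch of how these submit-time settings could be expressed programmatically (illustration only, not the actual job's code):

```scala
import org.apache.spark.SparkConf

// Sketch: the same serializer and executor-count settings set via SparkConf.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer", "64m") // replaces the deprecated spark.kryoserializer.buffer.mb
  .set("spark.executor.instances", "50")     // 50 executors, as in the submit parameter
```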
Problem observed:
The Spark SQL job terminates abruptly; the driver and executors are killed at random. The driver and executor pods get killed suddenly.
Several different stack traces were found across different runs:
Stack trace 1:
"2018-05-10 06:31:28 ERROR ContextCleaner:91 - Error cleaning broadcast 136
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)"
Attachment: StackTrace1.txt
Stack trace 2:
"org.apache.spark.shuffle.FetchFailedException: Failed to connect to /192.178.1.105:38039
at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:442)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:418)"
Attachment: StackTrace2.txt
Stack trace 3:
"18/05/10 11:21:17 WARN KubernetesTaskSetManager: Lost task 3.0 in stage 48.0 (TID 16486, 192.178.1.35, executor 41): FetchFailed(null, shuffleId=29, mapId=-1, reduceId=3, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 29
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:697)
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:693)"
Attachment: StackTrace3.txt
Stack trace 4:
"ERROR KubernetesTaskSchedulerImpl: Lost executor 11 on 192.178.1.123: Executor lost for unknown reasons."
This keeps repeating until the executors are gone entirely, without any stack trace at all.
Also, we see "18/05/11 07:23:23 INFO DAGScheduler: failed: Set()". What does this mean? Is something wrong, or does a failed set being empty mean there were no failures?
Observations and changes tried:
- Monitored memory and CPU utilization across the executors; none of them hit the limits.
- Based on some reading and suggestions, spark.network.timeout was increased from 600 to 1800, but it did not help.
- The driver and executor memory overhead were left at their default in configuration Set 1, which is 0.1 * 15 GB = 1.5 GB. This value was then explicitly increased to 4 GB, with the driver and executor memory reduced from 15 GB to 11 GB, as shown in Set 2. This did not produce any useful result; the same failures were observed.
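The overhead arithmetic above can be double-checked: Spark's documented default for executor/driver memory overhead is max(10% of the heap, 384 MiB), so for a 15 GB heap it is indeed about 1.5 GB. A sketch of the numbers (not Spark's actual internal code):

```scala
// Default overhead rule: max(0.10 * memory, 384 MiB).
val heapMB     = 15 * 1024                              // 15g, as in Set 1
val overheadMB = math.max((heapMB * 0.10).toLong, 384L) // = 1536 MiB, i.e. ~1.5 GB
// The Kubernetes pod memory request is roughly heap + overhead:
val set1PodMB  = heapMB + overheadMB                    // ~16.5 GB per executor
val set2PodMB  = 11 * 1024 + 4 * 1024                   // Set 2: 11g + 4g = 15 GB
```

Note that the total pod request in Set 1 (15g + 1.5g) is larger than in Set 2 (11g + 4g), which is worth keeping in mind when comparing the two runs against node capacity.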
Spark SQL is used to run the queries. Sample lines of code:
val qresult = spark.sql(q)
qresult.show()
No manual repartitioning is done in the code.
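For completeness, a minimal self-contained version of the driver logic described above might look like this (the app name and query string are hypothetical placeholders, not taken from the actual job):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: run a TPC-DS query against the Hive tables built on the Parquet/Snappy data.
val spark = SparkSession.builder()
  .appName("tpcds-sql")   // hypothetical name
  .enableHiveSupport()    // the dataset is exposed through Hive tables
  .getOrCreate()

val q = "SELECT ..."      // placeholder for one of the TPC-DS queries
val qresult = spark.sql(q)
qresult.show()            // no manual repartitioning anywhere in the job
```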