Simply put, we have drawn a blank on this error. We are new to Spark, Hadoop and YARN, but we cannot find anything wrong with the job we are trying to launch. See the error below.

This is an intermittent problem. We can launch the job once and it runs fine; on the next iteration we may have to launch it 3 times before it runs. We have tried waiting anywhere from 1 second to 1 day between runs, with no difference.

2015-12-22 11:37:57,163 WARN nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:launchContainer(223)) - Exit code from container container_e15_1449773992897_0324_01_000001 is : 11
2015-12-22 11:37:57,163 WARN nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:launchContainer(229)) - Exception from container-launch with container ID: container_e15_1449773992897_0324_01_000001 and exit code: 11
ExitCodeException exitCode=11:
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:576)
    at org.apache.hadoop.util.Shell.run(Shell.java:487)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Where we are:

  • PySpark job on a 10-node Hortonworks-based Hadoop/Spark cluster (2 managers and 8 workers)

  • The code reads files from HDFS and aggregates the data (the error occurs both when saving to S3 and when saving back to HDFS)

  • Job submission: spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.1 --master yarn-cluster --driver-memory 30G --num-executors 8 --executor-cores 10 --executor-memory 50G request_test.py arg1 arg2

  • Spark version 1.4.1, Hadoop 2.3.2.

  • Configuration on the SparkContext:

conf = conf.set("spark.driver.maxResultSize", "8g")
conf = conf.set("spark.akka.frameSize", "1000")
conf = conf.set("spark.shuffle.blockTransferService", "nio")
conf = conf.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
conf = conf.set("spark.hadoop.validateOutputSpecs", "false")
conf = conf.set("spark.yarn.executor.memoryOverhead", "3000")
conf = conf.set("spark.yarn.driver.memoryOverhead", "3000")

Some of these were added in later runs to try to work around the error; a sketch of how they are applied is below.
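For context, this is roughly what the driver side of request_test.py looks like. It is only a sketch, not the actual script: the conf keys are the ones listed above and the final coalesce/saveAsTextFile call is the one the stdout traceback at the bottom points to (request_test.py line 292), but the input/output paths, the aggregation logic and the value of numFileOutput shown here are placeholders.

# Minimal sketch only -- not the real request_test.py.
from pyspark import SparkConf, SparkContext

conf = SparkConf()
conf = conf.set("spark.driver.maxResultSize", "8g")
conf = conf.set("spark.akka.frameSize", "1000")
conf = conf.set("spark.shuffle.blockTransferService", "nio")
conf = conf.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
conf = conf.set("spark.hadoop.validateOutputSpecs", "false")
conf = conf.set("spark.yarn.executor.memoryOverhead", "3000")
conf = conf.set("spark.yarn.driver.memoryOverhead", "3000")
sc = SparkContext(conf=conf)

# Placeholder for the HDFS read + aggregation described above; the real logic differs.
lines = sc.textFile("hdfs:///path/to/input")                      # hypothetical path
reducedOutput = lines.map(lambda l: (l.split(",")[0], 1)) \
                     .reduceByKey(lambda a, b: a + b) \
                     .map(lambda kv: "%s\t%d" % kv)

# The write the stdout traceback points to (request_test.py line 292):
numFileOutput = 10                                                # hypothetical value
outputPath = "hdfs:///path/to/output"                             # hypothetical path
reducedOutput.coalesce(numFileOutput).saveAsTextFile(outputPath)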

We have isolated the problem to the worker nodes by running the job and pulling the errors from the logs. Hadoop shows the job as Finished with a Failed status, and the Spark logs show the same exit code 11. We have been searching and cannot find anything about this exit code or what it means. We have been digging through the source code to trace the error and will keep doing so; in the meantime, we are hoping someone has seen this before and can help!
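As a side note on how the Finished/Failed status and the ApplicationMaster diagnostics (where the exit code is reported) can be pulled programmatically: the ResourceManager REST API exposes the application's state, finalStatus and diagnostics. A rough sketch; the ResourceManager hostname and port 8088 are assumptions for this cluster, and the application id is the one from the logs further down:

import json
import urllib2  # Python 2 (matching the Spark 1.4 era); on Python 3 use urllib.request

rm = "http://manager01.local:8088"           # assumed ResourceManager web address
app_id = "application_1449773992897_0449"    # application id taken from the container logs below

# Ask the ResourceManager REST API for this application's report.
req = urllib2.Request("%s/ws/v1/cluster/apps/%s" % (rm, app_id),
                      headers={"Accept": "application/json"})
app = json.load(urllib2.urlopen(req))["app"]

print(app["state"] + " / " + app["finalStatus"])  # e.g. FINISHED / FAILED
print(app["diagnostics"])                         # usually contains the AM exit code and failure reason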

EDIT Yet another trial: spark-submit --verbose --packages org.apache.hadoop:hadoop-aws:2.7.1 --master yarn-cluster --driver-memory 12G --num-executors 8 --executor-cores 8 --executor-memory 25G request_test.py arg1 arg2

Verbose output:

Parsed arguments:
  master                  yarn-cluster
  deployMode              null
  executorMemory          25G
  executorCores           8
  totalExecutorCores      null
  propertiesFile          /usr/hdp/2.3.2.0-2950/spark/conf/spark-defaults.conf
  driverMemory            12G
  driverCores             null
  driverExtraClassPath    null
  driverExtraLibraryPath  null
  driverExtraJavaOptions  -Dhdp.version=2.3.2.0-2950
  supervise               false
  queue                   null
  numExecutors            8
  files                   null
  pyFiles                 null
  archives                null
  mainClass               null
  primaryResource         file:/datavol/oozie/data-science/request_test.py
  name                    request_test.py
  childArgs               [20151227 MoPub]
  jars                    null
  packages                org.apache.hadoop:hadoop-aws:2.7.1
  repositories            null
  verbose                 true

Spark properties used, including those specified through --conf and those from the properties file /usr/hdp/2.3.2.0-2950/spark/conf/spark-defaults.conf:
  spark.yarn.queue -> default
  spark.history.kerberos.principal -> none
  spark.driver.memory -> 12G
  spark.yarn.max.executor.failures -> 3
  spark.yarn.historyServer.address -> manager01.local:18080
  spark.history.ui.port -> 18080
  spark.yarn.services -> org.apache.spark.deploy.yarn.history.YarnHistoryService
  spark.history.provider -> org.apache.spark.deploy.yarn.history.YarnHistoryProvider
  spark.yarn.applicationMaster.waitTries -> 10
  spark.yarn.scheduler.heartbeat.interval-ms -> 5000
  spark.yarn.executor.memoryOverhead -> 384
  spark.yarn.submit.file.replication -> 3
  spark.driver.extraJavaOptions -> -Dhdp.version=2.3.2.0-2950
  spark.yarn.containerLauncherMaxThreads -> 25
  spark.yarn.driver.memoryOverhead -> 384
  spark.history.kerberos.keytab -> none
  spark.yarn.am.extraJavaOptions -> -Dhdp.version=2.3.2.0-2950
  spark.yarn.preserve.staging.files -> false

System properties:
  spark.yarn.queue -> default
  spark.history.kerberos.principal -> none
  spark.executor.memory -> 25G
  spark.driver.memory -> 12G
  spark.yarn.max.executor.failures -> 3
  spark.yarn.historyServer.address -> manager01.local:18080
  spark.history.ui.port -> 18080
  spark.yarn.services -> org.apache.spark.deploy.yarn.history.YarnHistoryService
  spark.history.provider -> org.apache.spark.deploy.yarn.history.YarnHistoryProvider
  SPARK_SUBMIT -> true
  spark.submit.pyArchives -> pyspark.zip:py4j-0.8.2.1-src.zip
  spark.app.name -> request_test.py
  spark.yarn.applicationMaster.waitTries -> 10
  spark.yarn.submit.file.replication -> 3
  spark.yarn.executor.memoryOverhead -> 384
  spark.yarn.scheduler.heartbeat.interval-ms -> 5000
  spark.yarn.driver.memoryOverhead -> 384
  spark.yarn.containerLauncherMaxThreads -> 25
  spark.driver.extraJavaOptions -> -Dhdp.version=2.3.2.0-2950
  spark.history.kerberos.keytab -> none
  spark.yarn.am.extraJavaOptions -> -Dhdp.version=2.3.2.0-2950
  spark.yarn.preserve.staging.files -> false
  spark.master -> yarn-cluster

The error in the YARN application logs corresponds to the same worker/container as the YARN worker error above:

Container: container_e15_1449773992897_0449_01_000001 on worker04.local_45454
===============================================================================
LogType:stderr
Log Upload Time:Tue Dec 29 07:54:52 -0500 2015
LogLength:6861
Log Contents:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/storage/hadoop/yarn/local/usercache/hdfs/filecache/5753/spark-assembly-1.4.1.2.3.2.0-2950-hadoop2.7.1.2.3.2.0-2950.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.3.2.0-2950/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
15/12/29 07:53:12 WARN spark.SparkConf: The configuration key 'spark.yarn.applicationMaster.waitTries' has been deprecated as of Spark 1.3 and may be removed in the future. Please use the new key 'spark.yarn.am.waitTime' instead.
(the same WARN line is repeated many more times between 07:53:13 and 07:53:15)
[Stage 0:> (166 + 56) / 20180]
15/12/29 07:54:01 ERROR nio.ConnectionManager: Corresponding SendingConnection to ConnectionManagerId(worker02.local,42675) not found
15/12/29 07:54:01 ERROR nio.ConnectionManager: Corresponding SendingConnection to ConnectionManagerId(worker03.local,53260) not found
15/12/29 07:54:01 ERROR nio.ConnectionManager: Corresponding SendingConnection to ConnectionManagerId(worker06.local,55520) not found
15/12/29 07:54:01 WARN nio.ConnectionManager: All connections not cleaned up
End of LogType:stderr

LogType:stdout
Log Upload Time:Tue Dec 29 07:54:52 -0500 2015
LogLength:2726
Log Contents:
Traceback (most recent call last):
  File "request_test.py", line 292, in reducedOutput.coalesce(numFileOutput).saveAsTextFile(outputPath)
  File "/storage/hadoop/yarn/local/usercache/hdfs/appcache/application_1449773992897_0449/container_e15_1449773992897_0449_01_000001/pyspark.zip/pyspark/rdd.py", line 1486, in saveAsTextFile
  File "/storage/hadoop/yarn/local/usercache/hdfs/appcache/application_1449773992897_0449/container_e15_1449773992897_0449_01_000001/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/storage/hadoop/yarn/local/usercache/hdfs/appcache/application_1449773992897_0449/container_e15_1449773992897_0449_01_000001/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o92.saveAsTextFile.
: org.apache.spark.SparkException: Job cancelled because SparkContext was shut down
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:736)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:735)
    at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
    at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:735)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1475)
    at org.apache.spark.util.EventLoop.stop(EventLoop.scala:84)
    at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1410)
    at org.apache.spark.SparkContext.stop(SparkContext.scala:1644)
    at org.apache.spark.SparkContext$$anonfun$3.apply$mcV$sp(SparkContext.scala:559)
    at org.apache.spark.util.SparkShutdownHook.run(Utils.scala:2308)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(Utils.scala:2278)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(Utils.scala:2278)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(Utils.scala:2278)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(Utils.scala:2278)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:2278)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(Utils.scala:2278)
    at scala.util.Try$.apply(Try.scala:161)
    at org.apache.spark.util.SparkShutdownHookManager.runAll(Utils.scala:2278)
    at org.apache.spark.util.SparkShutdownHookManager$$anon$6.run(Utils.scala:2260)
    at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
End of LogType:stdout