I am trying to run PySpark on YARN, but I get the following error whenever I type any command in the console.

I can run the Scala shell in Spark in both local and yarn mode. PySpark works fine in local mode, but does not work in yarn mode.

Operating system: RHEL 6.x

Hadoop distribution: IBM BigInsights 4.0

Spark version: 1.2.1

WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, worker): org.apache.spark.SparkException:
Error from python worker:
  /usr/bin/python: No module named pyspark
PYTHONPATH was:
  /mnt/sdj1/hadoop/yarn/local/filecache/13/spark-assembly.jar
(My comment: this path does not exist on the local Linux filesystem, only on the data nodes)
java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
    at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
    at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
    at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:102)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:56)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

I have set SPARK_HOME and PYTHONPATH via export commands, as shown below:

export SPARK_HOME=/path/to/spark
export PYTHONPATH=/path/to/spark/python/:/path/to/spark/lib/spark-assembly.jar
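Note that variables exported on the driver machine are not automatically forwarded to the YARN containers that run the Python workers. As a hedged sketch (paths are placeholders, and this assumes the standard `spark.executorEnv.*` mechanism available in Spark 1.2), the executor-side PYTHONPATH can also be set explicitly when launching the shell:

```shell
# Sketch: forward PYTHONPATH to the YARN executors as well (adjust paths to your install)
export SPARK_HOME=/path/to/spark
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/lib/spark-assembly.jar

$SPARK_HOME/bin/pyspark \
  --master yarn-client \
  --conf spark.executorEnv.PYTHONPATH="$PYTHONPATH"
```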

Can someone help me resolve this issue?

Answer:

After some digging, I found that pyspark does indeed have some issues in BigInsights 4.0. We were advised to upgrade to BI 4.1.
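After upgrading, a quick smoke test (a sketch; the master and paths match the setup above) is to push one small action through the PySpark shell, which forces Python workers to start on the executors and will fail with the same "No module named pyspark" error if the problem persists:

```shell
# Smoke test: a small action through YARN; succeeds only if executors can import pyspark
echo 'print(sc.parallelize(range(100)).map(lambda x: x * 2).sum())' \
  | $SPARK_HOME/bin/pyspark --master yarn-client
```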