Versions:

spark-2.1.0-bin-hadoop2.7.tar.gz    
hadoop-2.7.3.tar.gz    
scala-2.12.6
PyCharm 2017.1.3
Anaconda3
Windows 8.1

Setup:

Sample code:

from __future__ import print_function

import sys
from random import random
from operator import add

from pyspark.sql import SparkSession


if __name__ == "__main__":
    """
        Usage: pi [partitions]
    """
    spark = SparkSession\
        .builder\
        .appName("PythonPi")\
        .getOrCreate()

    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    def f(_):
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 <= 1 else 0

    count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))

    spark.stop()
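
For context, this is essentially Spark's bundled pi.py example: it scatters n random points over the square [-1, 1] x [-1, 1] and relies on the fact that roughly a fraction pi/4 of them land inside the unit circle, so pi is approximately 4 * count / n. A minimal, Spark-free sketch of the same estimate (names here are chosen only for illustration):

from random import random

def estimate_pi(samples=200000):
    # Count points (x, y) drawn uniformly from [-1, 1]^2 that fall inside the
    # unit circle; that fraction approximates pi/4, so scale by 4.
    inside = 0
    for _ in range(samples):
        x = random() * 2 - 1
        y = random() * 2 - 1
        inside += x ** 2 + y ** 2 <= 1
    return 4.0 * inside / samples

print("Pi is roughly %f" % estimate_pi())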

Full error stack trace:

"D:\Program Files (x86)\Anaconda3\envs\my_new_env_python35\python.exe" "D:/pyProject/spark session/run-tests.py"
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Traceback (most recent call last):
  File "D:\Program Files (x86)\Anaconda3\lib\runpy.py", line 183, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "D:\Program Files (x86)\Anaconda3\lib\runpy.py", line 109, in _get_module_details
    __import__(pkg_name)
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load
  File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 646, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible
  File "D:\spark-2.1.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\__init__.py", line 44, in <module>
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load
  File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 646, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible
  File "D:\spark-2.1.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\context.py", line 36, in <module>
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load
  File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 646, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible
  File "D:\spark-2.1.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\java_gateway.py", line 25, in <module>
  File "D:\Program Files (x86)\Anaconda3\lib\platform.py", line 886, in <module>
    "system node release version machine processor")
  File "D:\spark-2.1.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\serializers.py", line 393, in namedtuple
TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module'
[Stage 0:>                                                          (0 + 2) / 2]
18/05/29 08:59:20 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: Python worker did not connect back in time
    at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:138)
    at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:67)
    at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:116)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:128)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.SocketTimeoutException: Accept timed out
    at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
    at java.net.DualStackPlainSocketImpl.socketAccept(DualStackPlainSocketImpl.java:135)
    at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
    at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:199)
    at java.net.ServerSocket.implAccept(ServerSocket.java:545)
    at java.net.ServerSocket.accept(ServerSocket.java:513)
    at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:133)
    ... 12 more
18/05/29 08:59:20 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.spark.SparkException: Python worker did not connect back in time
    [same SparkException / SocketTimeoutException stack trace as above]
18/05/29 08:59:20 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
Traceback (most recent call last):
  File "D:/pyProject/spark session/run-tests.py", line 27, in <module>
    count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
  File "D:\spark-2.1.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\rdd.py", line 835, in reduce
  File "D:\spark-2.1.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\rdd.py", line 809, in collect
  File "D:\spark-2.1.0-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\java_gateway.py", line 1133, in __call__
  File "D:\spark-2.1.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\utils.py", line 63, in deco
  File "D:\spark-2.1.0-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.spark.SparkException: Python worker did not connect back in time
    [same worker stack trace and SocketTimeoutException as above, ending with "... 12 more"]
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1958)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:935)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:934)
    at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:453)
    at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:354)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Python worker did not connect back in time
    [same worker stack trace as above]
    ... 1 more
Caused by: java.net.SocketTimeoutException: Accept timed out
    [same socket stack trace as above]
    ... 12 more

Process finished with exit code 1
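
Two distinct symptoms appear in that trace: a worker-side TypeError raised inside pyspark\serializers.py (namedtuple() missing 3 required keyword-only arguments), and, apparently as a consequence, the driver-side "Python worker did not connect back in time" timeout once the worker process dies during import. Also note that the worker traceback goes through D:\Program Files (x86)\Anaconda3\lib\runpy.py rather than the my_new_env_python35 env, which suggests the worker may be launched with a different interpreter than the driver. The TypeError itself is the message that appears when a function with keyword-only defaults is re-created from its code object without carrying over __kwdefaults__, which is what Spark 2.1.0's serializers do to collections.namedtuple on Python 3.6+. A minimal, standalone sketch of that mechanism (illustration only, not code from this project):

import types

def namedtuple_like(typename, field_names, *, verbose=False, rename=False, module=None):
    # Stand-in with the same signature as collections.namedtuple on Python 3.6+.
    return (typename, field_names, verbose, rename, module)

# Rebuild the function from its code object without copying __kwdefaults__,
# similar in spirit to how pyspark/serializers.py wraps namedtuple in Spark 2.1.0.
copied = types.FunctionType(namedtuple_like.__code__, namedtuple_like.__globals__,
                            namedtuple_like.__name__, namedtuple_like.__defaults__,
                            namedtuple_like.__closure__)

namedtuple_like("Point", ["x", "y"])   # fine: keyword-only defaults still apply
try:
    copied("Point", ["x", "y"])
except TypeError as exc:
    # namedtuple_like() missing 3 required keyword-only arguments:
    # 'verbose', 'rename', and 'module'
    print(exc)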

Update

1. Running pyspark from CMD works fine.

2. The Python version being used is 3.5.4.
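
Given updates 1 and 2, one check worth doing (a suggestion, not part of the original report) is to confirm which interpreter the executors actually launch, since the worker traceback above runs from the base Anaconda3 install. A small sketch; the PYSPARK_PYTHON path below is an assumption and should be adjusted to the actual env:

import os
import sys

from pyspark.sql import SparkSession

# Hypothetical path (adjust to the real env): make the executors use the same
# interpreter as the driver. Must be set before the SparkSession is created.
os.environ["PYSPARK_PYTHON"] = r"D:\Program Files (x86)\Anaconda3\envs\my_new_env_python35\python.exe"

spark = SparkSession.builder.appName("which-python").getOrCreate()
print("driver   python:", sys.version)
# Run a one-element job so a worker process reports its own interpreter version.
print("executor python:",
      spark.sparkContext.parallelize([0], 1).map(lambda _: __import__("sys").version).first())
spark.stop()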