
Broadcast Random Forest Model in PySpark


I am using Spark 1.4.1. When I try to broadcast a random forest model, it shows me this error:

Traceback (most recent call last):
  File "/gpfs/haifa/home/d/a/davidbi/codeBook/Nice.py", line 358, in <module>
    broadModel = sc.broadcast(model)
  File "/opt/apache/spark-1.4.1-bin-hadoop2.4_doop/python/lib/pyspark.zip/pyspark/context.py", line 698, in broadcast
  File "/opt/apache/spark-1.4.1-bin-hadoop2.4_doop/python/lib/pyspark.zip/pyspark/broadcast.py", line 70, in __init__
  File "/opt/apache/spark-1.4.1-bin-hadoop2.4_doop/python/lib/pyspark.zip/pyspark/broadcast.py", line 78, in dump
  File "/opt/apache/spark-1.4.1-bin-hadoop2.4_doop/python/lib/pyspark.zip/pyspark/context.py", line 252, in __getnewargs__
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

A sample of the code I am trying to execute:

from pyspark import SparkContext
from pyspark.mllib.tree import RandomForest

sc = SparkContext(appName="Something")
model = RandomForest.trainRegressor(sc.parallelize(data), categoricalFeaturesInfo=categorical, numTrees=100, featureSubsetStrategy="auto", impurity='variance', maxDepth=4)
broadModel = sc.broadcast(model)

If anyone can help me I would greatly appreciate it! Many thanks!

1 Answer


    The short answer is that it is not possible with PySpark. callJavaFunc, which prediction requires, uses the SparkContext, hence the error. It is possible to do something like this with the Scala API, though.

    In Python you can use the same approach as with a single model, meaning model.predict followed by zip.

    models = [model1, model2, model3]

    # collect one prediction RDD per model
    predictions = [
        model.predict(testData.map(lambda x: x.features)) for model in models]

    def flatten(x):
        # merge the nested ((a, b), c) produced by repeated zips into (a, b, c)
        if isinstance(x[0], tuple):
            return tuple(list(x[0]) + [x[1]])
        else:
            return x

    # on Python 3: from functools import reduce
    (testData
       .map(lambda lp: lp.label)
       .zip(reduce(lambda p1, p2: p1.zip(p2).map(flatten), predictions)))
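To see what the zip plus flatten combination produces, here is a plain-Python sketch of the same logic with lists standing in for RDDs (the labels and prediction values are made up for illustration):

```python
from functools import reduce

# per-model predictions for three test points (lists stand in for RDDs)
labels = [1.0, 0.0, 1.0]
predictions = [
    [0.9, 0.1, 0.8],  # model 1
    [0.8, 0.2, 0.7],  # model 2
    [1.0, 0.0, 0.9],  # model 3
]

def flatten(x):
    # turn the nested ((a, b), c) produced by repeated zips into (a, b, c)
    if isinstance(x[0], tuple):
        return tuple(list(x[0]) + [x[1]])
    return x

merged = reduce(lambda p1, p2: [flatten(t) for t in zip(p1, p2)], predictions)
result = list(zip(labels, merged))
print(result[0])  # → (1.0, (0.9, 0.8, 1.0))
```

Each output element pairs a label with the tuple of all three models' predictions for that point, which you can then combine (e.g. average) however you like.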
    

    If you want to learn more about the source of the problem, take a look at: How to use Java/Scala functions from an action or a transformation?
