I'm trying to access the individual tree elements of a random forest model in PySpark. In particular, I'm trying to get all of the predictions from the individual trees; I need these for a particular reason.

Unfortunately, the Spark ML API only exposes the individual trees, not their predictions.

  • Pro: the individual trees can be used to make predictions.

  • Con: it looks really slow.

First, I fit a simple random forest model to a dataset of n = 200, with a 70/30 train/test split.

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

# Assemble the feature columns into a single vector column
featureCols = ["age", "shoeSize", "score"]
assembler = VectorAssembler(inputCols=featureCols,
                            outputCol="features")

train_feat = assembler.transform(train)
test_feat = assembler.transform(test)

# Fit model (rf was not defined in the original snippet; 500 trees per the text below)
rf = RandomForestClassifier(labelCol="label", featuresCol="features",
                            numTrees=500)
model = rf.fit(train_feat)

Then I time:

  • predicting with the random forest model on the test set

  • predicting with a single tree on the test set

Results for the random forest:

# How fast is the overall random forest prediction?
%timeit model.transform(test_feat).select('rowNum','probability')
%timeit model.transform(test_feat).select('rowNum','probability').collect()
24.9 ms ± 2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.51 s ± 36.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Results for the individual tree:

# How fast is accessing a single tree?
%timeit model.trees[0].transform(test_feat).select('rowNum','probability')
%timeit model.trees[0].transform(test_feat).select('rowNum','probability').collect()
627 ms ± 12.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.12 s ± 280 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

KEY QUESTION: Why is it so much slower getting the results from an individual tree?

I actually need all the predictions from all the trees (i.e. an nData x nTrees set), so looping over the individual trees will be very slow. With 500 trees, I'm looking at ~0.6 s x 500 trees = at least ~5 minutes to get predictions from all of them.
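For reference, the naive per-tree loop I'm trying to avoid has this shape (a minimal pure-Python sketch, with stub predictors standing in for the slow `model.trees[i].transform(...)` calls in PySpark):

```python
import numpy as np

# Stub "trees": each maps a feature matrix to per-row predictions.
# These stand in for model.trees[i].transform(...) in PySpark.
def make_stub_tree(seed):
    threshold = np.random.default_rng(seed).uniform()
    return lambda X: (X[:, 0] > threshold).astype(float)

nTrees = 5
nData = 200
trees = [make_stub_tree(i) for i in range(nTrees)]
X = np.random.default_rng(0).uniform(size=(nData, 3))

# The slow pattern: one full pass over the data per tree,
# collecting an nData x nTrees matrix of predictions.
all_preds = np.column_stack([tree(X) for tree in trees])
print(all_preds.shape)  # (200, 5)
```

In Spark each iteration of that comprehension is a separate job over the whole test set, which is where the ~0.6 s per tree adds up.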

Is there a fast way to get all the individual tree predictions? Do I need to drop into Scala to do this?

Alternative: Is there a vectorised way to do this?

Even if getting probabilities from a single tree is somewhat slow, could I effectively vectorise this somehow with a map/reduce-style function, or farm out the individual computations without the overhead?

I tried to do this by creating a vector

treeNum = range(0,nTrees)

where nTrees = 500, and then computing the Cartesian join of this with my training data.

I tried to apply a UDF and index the required model by treeNum, but I couldn't work out how to do this in PySpark.
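To make the Cartesian-join idea concrete, here is a minimal pure-Python sketch of the shape I'm after: treeNum crossed with the data rows, then one prediction per (row, tree) pair. The `predict_with` helper is hypothetical, standing in for the UDF I couldn't write (which would have to dispatch to `model.trees[t]`):

```python
from itertools import product

nTrees = 3
treeNum = range(0, nTrees)

# Toy data: one feature per row
rows = [{"rowNum": i, "x": x} for i, x in enumerate([0.2, 0.7, 0.9])]

# Hypothetical per-tree predictors indexed by tree number,
# standing in for model.trees[t].transform(...) in PySpark
thresholds = [0.1, 0.5, 0.8]
def predict_with(t, row):
    return 1.0 if row["x"] > thresholds[t] else 0.0

# Cartesian join of rows x treeNum, one prediction per pair --
# this is the nData x nTrees set the UDF would have to produce
preds = [
    {"rowNum": r["rowNum"], "tree": t, "pred": predict_with(t, r)}
    for r, t in product(rows, treeNum)
]
print(len(preds))  # 9 pairs = 3 rows x 3 trees
```

The sticking point in real PySpark is the dispatch step: a plain UDF can't hold a reference to the fitted tree models, which is exactly where I got stuck.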