Spark decision tree

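I am reading the decision tree classification section on the following site: http://spark.apache.org/docs/latest/mllib-decision-tree.html

I built the sample code on my laptop and tried to understand its output, but I could not quite follow it. The code is below, and sample_libsvm_data.txt can be found at https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt

Please look at the output and let me know whether my interpretation is correct. Here is my interpretation.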

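Does the test error mean that the model, after being trained on the training data, is correct about 95% of the time?
(The strangest one) If feature 434 is greater than 0.0, then the prediction is 1, based on Gini impurity? For example, if the value is 434:178, then the prediction is 1.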

from __future__ import print_function
from pyspark import SparkContext
from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils

if __name__ == "__main__":
  sc = SparkContext(appName="PythonDecisionTreeClassificationExample")
  # Load the LIBSVM file as an RDD of LabeledPoint
  data = MLUtils.loadLibSVMFile(sc, '/home/spark/bin/sample_libsvm_data.txt')
  # Hold out 30% of the data for testing
  (trainingData, testData) = data.randomSplit([0.7, 0.3])

  # Empty categoricalFeaturesInfo: every feature is treated as continuous
  model = DecisionTree.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={}, impurity='gini', maxDepth=5, maxBins=32)

  # Predict on the held-out features and pair each true label with its prediction
  predictions = model.predict(testData.map(lambda x: x.features))
  labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
  # Index the pair rather than unpacking it in the lambda signature (invalid in Python 3)
  testErr = labelsAndPredictions.filter(lambda lp: lp[0] != lp[1]).count() / float(testData.count())

  print('Test Error = ' + str(testErr))
  print('Learned classification tree model:')
  print(model.toDebugString())

// =====Below is my output=====
Test Error = 0.0454545454545
Learned classification tree model:
DecisionTreeModel classifier of depth 1 with 3 nodes
If (feature 434 <= 0.0)
  Predict: 0.0
Else (feature 434 > 0.0)
  Predict: 1.0

2 Answers

2

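I believe you are right. Yes, your error rate is about 5%, so your model is correct about 95% of the time on the 30% of the data you held out for testing. According to your output (which I assume is correct; I have not run the code myself), yes, the only feature that determines an observation's class is feature 434: if it is less than or equal to 0.0 the prediction is 0, otherwise it is 1.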
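As a rough illustration of that reading, the printed tree can be applied by hand in plain Python. This is only a sketch: the `predict` helper and the dict representation of a sparse feature vector are hypothetical and not part of Spark. One caveat worth checking against your Spark version: as far as I know, MLUtils.loadLibSVMFile converts the file's one-based indices to zero-based, so the model's "feature 434" would correspond to an entry written as 435:... in the raw file.

```python
def predict(features):
    """Apply the depth-1 tree from the output above.

    `features` is a hypothetical sparse representation: a dict mapping the
    feature index (as reported by the model) to its value. Features that are
    absent from the dict count as 0.0, as in a sparse vector.
    """
    if features.get(434, 0.0) <= 0.0:
        return 0.0
    return 1.0

# Test error copied from the output above; accuracy is its complement.
test_error = 0.0454545454545
accuracy = 1.0 - test_error

print(predict({434: 178.0}))     # feature 434 > 0.0, so the tree predicts 1.0
print(predict({100: 55.0}))      # feature 434 is absent (0.0), so it predicts 0.0
print(round(accuracy * 100, 1))  # accuracy on the held-out test set, in percent
```

Note that the accuracy is measured on the 30% test split only, and because randomSplit is random the exact figure will vary from run to run unless a seed is fixed.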

Answered 2024-05-02T11:45:23+08:00
0

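Why, in Spark ML, are minInfoGain and the minimum number of instances per node not used to control tree growth when training a decision tree model? It is easy to overgrow the tree.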
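To make the two stopping criteria concrete, here is a minimal plain-Python sketch of how a minimum information gain and a minimum instance count per node could veto a split under Gini impurity. It is an illustration under my own assumptions, not Spark's implementation; the function names `gini`, `info_gain` and `split_allowed` are hypothetical. In the pyspark.mllib versions I am aware of, DecisionTree.trainClassifier also accepts minInstancesPerNode and minInfoGain keyword arguments, but please check the signature for your Spark version.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = float(len(labels))
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def info_gain(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = float(len(parent))
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

def split_allowed(parent, left, right, min_instances_per_node=1, min_info_gain=0.0):
    """Reject a split if a child is too small or the gain is below the threshold."""
    if len(left) < min_instances_per_node or len(right) < min_instances_per_node:
        return False
    return info_gain(parent, left, right) >= min_info_gain

# A perfectly separating split of six labels into two pure children.
parent = [0, 0, 0, 1, 1, 1]
left, right = [0, 0, 0], [1, 1, 1]

print(gini(parent))
print(info_gain(parent, left, right))
print(split_allowed(parent, left, right))
print(split_allowed(parent, left, right, min_instances_per_node=4))
```

Raising either threshold stops the tree from splitting on small or weakly informative groups, which is the usual way to limit overgrowth alongside maxDepth.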

Answered 2024-05-02T11:45:23+08:00
