
How do I use a decision tree with a dataset from a CSV file? [closed]


I'd like to use Spark MLlib's org.apache.spark.mllib.tree.DecisionTree as in the code below, but compilation fails.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.classification.DecisionTreeClassificationModel
import org.apache.spark.ml.feature.{StringIndexer, IndexToString, VectorIndexer}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.SparkSession

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val data = sqlContext.read.format("csv").load("C:/spark/spark-2.1.0-bin-hadoop2.7/data/mllib/airlines.txt")
val df = sqlContext.read.csv("C:/spark/spark-2.1.0-bin-hadoop2.7/data/mllib/airlines.txt")
val dataframe = sqlContext.createDataFrame(df).toDF("label");
val splits = data.randomSplit(Array(0.7, 0.3))

val (trainingData, testData) = (splits(0), splits(1))

val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val impurity = "gini"
val maxDepth = 5
val maxBins = 32
val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,impurity, maxDepth, maxBins)

Compilation fails with the following error message:

<console>:44: error: overloaded method value trainClassifier with alternatives:
  (input: org.apache.spark.api.java.JavaRDD[org.apache.spark.mllib.regression.LabeledPoint],numClasses: Int,categoricalFeaturesInfo: java.util.Map[Integer,Integer],impurity: String,maxDepth: Int,maxBins: Int)org.apache.spark.mllib.tree.model.DecisionTreeModel <and>
  (input: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint],numClasses: Int,categoricalFeaturesInfo: scala.collection.immutable.Map[Int,Int],impurity: String,maxDepth: Int,maxBins: Int)org.apache.spark.mllib.tree.model.DecisionTreeModel
 cannot be applied to (org.apache.spark.sql.Dataset[org.apache.spark.sql.Row], Int, scala.collection.immutable.Map[Int,Int], String, Int, Int)
       val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins)

1 Answer


    You are using the old RDD-based DecisionTree with Spark SQL's new Dataset API, hence the compile error:

    cannot be applied to (org.apache.spark.sql.Dataset[org.apache.spark.sql.Row], Int, scala.collection.immutable.Map[Int,Int], String, Int, Int)
    val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins)

    Note the type of the first input argument, org.apache.spark.sql.Dataset[org.apache.spark.sql.Row], whereas DecisionTree.trainClassifier requires an org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint] (or its Java equivalent).
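If you really need the old RDD-based API to compile, one option is to map each Row into a LabeledPoint first. This is a hedged sketch: it assumes the CSV holds the label in column 0 and numeric features in the remaining columns (the CSV reader yields string columns when no schema is given), which may not match the actual layout of airlines.txt:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree

// Convert the Dataset[Row] produced by the CSV reader into the
// RDD[LabeledPoint] that the old spark.mllib DecisionTree expects.
// Assumption: column 0 is the label, the rest are numeric features.
val labeledData = data.rdd.map { row =>
  val label = row.getString(0).toDouble
  val features = (1 until row.length).map(i => row.getString(i).toDouble).toArray
  LabeledPoint(label, Vectors.dense(features))
}

val Array(trainingData, testData) = labeledData.randomSplit(Array(0.7, 0.3))

val model = DecisionTree.trainClassifier(
  trainingData,
  numClasses = 2,
  categoricalFeaturesInfo = Map[Int, Int](),
  impurity = "gini",
  maxDepth = 5,
  maxBins = 32)
```

That said, since spark.mllib is in maintenance mode, the spark.ml approach below is the recommended fix.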

    Quoting Announcement: DataFrame-based API is primary API:

    As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The primary Machine Learning API for Spark is now the DataFrame-based API in the spark.ml package.

    Change your code to follow Decision trees instead:

    The spark.ml implementation supports decision trees for binary and multiclass classification and for regression, using both continuous and categorical features. The implementation partitions data by rows, allowing distributed training with millions or even billions of instances.
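A minimal spark.ml version could look like the following. This is a sketch, not a drop-in replacement: it assumes df already has a numeric "label" column and numeric feature columns (for string-typed or categorical columns you would first apply casts and a StringIndexer):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.VectorAssembler

// Assumption: `df` has a numeric "label" column; all other columns
// are numeric features. Assemble them into a single "features" vector.
val assembler = new VectorAssembler()
  .setInputCols(df.columns.filter(_ != "label"))
  .setOutputCol("features")

// Same hyperparameters as the spark.mllib version in the question.
val dt = new DecisionTreeClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setImpurity("gini")
  .setMaxDepth(5)
  .setMaxBins(32)

val pipeline = new Pipeline().setStages(Array(assembler, dt))

val Array(train, test) = df.randomSplit(Array(0.7, 0.3))
val model = pipeline.fit(train)

// Evaluate on the held-out split.
val predictions = model.transform(test)
val accuracy = new MulticlassClassificationEvaluator()
  .setMetricName("accuracy")
  .evaluate(predictions)
```

Note that randomSplit works directly on the DataFrame here, so no conversion to RDD[LabeledPoint] is needed anywhere in the pipeline.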
