Passing a Spark Scala DataFrame column as a vector

I'm relatively new to Scala and the Spark API toolkit, but I've hit a problem trying to use the VectorAssembler

http://spark.apache.org/docs/latest/ml-features.html#vectorassembler

and then feed the result into the matrix correlations utility

https://spark.apache.org/docs/2.1.0/mllib-statistics.html#correlations

The DataFrame column has dtype linalg.Vector:

val assembler = new VectorAssembler()

val trainwlabels3 = assembler.transform(trainwlabels2)

trainwlabels3.dtypes(0)

res90: (String, String) = (features,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7)

However, passing this column to the statistics tool as an RDD throws a type-mismatch error:

val data: RDD[Vector] = sc.parallelize(
  trainwlabels3("features")
) 

<console>:80: error: type mismatch;
 found   : org.apache.spark.sql.Column
 required: Seq[org.apache.spark.mllib.linalg.Vector]

Thanks in advance for any help.

Answers (2)

3 years ago

You should first select the column:

val features = trainwlabels3.select($"features")

then convert it to an RDD:

val featuresRDD = features.rdd

and finally map each Row to its vector:

featuresRDD.map(_.getAs[Vector]("features"))
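Putting the steps above together: note that the RDD-based Statistics.corr linked in the question expects the older org.apache.spark.mllib.linalg.Vector, while VectorAssembler emits the newer org.apache.spark.ml.linalg.Vector, so each vector also needs a conversion with Vectors.fromML. A minimal self-contained sketch, using a hypothetical three-row DataFrame in place of the question's trainwlabels3:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.linalg.{Vector => MLVector, Vectors => MLVectors}
import org.apache.spark.mllib.linalg.{Vectors => MLlibVectors}
import org.apache.spark.mllib.stat.Statistics

val spark = SparkSession.builder.master("local[*]").appName("corrSketch").getOrCreate()
import spark.implicits._

// Toy stand-in for the assembled trainwlabels3 from the question.
val trainwlabels3 = Seq(
  MLVectors.dense(1.0, 2.0),
  MLVectors.dense(2.0, 4.0),
  MLVectors.dense(3.0, 6.0)
).map(Tuple1.apply).toDF("features")

// select -> rdd -> map, converting ml vectors to mllib vectors along the way.
val data = trainwlabels3
  .select("features")
  .rdd
  .map(row => MLlibVectors.fromML(row.getAs[MLVector]("features")))

// Now the types line up with the RDD-based statistics API.
val corrMatrix = Statistics.corr(data, "pearson")
println(corrMatrix)
```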

3 years ago

This should work for you:

val rddForStatistics = new VectorAssembler()
   .transform(trainwlabels2)
   .select($"features")
   .as[Vector] // turns the Dataset[Row] (a.k.a. DataFrame) into a Dataset[Vector]
   .rdd

However, you should avoid RDDs and figure out how to do what you want with the DataFrame-based API (in the spark.ml package), because the RDD-based API is deprecated throughout MLlib.
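If your Spark version is 2.2 or later, the DataFrame-based route is org.apache.spark.ml.stat.Correlation, which computes the correlation matrix directly on the vector column with no RDD or mllib-vector conversion. A sketch, with toy data standing in for the assembled trainwlabels3:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.ml.stat.Correlation

val spark = SparkSession.builder.master("local[*]").appName("mlCorrSketch").getOrCreate()
import spark.implicits._

// Toy stand-in for the assembled DataFrame from the question.
val df = Seq(
  Vectors.dense(1.0, 0.5),
  Vectors.dense(2.0, 1.5),
  Vectors.dense(4.0, 2.5)
).map(Tuple1.apply).toDF("features")

// Correlation.corr returns a one-row DataFrame holding the correlation Matrix.
val Row(corr: Matrix) = Correlation.corr(df, "features").head
println(corr)
```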