I am using the multinomial NaiveBayes classifier in Apache Spark ML (version 2.1.0) to predict some text categories.

I use StringIndexer to convert the strings to labels, as follows:

import org.apache.spark.ml.feature.StringIndexer
val labelIndexer = new StringIndexer().setInputCol("name").setOutputCol("label").fit(trainData).setHandleInvalid("skip")

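It throws an exception when the test data to predict on contains only a single record and that record has an unseen label. If the test data contains a mix of seen and unseen labels, it works fine: the records with unseen labels are skipped in the prediction results.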

Exception in thread "main" java.util.NoSuchElementException: next on empty iterator
at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:64)
at scala.collection.IterableLike$class.head(IterableLike.scala:91)
at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:108)
at scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:120)
at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:108)
at org.apache.spark.sql.Dataset.head(Dataset.scala:1943)
at org.apache.spark.sql.Dataset.first(Dataset.scala:1950)
at org.apache.spark.ml.feature.VectorAssembler.first$lzycompute$1(VectorAssembler.scala:57)
at org.apache.spark.ml.feature.VectorAssembler.org$apache$spark$ml$feature$VectorAssembler$$first$1(VectorAssembler.scala:57)
at org.apache.spark.ml.feature.VectorAssembler$$anonfun$2$$anonfun$1.apply$mcI$sp(VectorAssembler.scala:88)
at org.apache.spark.ml.feature.VectorAssembler$$anonfun$2$$anonfun$1.apply(VectorAssembler.scala:88)
at org.apache.spark.ml.feature.VectorAssembler$$anonfun$2$$anonfun$1.apply(VectorAssembler.scala:88)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.ml.feature.VectorAssembler$$anonfun$2.apply(VectorAssembler.scala:88)
at org.apache.spark.ml.feature.VectorAssembler$$anonfun$2.apply(VectorAssembler.scala:58)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:108)
at org.apache.spark.ml.feature.VectorAssembler.transform(VectorAssembler.scala:58)
at org.apache.spark.ml.PipelineModel$$anonfun$transform$1.apply(Pipeline.scala:299)
at org.apache.spark.ml.PipelineModel$$anonfun$transform$1.apply(Pipeline.scala:299)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:108)
at org.apache.spark.ml.PipelineModel.transform(Pipeline.scala:299)
at com.infostretch.machinelearning.sample.GroupingByNaiveBayesExample$.main(GroupingByNaiveBayesExample.scala:111)
at com.infostretch.machinelearning.sample.GroupingByNaiveBayesExample.main(GroupingByNaiveBayesExample.scala)
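
The frames `Dataset.first` and `VectorAssembler.first$lzycompute$1` in the trace suggest that the failing call is `first()` on a Dataset that has no rows left, presumably because the only test record was skipped. The following is a minimal sketch of that one call in isolation, not my actual pipeline; the object name `EmptyFirstDemo` and the helper `firstOnEmptyFails` are made up for illustration, and a local SparkSession is assumed.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

// Hypothetical demo object, not part of the original application.
object EmptyFirstDemo {
  // Returns true if calling first() on an empty DataFrame throws
  // java.util.NoSuchElementException, the exception type seen in the trace.
  def firstOnEmptyFails(spark: SparkSession): Boolean = {
    import spark.implicits._
    // Same columns as the test data, but zero rows.
    val empty = Seq.empty[(Int, String, String)].toDF("id", "name", "text")
    Try(empty.first()).failed.toOption
      .exists(_.isInstanceOf[java.util.NoSuchElementException])
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("EmptyFirstDemo")
      .getOrCreate()
    println(firstOnEmptyFails(spark))
    spark.stop()
  }
}
```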

Training data:

id,group,name,text
1,apple,abc,a b c d
2,orange,def,x y z

Test data:

id,name,text
3,pqr,a b x

Here, when predicting on the test data, the value 'pqr' in the name field is unseen by the model.
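
One possible guard, sketched under assumptions: filter the test data down to the labels the fitted indexer knows, and only run the pipeline if any rows remain. `UnseenLabelGuard`, `keepSeen` and `hasRows` are hypothetical names introduced here, and `pipelineModel` and `testData` in the usage comment stand in for objects that are not shown in this post. I have not confirmed that this is the recommended approach.

```scala
import org.apache.spark.ml.feature.StringIndexerModel
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper, not part of Spark.
object UnseenLabelGuard {
  // Keep only the rows whose input-column value was seen when the
  // indexer was fitted (the model exposes those values as `labels`).
  def keepSeen(indexer: StringIndexerModel, df: DataFrame): DataFrame =
    df.filter(col(indexer.getInputCol).isin(indexer.labels.toSeq: _*))

  // head(1) fetches at most one row, so this avoids a full count().
  def hasRows(df: DataFrame): Boolean = df.head(1).nonEmpty
}

// Usage sketch (pipelineModel and testData are assumed to exist):
//   val seen = UnseenLabelGuard.keepSeen(labelIndexer, testData)
//   if (UnseenLabelGuard.hasRows(seen)) pipelineModel.transform(seen).show()
//   else println("No test rows with a known label; skipping prediction.")
```

With the test data above, `keepSeen` would drop the single 'pqr' row, so `hasRows` would return false and the pipeline would not be invoked.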