Filtering stop words in Spark

I am trying to filter stop words out of an RDD of words read from a .txt file.

// Creating the RDDs
val input = sc.textFile("../book.txt")
val stopWordsInput = sc.textFile("../stopwords.csv")
val stopWords = stopWordsInput.map(x => x.split(","))

// Create a tuple of test words
val testWords = ("you", "to")

// Split using a regular expression that extracts words
val wordsWithStopWords = input.flatMap(x => x.split("\\W+"))

The code above makes sense to me and seems to work fine. This is where I run into trouble.

//Remove the stop words from the list
val words = wordsWithStopWords.filter(x => x != testWords)

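This runs, but it does not actually filter out the words contained in the tuple testWords. I am not sure how to test each word in wordsWithStopWords against every word in my testWords tuple.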

2 Answers

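You are comparing each String against the tuple ("you", "to"). A String is never equal to a tuple, so the equality test is always false, x != testWords is always true, and nothing is filtered out.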

This is what you want to try instead:

val testWords = Set("you", "to")
wordsWithStopWords.filter(!testWords.contains(_))

// Simulating the RDD with a List (works the same with RDD)
List("hello", "to", "yes") filter (!testWords.contains(_))
// res30: List[String] = List(hello, yes)
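
To make the difference concrete, here is a minimal sketch in plain Scala (no Spark needed) that runs both predicates on the same list. The names unfiltered, stopSet, and filtered are only illustrative, and the snippet assumes Scala 2, where comparing a String with a tuple compiles with a warning.

```scala
// Plain Scala 2 sketch; a List stands in for the RDD.
val testWords = ("you", "to")             // a Tuple2[String, String], as in the question
val words = List("hello", "to", "yes")

// A String never equals a Tuple2, so this predicate is true for every word
// and the filter keeps everything.
val unfiltered = words.filter(x => x != testWords)

// A Set provides a real membership test for each word.
val stopSet = Set("you", "to")
val filtered = words.filter(w => !stopSet.contains(w))

println(unfiltered) // List(hello, to, yes)
println(filtered)   // List(hello, yes)
```

The same predicate, w => !stopSet.contains(w), can be passed unchanged to RDD.filter.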

Answered on 2024-04-29T15:16:43+08:00

You can use a broadcast variable to filter your stop words out of the RDD:

import org.apache.spark.rdd.RDD

// Creating the RDDs
val input = sc.textFile("../book.txt")
val stopWordsInput = sc.textFile("../stopwords.csv")

// Flatten, collect, and broadcast.
val stopWords = stopWordsInput.flatMap(x => x.split(",")).map(_.trim)
val broadcastStopWords = sc.broadcast(stopWords.collect.toSet)

// Split using a regular expression that extracts words
val wordsWithStopWords: RDD[String] = input.flatMap(x => x.split("\\W+"))
// Keep only the words that are not in the broadcast stop-word set
val words = wordsWithStopWords.filter(!broadcastStopWords.value.contains(_))

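Broadcast variables let you keep a read-only variable cached on each machine rather than shipping a copy of it with every task. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner (which is also the case here).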
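
As a rough end-to-end sketch of the broadcast approach, the following runs in local mode on small in-memory data instead of the files from the question. The object name BroadcastStopWords, the helper filterStopWords, and the sample sentences are hypothetical; it assumes the Spark core library is on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object BroadcastStopWords {
  // Hypothetical helper: removes stop words from an RDD of text lines.
  def filterStopWords(sc: SparkContext,
                      lines: RDD[String],
                      stopWordLines: RDD[String]): RDD[String] = {
    // Flatten the comma-separated stop words, trim them, and collect to the driver.
    val stopWords = stopWordLines.flatMap(_.split(",")).map(_.trim).collect().toSet
    // Ship the set to each executor once instead of once per task.
    val broadcastStopWords = sc.broadcast(stopWords)
    lines
      .flatMap(_.split("\\W+"))
      .filter(w => !broadcastStopWords.value.contains(w))
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("broadcast-stop-words").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // In-memory stand-ins for book.txt and stopwords.csv (sample data, not from the question).
      val lines = sc.parallelize(Seq("you have to go", "to be or not"))
      val stopWordLines = sc.parallelize(Seq("you, to", "or"))
      val words = filterStopWords(sc, lines, stopWordLines)
      println(words.collect().mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```

The match is case-sensitive, so "You" would not be removed by a stop word "you" unless the words are normalized first.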

Answered on 2024-04-29T15:16:43+08:00
