Filtering stop words in Spark

Posted: 2017-01-12 16:38:13

Tags: scala apache-spark

I am trying to remove stop words from an RDD of words read from a .txt file.

// Creating the RDDs
val input = sc.textFile("../book.txt")
val stopWordsInput = sc.textFile("../stopwords.csv")
val stopWords = stopWordsInput.map(x => x.split(","))

// Create a tuple of test words
val testWords = ("you", "to")

// Split using a regular expression that extracts words
val wordsWithStopWords = input.flatMap(x => x.split("\\W+"))

The code above makes sense to me and seems to work fine. This is where I run into trouble.

//Remove the stop words from the list
val words = wordsWithStopWords.filter(x => x != testWords)

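This runs, but it does not actually filter out the words contained in the tuple testWords. I am not sure how to test each word in wordsWithStopWords against every word in my testWords tuple.

3 Answers:

Answer 0: (score: 4)

You can use a broadcast variable to filter your stop words out of the RDD: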

// Needed for the explicit RDD[String] type annotation below
import org.apache.spark.rdd.RDD

// Creating the RDDs
val input = sc.textFile("../book.txt")
val stopWordsInput = sc.textFile("../stopwords.csv")

// Flatten, collect, and broadcast.
val stopWords = stopWordsInput.flatMap(x => x.split(",")).map(_.trim)
val broadcastStopWords = sc.broadcast(stopWords.collect.toSet)

// Split using a regular expression that extracts words
val wordsWithStopWords: RDD[String] = input.flatMap(x => x.split("\\W+"))

// Keep only the words that are not in the broadcast stop-word set
val words = wordsWithStopWords.filter(word => !broadcastStopWords.value.contains(word))

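A broadcast variable lets you keep a read-only value cached on each machine instead of shipping a copy of it with every task. Broadcast variables can be used, for example, to give every node a copy of a large input dataset efficiently, which is the case here.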
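To try the broadcast pattern without the input files, here is a minimal sketch. It assumes a standalone Scala script with spark-core on the classpath, run in local mode (in spark-shell, `sc` already exists and must not be created again). The helper name `filterWithBroadcast`, the app name, and the sample words are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper: removes every word found in `stops`,
// reading the stop-word set through a broadcast variable.
def filterWithBroadcast(sc: SparkContext, words: Seq[String], stops: Set[String]): Seq[String] = {
  val stopsBc = sc.broadcast(stops)          // sent to each executor once
  sc.parallelize(words)
    .filter(w => !stopsBc.value.contains(w)) // tasks read the cached copy
    .collect()
    .toSeq
}

// Assumption: local mode, for experimentation only
val conf = new SparkConf().setAppName("stop-word-sketch").setMaster("local[*]")
val sc = new SparkContext(conf)

val kept =
  try filterWithBroadcast(sc, Seq("hello", "to", "you", "world"), Set("you", "to"))
  finally sc.stop()

println(kept.mkString(", "))
```

Note that the closure only touches `stopsBc.value`, so the set itself is not serialized with each task.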

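Answer 1: (score: 3)

You are comparing a String with the tuple ("you", "to"). A String is never equal to a tuple, so x != testWords is true for every word and nothing gets filtered out.

Here is what you want to try: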

val testWords = Set("you", "to")
wordsWithStopWords.filter(!testWords.contains(_))

// Simulating the RDD with a List (works the same with RDD)
List("hello", "to", "yes") filter (!testWords.contains(_))
// res30: List[String] = List(hello, yes)
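The difference between the two predicates can be checked side by side on a plain List. This is only an illustrative sketch: the names `testTuple`, `keepByTupleComparison`, `keepBySetMembership`, and the sample words are invented here, and `equals` is used in place of `!=` because some compiler versions warn about or reject a direct String-to-tuple comparison.

```scala
// A tuple is one value of type (String, String), not a collection of words.
val testTuple = ("you", "to")
val testWords = Set("you", "to")

// A String never equals a tuple, so this predicate keeps every word.
def keepByTupleComparison(word: String): Boolean = !word.equals(testTuple)

// Membership test against a Set: false for stop words, true otherwise.
def keepBySetMembership(word: String): Boolean = !testWords.contains(word)

val sample = List("hello", "to", "yes", "you")
val unfiltered = sample.filter(keepByTupleComparison) // same as sample
val filtered = sample.filter(keepBySetMembership)     // stop words removed

println(unfiltered)
println(filtered)
```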

Answer 2: (score: 0)

Use subtractByKey:

// Creating the RDDs
val input = sc.textFile("../book.txt")
val stopWordsInput = sc.textFile("../stopwords.csv")

// Split using a regular expression that extracts words from input RDD
val wordsWithInput = input.flatMap(x => x.split("\\W+"))


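//Converting above RDDs to lowercase
val lowercaseInput = wordsWithInput.map(x => x.toLowerCase())
//The stop words file is comma-separated, so split each line into single words first
val lowercaseStopWordsInput = stopWordsInput.flatMap(x => x.split(",")).map(x => x.trim.toLowerCase())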

//Creating a tuple(word, 1) using map for above RDDs
val tupleInput = lowercaseInput.map(x => (x,1))
val tupleStopWordsInput = lowercaseStopWordsInput.map(x => (x,1))

//using subtractByKey
val tupleWords = tupleInput.subtractByKey(tupleStopWordsInput)

//to have only words in RDD
val words = tupleWords.keys
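To make the effect of subtractByKey easier to see, here is a rough in-memory analogue on plain Scala collections. It is not Spark code: `subtractByKeyLocal` is a hypothetical helper written for this illustration, and the sample words are invented. It keeps every pair on the left whose key never appears on the right, including repeated words.

```scala
// Hypothetical local analogue of subtractByKey for in-memory pairs:
// keeps each (key, value) from `left` whose key is absent from `right`.
def subtractByKeyLocal[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, V)] = {
  val rightKeys = right.map(_._1).toSet
  left.filterNot { case (key, _) => rightKeys.contains(key) }
}

// Same shape as the answer: lowercase the words, pair each with 1
val tupleInput = Seq("The", "cat", "sat", "on", "the", "cat").map(w => (w.toLowerCase, 1))
val tupleStopWords = Seq("the", "on").map(w => (w, 1))

// Drop the stop-word keys, then keep only the words
val localWords = subtractByKeyLocal(tupleInput, tupleStopWords).map(_._1)

println(localWords.mkString(", "))
```

The local version preserves input order; on an RDD the order of the result should not be relied on.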