I am trying to remove stop words from an RDD of words read from a .txt file.
// Creating the RDDs
val input = sc.textFile("../book.txt")
val stopWordsInput = sc.textFile("../stopwords.csv")
val stopWords = stopWordsInput.map(x => x.split(","))
// Create a tuple of test words
val testWords = ("you", "to")
// Split using a regular expression that extracts words
val wordsWithStopWords = input.flatMap(x => x.split("\\W+"))
The code above makes sense to me and seems to work fine. This is where I run into trouble.
//Remove the stop words from the list
val words = wordsWithStopWords.filter(x => x != testWords)
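This runs, but it does not actually filter out the words contained in the tuple testWords. I am not sure how to test each word in wordsWithStopWords against every word in my testWords tuple.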
Answer 0 (score: 4)
You can use a broadcast variable to filter the stop words out of your RDD:
// Creating the RDDs
val input = sc.textFile("../book.txt")
val stopWordsInput = sc.textFile("../stopwords.csv")
// Flatten, collect, and broadcast.
val stopWords = stopWordsInput.flatMap(x => x.split(",")).map(_.trim)
val broadcastStopWords = sc.broadcast(stopWords.collect.toSet)
// Split using a regular expression that extracts words
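import org.apache.spark.rdd.RDD // needed for the RDD[String] type annotation
val wordsWithStopWords: RDD[String] = input.flatMap(x => x.split("\\W+"))
// Keep only the words that are not in the broadcast stop word set
val words = wordsWithStopWords.filter(!broadcastStopWords.value.contains(_))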
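Broadcast variables let you keep a read-only variable cached on each machine rather than shipping a copy of it with every task. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner, which is what happens here with the stop word set.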
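To try the broadcast approach without the two input files, here is a minimal local-mode sketch. The object name `BroadcastStopWordsSketch`, the in-memory sample data, and the added lowercasing and empty-token filtering are assumptions made for illustration and are not part of the answer above.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastStopWordsSketch {
  // Runs the broadcast-based filter on small in-memory data (hypothetical sample).
  def run(): List[String] = {
    val conf = new SparkConf().setAppName("stopwords-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Stand-ins for book.txt and stopwords.csv
      val input = sc.parallelize(Seq("You have to see this", "to be or not to be"))
      val stopWordsInput = sc.parallelize(Seq("you, to", "be,or"))

      // Flatten the CSV lines, normalize, collect to the driver, and broadcast
      val stopWords = stopWordsInput
        .flatMap(_.split(","))
        .map(_.trim.toLowerCase)
        .collect()
        .toSet
      val broadcastStopWords = sc.broadcast(stopWords)

      input
        .flatMap(_.split("\\W+"))
        .map(_.toLowerCase)
        // split can yield empty tokens, so drop those as well (assumption)
        .filter(w => w.nonEmpty && !broadcastStopWords.value.contains(w))
        .collect()
        .toList
    } finally {
      sc.stop()
    }
  }
}
```

Collecting the stop words to the driver only makes sense when the list fits comfortably in memory, which is usually the case for stop word lists.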
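Answer 1 (score: 3)
You are comparing each String against the tuple ("you", "to"). A String is never equal to a tuple, so x != testWords is always true and nothing is filtered out.
Here is what you want to try: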
val testWords = Set("you", "to")
wordsWithStopWords.filter(!testWords.contains(_))
// Simulating the RDD with a List (works the same with RDD)
List("hello", "to", "yes") filter (!testWords.contains(_))
// res30: List[String] = List(hello, yes)
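The difference between the two predicates can be seen side by side in plain Scala, without Spark. This is only an illustrative sketch; the object name `TupleVsSet` and the sample list are made up for the demonstration, and `equals` is used in place of `!=` purely so that the comparison between unrelated types is explicit.

```scala
object TupleVsSet {
  val testTuple: (String, String) = ("you", "to")
  val testSet: Set[String] = Set("you", "to")
  val sample: List[String] = List("hello", "to", "yes", "you")

  // A String never equals a Tuple2, so this predicate holds for every word
  // and the filter removes nothing.
  val keptWithTuple: List[String] = sample.filter(w => !w.equals(testTuple))

  // Membership in a Set is checked per word, so the stop words are removed.
  val keptWithSet: List[String] = sample.filter(w => !testSet.contains(w))
}
```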
Answer 2 (score: 0)
Use subtractByKey:
// Creating the RDDs
val input = sc.textFile("../book.txt")
val stopWordsInput = sc.textFile("../stopwords.csv")
// Split using a regular expression that extracts words from input RDD
val wordsWithInput = input.flatMap(x => x.split("\\W+"))
// Convert both RDDs to lowercase; split the CSV stop word lines into individual words first
val lowercaseInput = wordsWithInput.map(x => x.toLowerCase())
val lowercaseStopWordsInput = stopWordsInput.flatMap(x => x.split(",")).map(x => x.trim.toLowerCase())
//Creating a tuple(word, 1) using map for above RDDs
val tupleInput = lowercaseInput.map(x => (x,1))
val tupleStopWordsInput = lowercaseStopWordsInput.map(x => (x,1))
//using subtractByKey
val tupleWords = tupleInput.subtractByKey(tupleStopWordsInput)
//to have only words in RDD
val words = tupleWords.keys
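If both sides are already RDDs of plain words, the (word, 1) tuples may not be needed, since `RDD.subtract` works on the elements directly. The following is a minimal local-mode sketch under that assumption; the object name `SubtractStopWordsSketch` and the sample data are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SubtractStopWordsSketch {
  // Removes stop words with RDD.subtract on small in-memory data (hypothetical sample).
  def run(): List[String] = {
    val conf = new SparkConf().setAppName("subtract-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val words = sc.parallelize(Seq("you", "have", "to", "see", "this", "see"))
      val stopWords = sc.parallelize(Seq("you", "to"))

      // subtract keeps every element of words that does not occur in stopWords;
      // repeated non-stop words are kept. Sorted only to make the output stable.
      words.subtract(stopWords).collect().toList.sorted
    } finally {
      sc.stop()
    }
  }
}
```

Both `subtract` and `subtractByKey` shuffle data between partitions, so for a small stop word list the broadcast set in Answer 0 is likely the cheaper option.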