The code I am running looks like this:
val termCounts: Array[(String, Long)] = tokenized.flatMap(_.map(_ -> 1L)).reduceByKey(_ + _).collect().sortBy(-_._2)
// vocabArray: Chosen vocab (removing common terms)
val numStopwords = 20
val stopWord = sc.wholeTextFiles(".../stopword.txt")
val vocabArray1: Array[String] =
termCounts.takeRight(termCounts.size - numStopwords).map(_._1)
val vocabArray = vocabArray1 diff stopWord
As you can see, I want to use the diff function, but it only works when both sides hold elements of the same type.
Answer 0 (score: 1)
When you use sc.wholeTextFiles("/root/folder/to/textfiles/"), it reads each part file in that folder as a single string.
So, if your setup is
/root/folder/to/textfiles/
.../part1.txt
.../part2.txt
.../part3.txt
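then part1.txt, part2.txt, and part3.txt are each read as a single record. Your RDD[(String, String)] will therefore hold one pair per file: the file's path, and the whole file's contents as one string.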
Like this:
("/root/folder/to/textfiles/part1.txt", "actual contents of part1.txt as a String"),
("/root/folder/to/textfiles/part2.txt", "actual contents of part2.txt as a String")
...
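To make the record shape concrete, here is a minimal sketch that uses a local Seq as a stand-in for the RDD. The paths, file contents, and vocabulary are made up for illustration, and no Spark is involved. It suggests why diffing against the raw contents has no effect: diff compares whole elements, and no vocabulary term equals an entire file.

```scala
// Local stand-in for the RDD[(String, String)] that sc.wholeTextFiles returns.
// The paths and file contents below are invented for illustration.
object WholeFileRecords {
  val records: Seq[(String, String)] = Seq(
    ("/root/folder/to/textfiles/part1.txt", "a\nthe\nof"),
    ("/root/folder/to/textfiles/part2.txt", "and\nto")
  )

  // One record per file: the second element is the entire file, newlines included.
  val contents: Seq[String] = records.map(_._2)

  // Hypothetical vocabulary that still contains two stop words.
  val vocab: Array[String] = Array("spark", "the", "rdd", "of")

  // diff compares whole elements; "the" is not equal to "a\nthe\nof",
  // so nothing is removed here.
  val unchanged: Array[String] = vocab diff contents
}
```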
You probably want to tokenize the actual contents of each file before using them.
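// tokenize is assumed to be your own String => Seq[String] helper; tokenize(_._2) would pass it a function, not the contents
stopWord.flatMap { case (_, contents) => tokenize(contents) }.collect()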
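A rough sketch of the whole idea follows, again on local collections rather than a real RDD. The tokenize function here is a hypothetical helper (lowercase, split on non-word characters), not something from the question, and the sample data is made up. Once the stop words are flattened into individual strings, both sides of diff hold String elements.

```scala
object StopWordFilter {
  // Hypothetical tokenizer: lowercases and splits on runs of non-word characters.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq

  // Made-up (path, contents) pairs standing in for the wholeTextFiles output.
  val stopWordFiles: Seq[(String, String)] = Seq(
    ("/root/folder/to/textfiles/part1.txt", "a\nthe\nof"),
    ("/root/folder/to/textfiles/part2.txt", "and\nto")
  )

  // Flatten every file into individual stop words, ignoring the path.
  val stopWords: Seq[String] =
    stopWordFiles.flatMap { case (_, contents) => tokenize(contents) }

  // Hypothetical vocabulary before stop-word removal.
  val vocabArray1: Array[String] = Array("spark", "the", "rdd", "of", "cluster")

  // Both sides now hold String elements, so diff compiles and removes the matches.
  val vocabArray: Array[String] = vocabArray1 diff stopWords

  // Alternative: a Set membership test, which drops every occurrence of a stop word
  // (diff removes only as many occurrences as appear on the right-hand side).
  val stopWordSet: Set[String] = stopWords.toSet
  val vocabFiltered: Array[String] = vocabArray1.filterNot(stopWordSet.contains)
}
```

With a real RDD, the same flatMap followed by collect() should give an Array[String] on the driver, which can then be passed to diff in the same way.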