I have a task to find the most frequently used strings in a huge file (gigabytes). I wrote the following Spark program to create an RDD from the file.
val conf = new SparkConf()                          // initialize SparkConf
val sc = new SparkContext(conf)                     // initialize SparkContext
val input = sc.textFile("..../input path")          // load the input path
val words = input.flatMap(line => line.split(" "))  // split into words
val counts = words.map(word => (word, 1)).reduceByKey { case (x, y) => x + y } // count repeated words
val topcount = counts.top(5)                        // select top 5
However, this top 5 does not give me the most frequently used words; it only returns the top elements of the RDD produced after flatMap.
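The root of the problem can be seen with plain Scala tuples, no Spark needed (a minimal sketch with made-up words and counts): `top` uses the natural ordering of `(String, Int)` pairs, which compares the word first, so an alphabetically late but rare word beats a frequent one.

```scala
object TopOrderingDemo {
  def main(args: Array[String]): Unit = {
    // Hypothetical (word, count) pairs, as produced by the counting step
    val counts = List(("apple", 100), ("zebra", 1), ("mango", 7))

    // Natural tuple ordering compares the String first, then the Int,
    // so the "largest" pairs are the alphabetically last words, not the
    // most frequent ones.
    val top = counts.sorted(Ordering[(String, Int)].reverse).take(2)
    println(top) // List((zebra,1), (mango,7)) -- counts are ignored
  }
}
```

This is exactly why `counts.top(5)` above returns words near the end of the alphabet rather than the words with the highest counts.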
Answer (score: 0)
After getting the count for each word, sort in descending order.
val conf = new SparkConf()                          // initialize SparkConf
val sc = new SparkContext(conf)                     // initialize SparkContext
val input = sc.textFile("..../input path")          // load the input path
val words = input.flatMap(line => line.split(" "))  // split into words
val wordCounts = words.map(x => (x, 1)).reduceByKey((x, y) => x + y) // count repeated words
// Flip (word, count) tuples to (count, word) and then sort by key (the counts)
val wordCountsSorted = wordCounts.map(x => (x._2, x._1)).sortByKey(false, 1)
val topcount = wordCountsSorted.take(5)             // take the first 5, i.e. the top 5
After the reduceByKey operation we get an RDD[(word, count)], e.g. [("we", 5)]. We then swap the key and value to build wordCountsSorted, giving pairs like [(5, "we")], so that sortByKey can order them by the count.
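The same flip-and-sort idea can be checked without a cluster on a plain Scala collection (a sketch with invented input lines; `groupBy` plus `size` stands in for `reduceByKey`):

```scala
object FlipSortDemo {
  def main(args: Array[String]): Unit = {
    val lines = List("we are what we do", "we do what we can")

    // Build (word, count) pairs -- groupBy + size mimics reduceByKey
    val wordCounts = lines
      .flatMap(_.split(" "))
      .groupBy(identity)
      .map { case (w, ws) => (w, ws.size) }

    // Flip to (count, word), then sort by the count, descending
    val sorted = wordCounts.toList
      .map { case (w, c) => (c, w) }
      .sortBy(-_._1)

    println(sorted.take(2)) // the most frequent word, "we", comes first
  }
}
```

Note that ties between equal counts come out in no particular order here, just as in Spark unless a secondary ordering is specified.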
sortByKey([ascending], [numTasks]): when called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by key in ascending or descending order, as specified by the boolean ascending argument.
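The effect of the ascending flag can be mimicked on an ordinary Scala list (a sketch of the behaviour described above using `sortBy`, not Spark itself; the pairs are made up):

```scala
object SortByKeyDemo {
  def main(args: Array[String]): Unit = {
    val pairs = List((5, "we"), (2, "do"), (9, "what"))

    // ascending = true: smallest key first
    val asc = pairs.sortBy(_._1)
    // ascending = false: largest key first, as used in the answer above
    val desc = pairs.sortBy(-_._1)

    println(asc)  // List((2,do), (5,we), (9,what))
    println(desc) // List((9,what), (5,we), (2,do))
  }
}
```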
Reference: here