Top N in a huge list

Time: 2017-01-12 00:38:43

Tags: scala apache-spark

My task is to find the most frequently used strings in a huge file (gigabytes). I wrote the following Spark program to build an RDD from the file.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()                          // initialize SparkConf
val sc = new SparkContext(conf)                     // initialize SparkContext
val input = sc.textFile("..../input path")          // load the input path
val words = input.flatMap(line => line.split(" "))  // split each line into words
val counts = words.map(word => (word, 1)).reduceByKey { case (x, y) => x + y } // count how often each word repeats
val topcount = counts.top(5)                        // select top 5

However, this top 5 does not give me the most frequently used words; it just returns the top elements after the flatMap.
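The likely cause is that top on an RDD[(String, Int)] uses the default tuple Ordering, which compares the word (the first element) before the count, so top(5) returns the words that sort last alphabetically rather than the ones with the highest counts. As a minimal sketch of one alternative (assuming the counts RDD built above), top also accepts an explicit Ordering, so ordering by the count field returns the most frequent words directly:

// order the (word, count) pairs by their count (second element) instead of the default tuple ordering
val byCount = Ordering.by[(String, Int), Int](_._2)
val top5ByCount = counts.top(5)(byCount)   // Array of the 5 most frequent (word, count) pairs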

1 Answer:

Answer 0 (score: 0):

After you get the count for each word, sort the results in descending order.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()                          // initialize SparkConf
val sc = new SparkContext(conf)                     // initialize SparkContext
val input = sc.textFile("..../input path")          // load the input path
val words = input.flatMap(line => line.split(" "))  // split each line into words
val wordCounts = words.map(x => (x, 1)).reduceByKey((x, y) => x + y) // count how often each word repeats
// Flip (word, count) tuples to (count, word) and then sort by key (the counts)
val wordCountsSorted = wordCounts.map(x => (x._2, x._1)).sortByKey(false, 1)
val topcount = wordCountsSorted.top(5)              // select top 5

After the reduceByKey operation we get an RDD[(word, count)], for example [("we", 5)]. Then, in wordCountsSorted, we flip each key-value pair to get something like [(5, "we")], so that we can apply the sortByKey operation to sort by the counts.
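To make the flip concrete, here is a small sketch on made-up sample data (the words and counts are only for illustration):

// hypothetical sample (word, count) pairs standing in for the reduceByKey output
val sample = sc.parallelize(Seq(("we", 5), ("spark", 3), ("scala", 7)))
val flipped = sample.map { case (word, count) => (count, word) }  // e.g. ("we", 5) becomes (5, "we")
val sortedDesc = flipped.sortByKey(false)                         // sort by count, descending
sortedDesc.collect()  // Array((7,scala), (5,we), (3,spark))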

sortByKey([ascending], [numTasks]): when called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by key in ascending or descending order, as specified by the boolean ascending argument.
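A short sketch of that behaviour on a toy pair RDD (the data is made up for illustration):

val pairs = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b")))
pairs.sortByKey().collect()         // ascending (default): Array((1,a), (2,b), (3,c))
pairs.sortByKey(false).collect()    // descending: Array((3,c), (2,b), (1,a))
pairs.sortByKey(false, 1).collect() // descending, using a single partition (numTasks = 1)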

Reference: here