I have a task to find the most frequently used strings in a huge file (gigabytes). I wrote the following Spark program to create an RDD from the file.
val conf = new SparkConf()                          // initialize SparkConf
val sc = new SparkContext(conf)                     // initialize SparkContext
val input = sc.textFile("..../input path")          // load the input path
val words = input.flatMap(line => line.split(" "))  // split into words
val counts = words.map(word => (word, 1)).reduceByKey { case (x, y) => x + y } // count repeated words
val topcount = counts.top(5)                        // select top 5
However, this top 5 does not give me the most frequently used words; it only returns the top elements of the RDD produced after flatMap.
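The root of the problem can be seen with plain Scala tuples, no Spark needed (a minimal sketch with made-up words and counts): `top` uses the natural ordering of `(String, Int)` pairs, which compares the word first, so an alphabetically late but rare word beats a frequent one.

```scala
object TopOrderingDemo {
  def main(args: Array[String]): Unit = {
    // Hypothetical (word, count) pairs, as produced by the counting step
    val counts = List(("apple", 100), ("zebra", 1), ("mango", 7))

    // Natural tuple ordering compares the String first, then the Int,
    // so the "largest" pairs are the alphabetically last words, not the
    // most frequent ones.
    val top = counts.sorted(Ordering[(String, Int)].reverse).take(2)
    println(top) // List((zebra,1), (mango,7)) -- counts are ignored
  }
}
```

This is exactly why `counts.top(5)` above returns words near the end of the alphabet rather than the words with the highest counts.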
Answer (score: 0)
After getting the count for each word, sort in descending order.
val conf = new SparkConf()                          // initialize SparkConf
val sc = new SparkContext(conf)                     // initialize SparkContext
val input = sc.textFile("..../input path")          // load the input path
val words = input.flatMap(line => line.split(" "))  // split into words
val wordCounts = words.map(x => (x, 1)).reduceByKey((x, y) => x + y) // count repeated words
// Flip (word, count) tuples to (count, word) and then sort by key (the counts)
val wordCountsSorted = wordCounts.map(x => (x._2, x._1)).sortByKey(false, 1)
val topcount = wordCountsSorted.take(5)             // take the first 5, i.e. the top 5
After the reduceByKey operation we get an RDD[(word, count)], e.g. [("we", 5)]. We then swap the key and value to build wordCountsSorted, giving pairs like [(5, "we")], so that sortByKey can order them by the count.
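The same flip-and-sort idea can be checked without a cluster on a plain Scala collection (a sketch with invented input lines; `groupBy` plus `size` stands in for `reduceByKey`):

```scala
object FlipSortDemo {
  def main(args: Array[String]): Unit = {
    val lines = List("we are what we do", "we do what we can")

    // Build (word, count) pairs -- groupBy + size mimics reduceByKey
    val wordCounts = lines
      .flatMap(_.split(" "))
      .groupBy(identity)
      .map { case (w, ws) => (w, ws.size) }

    // Flip to (count, word), then sort by the count, descending
    val sorted = wordCounts.toList
      .map { case (w, c) => (c, w) }
      .sortBy(-_._1)

    println(sorted.take(2)) // the most frequent word, "we", comes first
  }
}
```

Note that ties between equal counts come out in no particular order here, just as in Spark unless a secondary ordering is specified.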
sortByKey([ascending], [numTasks]): when called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by key in ascending or descending order, as specified by the boolean ascending argument.
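The effect of the ascending flag can be mimicked on an ordinary Scala list (a sketch of the behaviour described above using `sortBy`, not Spark itself; the pairs are made up):

```scala
object SortByKeyDemo {
  def main(args: Array[String]): Unit = {
    val pairs = List((5, "we"), (2, "do"), (9, "what"))

    // ascending = true: smallest key first
    val asc = pairs.sortBy(_._1)
    // ascending = false: largest key first, as used in the answer above
    val desc = pairs.sortBy(-_._1)

    println(asc)  // List((2,do), (5,we), (9,what))
    println(desc) // List((9,what), (5,we), (2,do))
  }
}
```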
Reference: here