spark rank - scala基于RDD元组的一秒和三分之一

时间:2015-09-30 06:31:42

标签: scala sorting apache-spark rdd rank

嗨我想根据元组的第二个元素和第三个元素为每一行分配一个等级,这里我们有样本数据。我想添加" 1 "如果元组的第三个元素对id有最大值。如果元组的第三个元素具有相同的值,则基于元组的第二个元素,即第二个元素元组的最大值应该具有" 1 "作为第四个要素。元组值的所有其他第四个元素将为零。我希望你理解这个要求:

    (ID,Second,Third)->tuple
    (32609,878,199)
    (32609,832,199)
    (45470,231,199)
    (42482,1001,299)
    (42482,16,291)

代码:      * val Rank = matching.map {{case(x1,x2,x3)=> (X1,X2,X3,((x3.toInt * 100000)+ x2.toInt).toInt)}。sortBy(-_。 4).groupBy( ._ 1)*

结果:rank.take(10).foreach(println)

(32609,CompactBuffer((32609,878,199,19900878), (32609,832,199,19900832)))
(45470,CompactBuffer((45470,231,199,19900231)))
(42482,CompactBuffer((42482,1001,299,29901001), (42482,16,291,29100016)))

所需的输出将是:

(32609,878,199,1)
(32609,832,199,0)
(45470,231,199,1)
(42482,1001,299,1)
(42482,16,291,0)

2 个答案:

答案 0 :(得分:1)

似乎您可以尝试以下内容:

 val rank = matching.flatMap { case (x: String, y: String, z: String) => 
    val yInt = Try(y.toInt)
    val zInt = Try(z.toInt)
    if (yInt.isSuccess && zInt.isSuccess) Option((x, (yInt.get, zInt.get)))
    else None
 }.groupByKey().flatMap { case (key: String, tuples: Iterable[(Int, Int)]) =>
     val sorted = tuples.toList.sortBy(x => (-x._2, -x._1))
     val topRank = (key, sorted.head._1, sorted.head._2, 1)
     val restRank = for (tup <- sorted.tail) yield (key, tup._1, tup._2, 0)
     List(topRank) ++ restRank
 }

初始flatMap执行一些类型检查并将元组重新排序成对。第二个flatMap(在groupByKey之后)分别对第3和第2个元素进行排序,然后重新创建具有排名的元组。另请注意,您需要导入scala.util.Try才能使用此功能。

编辑:根据以下评论修改排名顺序。

答案 1 :(得分:1)

这应该可以解决问题

object App {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("Test").setMaster("local[4]")
    val sc = new SparkContext(sparkConf)

    val testData = List((32609,878,199),
      (32609,832,199),
      (45470,231,199),
      (42482,1001,299),
      (42482,16,291))

    val input = sc.parallelize(testData)

    val rank = input.groupBy(_._1).flatMapValues{
      x =>
        val sorted = x.toList.sortWith((x, y) => x._2 > y._2 || (x._2 == y._2 && x._3 > y._3))
        val first = sorted.head
        (first._1, first._2, first._3, 1) :: sorted.tail.map(t => (t._1, t._2, t._3, 0))
    }.map(_._2)

    // assign the partition ID to each item to see that each group is sorted
    val resultWithPartitionID = rank.mapPartitionsWithIndex((id, it) => it.map(x => (id, x)))

    // print the contents of the RDD, elements of different partitions might be interleaved
    resultWithPartitionID foreach println

    val collectedResult = resultWithPartitionID.collect.sortBy(_._1).map(_._2)

    // print collected results
    println(collectedResult.mkString("\n"))
  }
}

输出

(32609,878,199,1)
(32609,832,199,0)
(45470,231,199,1)
(42482,1001,299,1)
(42482,16,291,0)