Question

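I have two pair RDDs that I joined on the same key, and I now want to sort the result by one of the values. The type of the joined RDD is: RDD[((String, Int), Iterable[((String, DateTime, Int, Int), (String, DateTime, String, String))])]

The first part is the pair RDD key, and the Iterable part holds the values of the two RDDs I joined. I now want to sort them by the Time field of the second RDD. I tried using the sortBy function, but I got errors.

Any ideas?

Thanks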

Answer 1

Spark has a mapValues method for RDDs. I think it will help you.

    def mapValues[U](f: (V) ⇒ U): RDD[(K, U)]
    Pass each value in the key-value pair RDD through a map function without
    changing the keys; this also retains the original RDD's partitioning.

The Spark documentation has more details.
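To make the mapValues suggestion concrete, here is a minimal sketch of a per-key sorting function. It is only an illustration under stated assumptions: java.time.LocalDateTime stands in for Joda's DateTime, a local Seq stands in for the values of one key, and the names FirstValue, SecondValue, byTime and sortGroupByTime are hypothetical. Note that this orders the values inside each key and leaves the order of the RDD's records untouched.

```scala
import java.time.LocalDateTime

object SortWithinGroups {
  // Hypothetical aliases mirroring the value shapes in the question;
  // LocalDateTime is an assumed stand-in for Joda's DateTime.
  type FirstValue  = (String, LocalDateTime, Int, Int)
  type SecondValue = (String, LocalDateTime, String, String)

  // Explicit ordering for the time type, passed to sortBy by hand.
  val byTime: Ordering[LocalDateTime] =
    Ordering.fromLessThan[LocalDateTime](_.isBefore(_))

  // Sorts one key's values by the time field of the second tuple.
  def sortGroupByTime(values: Iterable[(FirstValue, SecondValue)]): Seq[(FirstValue, SecondValue)] =
    values.toSeq.sortBy(_._2._2)(byTime)

  def main(args: Array[String]): Unit = {
    val t = LocalDateTime.of(2015, 1, 1, 0, 0)
    val group: Iterable[(FirstValue, SecondValue)] = Seq(
      (("a", t, 1, 1), ("x", t.plusHours(2), "p", "q")),
      (("b", t, 2, 2), ("y", t.plusHours(1), "p", "q"))
    )
    // On a real pair RDD the same function could be passed along as:
    //   rdd.mapValues(sortGroupByTime)
    println(sortGroupByTime(group).map(_._2._1)) // List(y, x)
  }
}
```

Keeping the sort logic in a plain function like this makes it easy to check locally before handing it to mapValues.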

Answer 2

You can use the sortBy function:

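val yourRdd: RDD[((String, Int), Iterable[((String, DateTime, Int, Int), (String, DateTime, String, String))])] = ??? // your cogroup operation here

// Sorts records by the time field of the first pair in each Iterable.
// Records whose Iterable is empty are not matched and throw a MatchError.
val result = yourRdd.sortBy({
  case ((str, i), iter) if iter.nonEmpty => iter.head._2._2
}, ascending = true)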

iter.head has type ((String, DateTime, Int, Int), (String, DateTime, String, String));

iter.head._2 has type (String, DateTime, String, String); and

iter.head._2._2 is indeed a DateTime.

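You may also need to provide an implicit Ordering for DateTime. By the way, can the Iterable be empty? If so, you should add that case to the sortBy function. And if the Iterable contains many items, which one should be used for sorting?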
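One possible way to handle both open points (empty groups, and groups with several items) is to sort on an Option key built from the earliest time in each group. The sketch below is only an illustration under stated assumptions: java.time.LocalDateTime stands in for Joda's DateTime, a local Seq stands in for the RDD, and the names FirstValue, SecondValue and earliestTime are hypothetical. Picking the earliest time is an arbitrary choice; the latest time or the first element would work the same way.

```scala
import java.time.{LocalDateTime, ZoneOffset}

object GroupSortKey {
  // Hypothetical aliases mirroring the value shapes in the question;
  // LocalDateTime is an assumed stand-in for Joda's DateTime.
  type FirstValue  = (String, LocalDateTime, Int, Int)
  type SecondValue = (String, LocalDateTime, String, String)

  // Earliest time of the second tuple in a group, as epoch seconds,
  // or None when the group is empty.
  def earliestTime(values: Iterable[(FirstValue, SecondValue)]): Option[Long] =
    if (values.isEmpty) None
    else Some(values.map(_._2._2.toEpochSecond(ZoneOffset.UTC)).min)

  def main(args: Array[String]): Unit = {
    val t = LocalDateTime.of(2015, 1, 1, 0, 0)
    val records: Seq[((String, Int), Iterable[(FirstValue, SecondValue)])] = Seq(
      (("k1", 1), Seq((("a", t, 1, 1), ("x", t.plusDays(2), "p", "q")))),
      (("k2", 2), Seq.empty),
      (("k3", 3), Seq(
        (("b", t, 1, 1), ("y", t.plusDays(3), "p", "q")),
        (("c", t, 1, 1), ("z", t.plusDays(1), "p", "q"))))
    )
    // On a real RDD the equivalent call would be:
    //   yourRdd.sortBy { case (_, values) => earliestTime(values) }
    // Scala's built-in Ordering for Option places None before any Some.
    val sorted = records.sortBy { case (_, values) => earliestTime(values) }
    println(sorted.map(_._1._1)) // List(k2, k3, k1)
  }
}
```

Because the key is an Option[Long], the standard library already supplies the Ordering, so no custom Ordering for the time type is needed for this variant.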

Answer 3

If you need to sort the Iterable inside the RDD:

import org.apache.spark.rdd.RDD
import org.joda.time.DateTime

val rdd: RDD[((String, Int),
             Iterable[((String, DateTime, Int, Int),
                       (String, DateTime, String, String))])] = ???

// Explicit ordering for Joda DateTime; compareTo returns 0 for equal instants.
val dateOrdering = new Ordering[DateTime] {
  override def compare(a: DateTime, b: DateTime): Int = a.compareTo(b)
}

// Sort each key's values by the time field of the second tuple.
val sorted = rdd.mapValues(v => v.toArray
                    .sortBy(x => x._2._2)(dateOrdering))

Answer 4

Using Python:

sortedRDD = unsortedRDD.sortBy(lambda x: x[1][1], ascending=False)

This sorts in descending order.

Sorting by value in a Spark pair RDD after a join
