How to sort within partitions (and avoid sorting across partitions) using the RDD API?

Date: 2017-04-11 07:08:40

Tags: apache-spark

The default behavior of the Hadoop MapReduce shuffle is to sort the shuffle keys within each partition, but not across partitions (it is total ordering that makes keys sorted across partitions).

I would like to ask how to achieve the same thing with Spark RDDs (sort within partitions, but not across partitions):

  1. RDD's sortByKey method performs a total ordering.
  2. RDD's repartitionAndSortWithinPartitions sorts within partitions but not across them; unfortunately, it adds an extra repartition step (see the sketch after this list).
  3. Is there a direct way to sort within partitions but not across them?
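
For reference, here is a minimal sketch of point 2, assuming a pair RDD, since repartitionAndSortWithinPartitions is only defined for RDDs of key/value pairs with an ordered key:

import org.apache.spark.HashPartitioner

// repartitionAndSortWithinPartitions needs (key, value) pairs.
val pairs = sc.parallelize(Seq("e", "d", "f", "b", "c", "a").map((_, 1)))

// Sorts keys within each partition as part of the shuffle, with no
// ordering across partitions, at the cost of an extra repartition.
val sorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(2))

// glom() gathers each partition into an array so the per-partition
// ordering can be inspected.
sorted.glom().collect().foreach(p => println(p.map(_._1).mkString(", ")))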

2 answers:

Answer 0 (score: 10):

You can use Dataset's sortWithinPartitions method:

import spark.implicits._

sc.parallelize(Seq("e", "d", "f", "b", "c", "a"), 2)
  .toDF("text")
  .sortWithinPartitions($"text")
  .show

+----+
|text|
+----+
|   d|
|   e|
|   f|
|   a|
|   b|
|   c|
+----+

In general, the shuffle is an important factor when sorting partitions, because the shuffle structures are reused for the sort instead of loading all of the data into memory at once.

Answer 1 (score: 0):

I've never had this need before, but my first guess would be to use any of the *Partition* methods (e.g. foreachPartition or mapPartitions) to do the sorting within every partition.

Since they give you a Scala Iterator, you can call it.toSeq and then apply any of Seq's sorting methods, e.g. sortBy, sortWith, or sorted.
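
A minimal sketch of this idea (note that toSeq materializes the whole partition in memory, which the shuffle-based approaches above avoid):

// Sort each partition independently; no shuffle is triggered.
val sortedWithin = sc
  .parallelize(Seq("e", "d", "f", "b", "c", "a"), 2)
  .mapPartitions(_.toSeq.sorted.iterator)

// glom() gathers each partition into an array for inspection.
sortedWithin.glom().collect().foreach(p => println(p.mkString(", ")))

For pair RDDs you can also pass preservesPartitioning = true to mapPartitions so the existing partitioner is kept.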