The default behavior of the Hadoop MapReduce shuffle is to sort keys within each partition, but not across partitions (it is a total ordering that makes keys sorted across partitions).
I'm asking how to achieve the same thing with a Spark RDD (sorting within each partition, but not across partitions).
The sortByKey method performs a total ordering. It does sort within partitions, but it also sorts across them, and unfortunately it adds an extra repartitioning step. Is there a direct way to sort within partitions without sorting across them?
Answer 0 (score: 10)
You can use a Dataset and its sortWithinPartitions method:
import spark.implicits._
sc.parallelize(Seq("e", "d", "f", "b", "c", "a"), 2)
.toDF("text")
.sortWithinPartitions($"text")
.show
+----+
|text|
+----+
| d|
| e|
| f|
| a|
| b|
| c|
+----+
In general, the shuffle is an important part of sorting partitions, because it reuses the shuffle structures to sort without loading all the data into memory at once.
Answer 1 (score: 0)
I've never had this need before, but my first guess would be to use any of the *Partition* methods (e.g. foreachPartition or mapPartitions) to do the sorting within every partition. Since they give you a Scala Iterator, you could use it.toSeq and then apply any of the sorting methods of Seq, e.g. sortBy, sortWith or sorted.
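A minimal sketch of that approach, without requiring a Spark cluster: the sortPartition function below is the kind of function you would pass to rdd.mapPartitions, but here the two partitions are simulated as plain Scala collections so the per-partition logic can be run standalone. The names partitions and sortPartition are illustrative only, not part of any Spark API.

```scala
// Simulate two RDD partitions as plain Scala collections; the same
// data as in the sortWithinPartitions example above.
val partitions = Seq(Seq("e", "d", "f"), Seq("b", "c", "a"))

// Sort one partition's iterator independently. In Spark this function
// would be passed to rdd.mapPartitions(sortPartition).
def sortPartition(it: Iterator[String]): Iterator[String] =
  it.toSeq.sorted.iterator

// Each partition is ordered internally, but keys are not
// ordered across partitions -- no total ordering is imposed.
val result = partitions.map(p => sortPartition(p.iterator).toList)
println(result) // List(List(d, e, f), List(a, b, c))
```

Note that it.toSeq materializes the whole partition in memory before sorting, so unlike sortWithinPartitions this does not reuse the shuffle's spill-to-disk machinery; it is only practical when each partition fits in memory.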