The default behavior of the Hadoop MapReduce shuffle is to sort keys within each partition, but not across partitions (it is a total ordering that makes keys sorted across partitions).
I'm asking how to achieve the same thing with a Spark RDD (sorting within each partition, but not across partitions).
The sortByKey method performs a total ordering. It does sort within partitions, but it also sorts across them, and unfortunately it adds an extra repartitioning step. Is there a direct way to sort within partitions without sorting across them?
Answer 0 (score: 10)
You can use a Dataset and its sortWithinPartitions method:
import spark.implicits._
sc.parallelize(Seq("e", "d", "f", "b", "c", "a"), 2)
.toDF("text")
.sortWithinPartitions($"text")
.show
+----+
|text|
+----+
| d|
| e|
| f|
| a|
| b|
| c|
+----+
In general, the shuffle is an important part of sorting partitions, because it reuses the shuffle structures to sort without loading all the data into memory at once.
Answer 1 (score: 0)
I've never had this need before, but my first guess would be to use any of the *Partition* methods (e.g. foreachPartition or mapPartitions) to do the sorting within every partition. Since they give you a Scala Iterator, you could use it.toSeq and then apply any of the sorting methods of Seq, e.g. sortBy, sortWith or sorted.
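A minimal sketch of that approach, without requiring a Spark cluster: the sortPartition function below is the kind of function you would pass to rdd.mapPartitions, but here the two partitions are simulated as plain Scala collections so the per-partition logic can be run standalone. The names partitions and sortPartition are illustrative only, not part of any Spark API.

```scala
// Simulate two RDD partitions as plain Scala collections; the same
// data as in the sortWithinPartitions example above.
val partitions = Seq(Seq("e", "d", "f"), Seq("b", "c", "a"))

// Sort one partition's iterator independently. In Spark this function
// would be passed to rdd.mapPartitions(sortPartition).
def sortPartition(it: Iterator[String]): Iterator[String] =
  it.toSeq.sorted.iterator

// Each partition is ordered internally, but keys are not
// ordered across partitions -- no total ordering is imposed.
val result = partitions.map(p => sortPartition(p.iterator).toList)
println(result) // List(List(d, e, f), List(a, b, c))
```

Note that it.toSeq materializes the whole partition in memory before sorting, so unlike sortWithinPartitions this does not reuse the shuffle's spill-to-disk machinery; it is only practical when each partition fits in memory.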