Question

Spark v2.4 no Hive

Spark benefits from bucketBy because it knows the DataFrame already has the right partitioning. What about sortBy?

spark.range(100, numPartitions=1).write.bucketBy(3, 'id').sortBy('id').saveAsTable('df')

# No need to `repartition`.
spark.table('df').repartition(3, 'id').explain()
# == Physical Plan ==
# *(1) FileScan parquet default.df2[id#33620L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[df], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 3 out of 3

# Still need to `sortWithinPartitions`.
spark.table('df').sortWithinPartitions('id').explain()
# == Physical Plan ==
# *(1) Sort [id#33620L ASC NULLS FIRST], false, 0
# +- *(1) FileScan parquet default.df2[id#33620L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[df], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 3 out of 3

So the extra repartition is elided, but the sortWithinPartitions is not. Is sortBy useful at all? Can we use sortBy to speed up table joins?

Answer 1

Short answer: there is no benefit from sortBy in persistent tables (at least not currently).

Longer answer:

Spark and Hive do not implement the same semantics or operational specifications for bucketing, even though Spark can save a bucketed DataFrame into a Hive table.

First, the unit of bucketing differs between the two frameworks: a single file per bucket (Hive) versus a collection of files per bucket (Spark).

Second, in Hive each bucket is globally sorted, which makes it possible to optimize queries that read the data.

In Spark, until the issue https://issues.apache.org/jira/browse/SPARK-19256 is (hopefully) resolved, each file is sorted individually, but the bucket as a whole is not globally sorted.

Thus, since the sorting is not global, there is no benefit from sortBy on tables.
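To illustrate the difference between per-file and per-bucket ordering, here is a minimal pure-Python sketch (no Spark involved). The two lists are hypothetical stand-ins for two files that belong to the same bucket; the names `file_a`, `file_b` and `is_sorted` are made up for this example and are not part of any Spark API.

```python
import heapq

# Hypothetical contents of two files that belong to the same bucket.
# Each file is sorted on its own.
file_a = [0, 3, 9, 12]
file_b = [1, 6, 7, 15]


def is_sorted(values):
    """Return True if the values are in non-decreasing order."""
    return all(x <= y for x, y in zip(values, values[1:]))


# Reading the bucket one file after another does not give a sorted stream,
# so a reader cannot assume the bucket is ordered.
bucket_scan = file_a + file_b

# A k-way merge of the individually sorted files would restore a global order.
merged = list(heapq.merge(file_a, file_b))

print(is_sorted(file_a), is_sorted(file_b))  # each file is sorted
print(is_sorted(bucket_scan))                # the concatenated bucket is not
print(is_sorted(merged))                     # the merged stream is
```

Under this reading, a reader that sees several files per bucket would have to either merge them or sort again before it could rely on the order.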

I hope this answers your question.

Can Spark benefit from `sortBy` in persisted tables?
