Question

I have a DataFrame df that I want to partition by date (a column in df). I have the following code:

df.write.partitionBy('date').mode('overwrite').orc('path')

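The path above then contains many folders such as date=2018-10-08, and so on. But the folder date=2018-10-08 contains 5 files, and I want to reduce that to a single file inside the date=2018-10-08 folder. How can I do this? I still want the output partitioned by date.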

Thanks in advance!

Answer 1

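To get one file per partition folder, you need to repartition the data by the partition column before writing. This shuffles the data so that all rows for a given date end up in the same DataFrame/RDD partition.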
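A minimal sketch of that approach, assuming df is a Spark DataFrame with a date column as in the question. The helper name write_one_file_per_date and its parameters are made up for illustration; the calls it chains (repartition, write.partitionBy, mode, orc) are the standard DataFrame and DataFrameWriter methods.

```python
def write_one_file_per_date(df, path, partition_col="date"):
    """Write df as ORC, partitioned by partition_col, one file per folder.

    Hypothetical helper: df is expected to be a Spark DataFrame.
    """
    (
        df.repartition(partition_col)   # shuffle so each date lives in exactly one in-memory partition
          .write
          .partitionBy(partition_col)   # still produces date=YYYY-MM-DD folders on disk
          .mode("overwrite")
          .orc(path)
    )

# Usage, with the names from the question:
# write_one_file_per_date(df, "path")
```

Because each date is then written by a single task, a date with a very large number of rows may write slowly. This should also hold only as long as no setting that splits output files (such as a cap on records per file) is in effect.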

