Question

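When I use Spark HiveContext to run SQL such as insert overwrite a select * from b, the table's HDFS directory ends up with a lot of small files (400+), and many of them are empty. So I tried using coalesce to reduce the number of files. Sample code: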

val df = hiveContext.sql("insert overwrite a select * from b")
df.coalesce(50).collect

But there are still 400+ output files, so it looks like coalesce has no effect.

Can anyone help with this?

Answer 1

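Your example does not merge the output files because coalesce is called after the SQL containing the insert has already run, and it is applied to the result of that insert statement (which I believe is an empty DataFrame), not to the data being written.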

Try rewriting the code along these lines:

hiveContext.sql("select * from b").coalesce(50).write.mode("overwrite").saveAsTable("a")
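
For context, here is a slightly fuller sketch of the same idea. It assumes Spark 1.4 or later with the spark-hive module on the classpath, and that table a already exists with the same column order as b; the object name CoalesceInsert is just a placeholder. It uses insertInto rather than saveAsTable, on the assumption that you want to keep the existing Hive table definition, since saveAsTable in overwrite mode may recreate the table in Spark's own default format.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext

// Hypothetical application object; the name is illustrative only.
object CoalesceInsert {
  def main(args: Array[String]): Unit = {
    // The master URL is expected to come from spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("coalesce-insert"))
    val hiveContext = new HiveContext(sc)

    // Build the DataFrame from the SELECT alone; nothing is written yet.
    val source = hiveContext.sql("select * from b")

    // Reduce the DataFrame to at most 50 partitions (no shuffle),
    // so the write below runs with at most 50 tasks.
    val merged = source.coalesce(50)

    // Overwrite the contents of the existing table a.
    // insertInto matches columns by position, not by name.
    merged.write.mode(SaveMode.Overwrite).insertInto("a")

    sc.stop()
  }
}
```

The important point is the order of operations: coalesce has to sit between the select and the write, because the number of partitions at write time is what determines how many files the write tasks produce. If the partitions that are merged turn out to be very uneven in size, repartition(50) is an alternative that rebalances them at the cost of a shuffle.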