Question

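I want to save JSON data to a single file in HDFS. My current approach is to use Spark to save the data to HDFS, merge it into a local file (local_tmp_file), and then move that file back to HDFS (dest):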

getmerge_command = 'hdfs dfs -getmerge ' + dest + ' ' + local_tmp_file
move_command = 'hdfs dfs -moveFromLocal ' + local_tmp_file + ' ' + dest

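The problem appears when many processes run at the same time: the temporary local files fill up the disk. Does anyone have a solution for this?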

Answer 1

Use repartition(1) when saving the data, so that Spark writes a single output partition:

df.repartition(1).write.mode("overwrite").format("json").save("test_file")

Answer 2

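When you are only reducing the number of partitions, coalesce() is the better choice. It is an optimized version of repartition() for that case and avoids a full shuffle of the data.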

df.coalesce(1).write.mode("overwrite").format("json").save("test_file")

For more details on repartition and coalesce, see: Spark - repartition() vs coalesce()
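Note that with either repartition(1) or coalesce(1), save("test_file") still creates a directory named test_file that holds one part-* file plus markers such as _SUCCESS. If a plain single file is needed, one possible approach is to rename that part file within HDFS, so nothing touches the local disk. The following is a minimal sketch, not a definitive implementation: the function names, tmp_dir, and dest_file are hypothetical, and it reaches the Hadoop FileSystem API through PySpark's internal _jvm and _jsc attributes, which may change between versions.

```python
import fnmatch


def pick_part_file(names):
    """Return the single data file Spark wrote, ignoring _SUCCESS and .crc files."""
    parts = [n for n in names
             if fnmatch.fnmatch(n, "part-*") and not n.endswith(".crc")]
    if len(parts) != 1:
        raise ValueError(f"expected exactly one part file, found {len(parts)}")
    return parts[0]


def write_single_json(spark, df, tmp_dir, dest_file):
    """Write df as one JSON file at dest_file without using local disk.

    tmp_dir and dest_file are hypothetical HDFS paths chosen by the caller.
    """
    # One partition means Spark writes exactly one part file into tmp_dir.
    df.coalesce(1).write.mode("overwrite").format("json").save(tmp_dir)

    # Assumption: _jvm and _jsc are internal PySpark attributes (py4j gateway).
    sc = spark.sparkContext
    Path = sc._jvm.org.apache.hadoop.fs.Path
    fs = Path(tmp_dir).getFileSystem(sc._jsc.hadoopConfiguration())

    names = [status.getPath().getName() for status in fs.listStatus(Path(tmp_dir))]
    part = pick_part_file(names)

    # Remove any previous output, then rename the part file to the final path.
    dest = Path(dest_file)
    if fs.exists(dest):
        fs.delete(dest, False)
    if not fs.rename(Path(tmp_dir, part), dest):
        raise IOError(f"could not rename {part} to {dest_file}")

    # Clean up the now-empty staging directory (recursive delete).
    fs.delete(Path(tmp_dir), True)
```

A call might look like write_single_json(spark, df, "/tmp/test_file_staging", "/data/test_file.json"), where both paths are placeholders. Because many processes run concurrently in the question's setup, each one would need its own tmp_dir to avoid collisions.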