Question

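I have a folder in HDFS that contains 202 part files, which are the output of a job. Their total size is 195 GB. I want to merge all of these files into a single file in HDFS. Is there a way to do this very quickly? We are on the Microsoft Azure cloud platform, and the Spark distribution we use is HDInsight.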

We have tried the following commands, and every one of them takes a very long time (more than 4 hours). Please help.

sc.textFile("/Dataproviders/Temp/MDASHistory/KAI/Order/Output2/*").coalesce(1).saveAsTextFile("/Dataproviders/Temp/MDASHistory/KAI/Order/MergedFileSp.out") 

hdfs dfs -getmerge /Dataproviders/Temp/MDASHistory/KAI/Order/Output2/* final.dat

org.talend.hadoop.fs.FileUtil.copyMerge(fs,
                        sourceDirPath_tFileOutputDelimited_1, fs,
                        targetFilePath_tFileOutputDelimited_1, false, job,
                        null, headerByteCount_tFileOutputDelimited_1);
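
For reference, all three attempts come down to the same thing: every part file is streamed, one after another, through a single writer. Below is a minimal sketch of that byte-level concatenation. It is only an illustration under stated assumptions: it uses `java.nio` on a local directory instead of Hadoop's `FileSystem`, it assumes the part files are named with a `part-` prefix, and the class name `PartFileMerger` and the method `mergeParts` are hypothetical.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: illustrates single-writer concatenation on a local
// directory. It does not use the Hadoop FileSystem API.
final class PartFileMerger {

    // Appends every regular file in sourceDir whose name starts with "part-"
    // (an assumed naming convention) to target, in name order.
    // Returns the total number of bytes written.
    static long mergeParts(Path sourceDir, Path target) throws IOException {
        List<Path> parts;
        try (Stream<Path> listing = Files.list(sourceDir)) {
            parts = listing
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().startsWith("part-"))
                    .sorted() // name order keeps part-00000, part-00001, ... in sequence
                    .collect(Collectors.toList());
        }

        long bytesWritten = 0;
        try (OutputStream out = Files.newOutputStream(target,
                StandardOpenOption.CREATE,
                StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.WRITE)) {
            // One output stream for all parts: the copies run strictly sequentially.
            for (Path part : parts) {
                bytesWritten += Files.copy(part, out);
            }
        }
        return bytesWritten;
    }
}
```

Because a single stream carries all 195 GB here, the elapsed time is bounded by the throughput of that one reader and writer, which is consistent with the multi-hour run times we are seeing.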

Merging part files quickly in Apache Spark

0 answers: