Question

I have a Hadoop job that writes many part files to HDFS, all into one folder.

For example:

/output/s3/2014-09-10/part...

What is the best way to upload these parts to a single file in S3 using the S3 Java API?

For example:

s3:/jobBucket/output-file-2014-09-10.csv

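One possible solution is to merge the parts and write the result to a single file in HDFS first, but that doubles the I/O. Using a single reducer is not an option either.

Thanks,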

Answer 1

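Try the FileUtil#copyMerge method; it lets you copy data between two file systems. I also found the S3DistCp tool, which can copy data from HDFS to Amazon S3. You can specify the --groupBy,(.*) option to merge files.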

Answer 2

Spark process code snippet

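import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// awsAccessKey and awsSecretKey are assumed to be defined elsewhere (not shown here).
void sparkProcess() throws IOException, URISyntaxException {
    SparkConf sparkConf = new SparkConf().setAppName("name");
    JavaSparkContext sc = new JavaSparkContext(sparkConf);
    Configuration hadoopConf = sc.hadoopConfiguration();
    hadoopConf.set("fs.s3.awsAccessKeyId", awsAccessKey);
    hadoopConf.set("fs.s3.awsSecretAccessKey", awsSecretKey);
    String folderPath = "s3://bucket/output/folder";
    String mergedFilePath = "s3://bucket/output/result.txt";
    BatchFileUtil.copyMerge(hadoopConf, folderPath, mergedFilePath);
}

// Helper in BatchFileUtil: merges every file under srcPath into the single file dstPath.
public static boolean copyMerge(Configuration hadoopConfig, String srcPath, String dstPath) throws IOException, URISyntaxException {
    // Resolve the file system from the source URI; here source and destination share it.
    FileSystem fs = FileSystem.get(new URI(srcPath), hadoopConfig);
    // deleteSource = false keeps the part files; addString = null inserts nothing between parts.
    return FileUtil.copyMerge(fs, new Path(srcPath), fs, new Path(dstPath), false, hadoopConfig, null);
}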

Answer 3

Read the files with the Java HDFS API, combine them into a single InputStream using the standard Java stream classes, and then pass that stream to

http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/model/PutObjectRequest.html
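A minimal sketch of the "combine into a single InputStream" step, using only the JDK. It is not a complete uploader: local files read through java.nio stand in for HDFS, and the class name PartConcat and its methods are hypothetical names made up for this illustration. With Hadoop you would open each part with FileSystem.open(path) instead of Files.newInputStream.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.Iterator;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper: presents every "part*" file in a directory as one InputStream.
class PartConcat {

    // Collects the part files and sorts them by path so the merged order is deterministic.
    static List<Path> listParts(Path dir) throws IOException {
        List<Path> parts = new ArrayList<>();
        try (Stream<Path> entries = Files.list(dir)) {
            entries.filter(p -> p.getFileName().toString().startsWith("part"))
                   .forEach(parts::add);
        }
        Collections.sort(parts);
        return parts;
    }

    // Chains the parts into one stream. Parts are opened one at a time:
    // the next part is opened only after the previous one is exhausted.
    static InputStream openMerged(Path dir) throws IOException {
        Iterator<Path> it = listParts(dir).iterator();
        Enumeration<InputStream> streams = new Enumeration<InputStream>() {
            @Override
            public boolean hasMoreElements() {
                return it.hasNext();
            }

            @Override
            public InputStream nextElement() {
                try {
                    return Files.newInputStream(it.next());
                } catch (IOException e) {
                    // Enumeration cannot throw checked exceptions, so wrap it.
                    throw new UncheckedIOException(e);
                }
            }
        };
        return new SequenceInputStream(streams);
    }
}
```

The stream returned by openMerged could then be handed to the PutObjectRequest(bucketName, key, input, metadata) constructor from the link above. As far as I know, the 1.x SDK buffers the whole stream in memory when no content length is set on the metadata, so for large outputs a multipart upload is probably the safer choice.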

See also

https://stackoverflow.com/a/11116119/1586965

How do I upload multiple files from HDFS to a single S3 file?
