Writing to S3 from Spark without access and secret keys

Asked: 2017-01-19 09:09:34

Tags: apache-spark amazon-s3 amazon-ec2 permissions

Our EC2 servers are configured to allow access to my-bucket when using the DefaultAWSCredentialsProviderChain, so the following code using the plain AWS SDK works fine:

// Credentials are resolved automatically (env vars, instance profile, etc.)
AmazonS3 s3client = new AmazonS3Client(new DefaultAWSCredentialsProviderChain());
s3client.putObject(new PutObjectRequest("my-bucket", "my-object", new File("/path/to/my-file.txt")));

Spark's S3AOutputStream uses the same SDK internally, but trying to upload a file without providing access and secret keys does not work:

sc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
// not setting access and secret key
JavaRDD<String> rdd = sc.parallelize(Arrays.asList("hello", "stackoverflow"));
rdd.saveAsTextFile("s3a://my-bucket/my-file-txt");

This gives:

Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: 25DF243A166206A0, AWS Error Code: null, AWS Error Message: Forbidden, S3 Extended Request ID: Ki5SP11xQEMKb0m0UZNXb4FhfWLMdbehbknQ+jeZuO/wjhwurjkFoEYVfrQfW1KIq435Lo9jPkw=  
    at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)  
    at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)  
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
    at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:976)  
    at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:956)  
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:892)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:77)
    at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
    at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:130)
    <truncated>

Is there a way to force Spark to use the default credentials provider chain instead of relying on an access and secret key?

1 Answer:

Answer 0 (score: 1)

Technically, that's Hadoop's s3a output stream, not Spark's. Look at the stack trace to see who to file bug reports against :)

And s3a does support instance credentials from Hadoop 2.7+ (proof).
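
A minimal sketch of what that looks like, assuming the Hadoop 2.7+ JARs are on the classpath. Note the fs.s3a.aws.credentials.provider key for pinning the provider explicitly is an assumption about your Hadoop build (it appeared in later versions, around 2.8); on plain 2.7 you rely on S3A's built-in fallback to instance-profile credentials when no keys are set:

// Sketch, not the definitive setup: leave fs.s3a.access.key/fs.s3a.secret.key
// unset so S3A's provider chain falls through to instance-profile credentials.
sc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
// On Hadoop 2.8+ the provider can also be pinned explicitly (key assumed present):
sc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider",
    "com.amazonaws.auth.InstanceProfileCredentialsProvider");
rdd.saveAsTextFile("s3a://my-bucket/my-file-txt");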

If you can't connect, you need the Hadoop 2.7 JARs on your classpath, along with the exact version of the AWS SDK they were built against (1.7.4, as I recall).

Spark has one little feature: if you submit work and you have the AWS_* environment variables set, it picks them up, copies them in as the fs.s3a keys, and so propagates them to your systems.
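
A rough Java sketch of what that copying amounts to, illustrative only and not Spark's actual source; the config key and env-var names are real, but hadoopConf is a stand-in for the Hadoop Configuration Spark builds at submit time:

// Illustrative only: mirrors Spark's env-var pickup, not its actual code.
String accessKey = System.getenv("AWS_ACCESS_KEY_ID");
String secretKey = System.getenv("AWS_SECRET_ACCESS_KEY");
if (accessKey != null && secretKey != null) {
    // hadoopConf is assumed to be the Configuration passed to the job
    hadoopConf.set("fs.s3a.access.key", accessKey);
    hadoopConf.set("fs.s3a.secret.key", secretKey);
}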