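I am running Apache Spark (2.11, 1.5.2) on my local machine with input files stored in AWS S3. This works fine when the files are in a bucket in the Ireland region (eu-west-1).
However, reading files from an S3 bucket located in Frankfurt (eu-central-1) fails with the error message:
The authorization mechanism you have provided is not supported. Please use AWS4-HMAC-SHA256.
How can I use AWS4-HMAC-SHA256?
The detailed error message is: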
Exception in thread "main" org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 GET failed for '/%2myfolder' XML Error Message: <?xml version="1.0" encoding="UTF-8"?><Error><Code>InvalidRequest</Code><Message>The authorization mechanism you have provided is not supported. Please use AWS4-HMAC-SHA256.</Message><RequestId>ECB53FECECD1C910</RequestId><HostId>BmEyVcO/eHZR3IO2Z+8IkEWOn189IBGb2YAgbDxhTu+abuyORCEjHyC14l6nIRVNNnQL2Nyya9I=</HostId></Error>
at org.apache.hadoop.fs.s3.Jets3tFileSystemStore.get(Jets3tFileSystemStore.java:174)
at org.apache.hadoop.fs.s3.Jets3tFileSystemStore.retrieveINode(Jets3tFileSystemStore.java:214)
...
Caused by: org.jets3t.service.S3ServiceException: S3 GET failed for '/%2myfolder' XML Error Message: <?xml version="1.0" encoding="UTF-8"?><Error><Code>InvalidRequest</Code><Message>The authorization mechanism you have provided is not supported. Please use AWS4-HMAC-SHA256.</Message><RequestId>ECB53FECECD1C910</RequestId><HostId>BmEyVcO/eHZR3IO2Z+8IkEWOn189IBGb2YAgbDxhTu+abuyORCEjHyC14l6nIRVNNnQL2Nyya9I=</HostId></Error>
at org.jets3t.service.impl.rest.httpclient.RestS3Service.performRequest(RestS3Service.java:416)
at org.jets3t.service.impl.rest.httpclient.RestS3Service.performRestGet(RestS3Service.java:752)
The code is:
import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;

public class S3Problem {
    public static void main(String[] args) {
        String s3Folder = "s3n://mybucket/myfolder";
        SparkConf conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> myData = sc.textFile(s3Folder).cache();
        long count = myData.count();
        System.out.println("Line count: " + count);
    }
}
AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are provided as environment variables.
Answer 0 (score: 7)
Combining Ewan's and windsource's answers into a complete PySpark script (it works, at least for me):
import findspark
findspark.init()
import pyspark
spark = pyspark.sql.SparkSession.builder \
    .master("local[*]") \
    .appName("Spark") \
    .config("spark.executor.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4=true") \
    .getOrCreate()
# Set the property for the driver. Doesn't work using the same syntax
# as the executor because the jvm has already been created.
spark.sparkContext.setSystemProperty("com.amazonaws.services.s3.enableV4", "true")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.proxy.host", "***")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.proxy.port", "8080")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.access.key", "***")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "***")
test = spark.sparkContext.textFile('s3a://my-bucket/test')
print(test.take(5))
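The endpoint and credential settings above can be gathered into one helper, so that switching regions means changing a single argument. The following is a minimal sketch and not part of the original answer: the function names `s3a_settings` and `apply_settings` are hypothetical, the endpoint is assumed to follow the `s3.<region>.amazonaws.com` pattern used above, and the credentials are assumed to come from the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables mentioned in the question.

```python
import os


def s3a_settings(region, env=None):
    """Return the Hadoop s3a properties used in the script above as a dict.

    Hypothetical helper: it only builds key/value pairs and does not
    touch Spark itself.
    """
    env = os.environ if env is None else env
    return {
        # Assumes the endpoint pattern s3.<region>.amazonaws.com shown above.
        "fs.s3a.endpoint": "s3.{}.amazonaws.com".format(region),
        "fs.s3a.access.key": env["AWS_ACCESS_KEY_ID"],
        "fs.s3a.secret.key": env["AWS_SECRET_ACCESS_KEY"],
    }


def apply_settings(hadoop_conf, settings):
    """Copy each property into any object that exposes set(key, value)."""
    for key, value in settings.items():
        hadoop_conf.set(key, value)
```

With the session from the script above, this could be applied as `apply_settings(spark.sparkContext._jsc.hadoopConfiguration(), s3a_settings("eu-central-1"))`. The `com.amazonaws.services.s3.enableV4` system property still has to be set separately, as in the script.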
Answer 1 (score: 3)
Your path is set to s3://; I think it should be s3n://.
Try changing it, and also set these authentication parameters:
val hadoopConf=sc.hadoopConfiguration
hadoopConf.set("fs.s3n.awsAccessKeyId","key")
hadoopConf.set("fs.s3n.awsSecretAccessKey","secret")
Alternatively, you can try s3a://, but you must then include the hadoop-aws and aws-java-sdk jar files in your CLASSPATH.
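One way to get the two jars onto the classpath without copying them by hand is to let Spark resolve them from Maven through the `spark.jars.packages` setting (or the equivalent `--packages` flag of `spark-submit`). The sketch below only builds the coordinate string. The function name is hypothetical, and the version numbers are placeholders rather than recommendations: the hadoop-aws version should match the Hadoop version your Spark build uses, and the aws-java-sdk version should be the one that hadoop-aws release was built against.

```python
def s3a_packages(hadoop_aws_version, aws_sdk_version):
    """Build a comma-separated Maven coordinate list for spark.jars.packages."""
    coordinates = [
        "org.apache.hadoop:hadoop-aws:{}".format(hadoop_aws_version),
        "com.amazonaws:aws-java-sdk:{}".format(aws_sdk_version),
    ]
    return ",".join(coordinates)


# Placeholder versions, for illustration only.
print(s3a_packages("2.7.3", "1.7.4"))
```

The resulting string can be passed to `spark-submit --packages`, or set as `spark.jars.packages` before the Spark context is created.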
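Answer 2 (score: 0)
The Frankfurt region uses auth v4. A simpler approach is to use the s3n implementation for s3 paths by setting something like the following in core-site.xml (this uses s3n for both s3 and s3n paths):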
<property>
  <name>fs.s3.impl</name>
  <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
</property>
<property>
  <name>fs.s3n.impl</name>
  <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
</property>
However, you should consider upgrading the s3 implementation to s3a, which uses the AWS SDK.
You need to put the hadoop-aws jar and the aws-java-sdk jar (and the third-party jars bundled with it) into your CLASSPATH.
hadoop-aws: http://central.maven.org/maven2/org/apache/hadoop/hadoop-aws/
aws-java-sdk: http://central.maven.org/maven2/com/amazonaws/aws-java-sdk/
Then, in core-site.xml:
<property>
  <name>fs.s3.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>
<property>
  <name>fs.s3a.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>
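If editing core-site.xml is inconvenient, the same properties can usually be handed to Spark directly: Spark copies any setting whose name starts with `spark.hadoop.` into the Hadoop configuration. The sketch below only does the renaming and formats the result as `spark-submit` arguments. The helper names `as_spark_conf` and `as_submit_args` are hypothetical and not part of the original answer.

```python
def as_spark_conf(hadoop_properties):
    """Prefix Hadoop property names with 'spark.hadoop.' for use as Spark settings."""
    return {"spark.hadoop." + name: value
            for name, value in hadoop_properties.items()}


def as_submit_args(spark_conf):
    """Render the settings as a list of spark-submit --conf arguments."""
    args = []
    for name, value in sorted(spark_conf.items()):
        args.extend(["--conf", "{}={}".format(name, value)])
    return args


S3A = "org.apache.hadoop.fs.s3a.S3AFileSystem"
conf = as_spark_conf({"fs.s3.impl": S3A, "fs.s3a.impl": S3A})
print(" ".join(as_submit_args(conf)))
```

The same key/value pairs could also be set on a `SparkConf` before the context is created. The hadoop-aws and aws-java-sdk jars still have to be on the classpath either way.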