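I am trying to get a Spark cluster to read a data source from Amazon S3 cloud storage. It fails with the following error, and I need some help diagnosing the problem: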
>>> sc.textFile("s3a://storage-bucket/s3test.txt").collect()
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: D47397DA8BCB4669, AWS Error Code: null, AWS Error Message: Bad Request, S3 Extended Request ID: /aBi99tozgFEsdRGubDwhriMsNQvl1jLOf8AJquA8VXxzkpPL/LLCWDFQQvYn4snHx5gx66/pXo=
By the way, this works fine:
$ aws s3 cp s3://storage-bucket/s3test.txt ./s3text.txt
download: s3://storage-bucket/s3test.txt to ./s3text.txt
$ cat s3text.txt
Hello S3
More detail from the error message:
Caused by: org.jets3t.service.S3ServiceException: Service Error Message. -- ResponseCode: 403, ResponseStatus: Forbidden, XML Error Message: <?xml version="1.0" encoding="UTF-8"?><Error><Code>SignatureDoesNotMatch</Code><Message>The request signature we calculated does not match the signature you provided. Check your key and signing method.</Message><AWSAccessKeyId>xxxxxxxxxxxxxxxxxx</AWSAccessKeyId><St
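Answer 0 (score: 1)
Can you check fs.s3a.access.key and fs.s3a.secret.key and make sure they match the credentials you used for the aws s3 cp test? This SignatureDoesNotMatch error can show up when the credentials are wrong. Try hadoop fs -ls s3a://storage-bucket/ to exercise the same connector outside Spark.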
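The credential comparison suggested above can be sketched in plain Python. This is only an illustration: the helper names (parse_cli_credentials, s3a_spark_conf, credentials_match) and the sample keys are hypothetical. It assumes the AWS CLI keeps its keys in an INI-style credentials file (by default ~/.aws/credentials) under aws_access_key_id and aws_secret_access_key, and that Spark forwards properties prefixed with spark.hadoop. to the Hadoop configuration.

```python
import configparser


def parse_cli_credentials(text, profile="default"):
    """Return (access_key, secret_key) from the text of an AWS CLI credentials file."""
    # interpolation=None keeps any '%' in a value from being treated as a placeholder.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    section = parser[profile]
    return section["aws_access_key_id"], section["aws_secret_access_key"]


def s3a_spark_conf(access_key, secret_key):
    """Build Spark properties that pass the given keys to the S3A connector."""
    return {
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
    }


def credentials_match(spark_conf, access_key, secret_key):
    """True if the S3A keys in spark_conf equal the keys the CLI uses."""
    expected = s3a_spark_conf(access_key, secret_key)
    return all(spark_conf.get(name) == value for name, value in expected.items())


# Placeholder file content; real text would be read from ~/.aws/credentials.
SAMPLE_CREDENTIALS = """
[default]
aws_access_key_id = AKIAEXAMPLEKEY
aws_secret_access_key = exampleSecret/With+Symbols
"""

if __name__ == "__main__":
    access, secret = parse_cli_credentials(SAMPLE_CREDENTIALS)
    conf = s3a_spark_conf(access, secret)
    print(credentials_match(conf, access, secret))
```

Each entry of the resulting dictionary could then be handed to spark-submit as a --conf key=value pair, so Spark and the aws CLI are guaranteed to use the same keys.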
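Answer 1 (score: 1)
Something is off with your configuration. The S3A connector uses the AWS SDK. If your stack trace contains jets3t, then you have somehow wired the wrong filesystem to it. Remove anything from your source code that sets properties related to fs.s3a.impl, rely on the Hadoop runtime to sort things out, and then try again.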
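As a rough aid for that diagnosis, the sketch below scans a stack trace for the two client libraries' Java package names and lists any configuration entries that pin fs.s3a.impl. The function names and the sample configuration dictionary are hypothetical; the only facts relied on are that jets3t classes live under org.jets3t and AWS SDK classes under com.amazonaws, as the traces in the question show.

```python
def client_libraries_in_trace(stack_trace):
    """Return which S3 client libraries are named in a Java stack trace."""
    found = []
    if "org.jets3t" in stack_trace:
        found.append("jets3t")
    if "com.amazonaws" in stack_trace:
        found.append("aws-sdk")
    return found


def find_s3a_impl_overrides(conf):
    """Return the entries of a property dict that set fs.s3a.impl.

    Matches both the bare Hadoop name and the spark.hadoop.-prefixed form.
    """
    return {name: value for name, value in conf.items()
            if name.endswith("fs.s3a.impl")}


if __name__ == "__main__":
    trace = "Caused by: org.jets3t.service.S3ServiceException: Service Error Message."
    print(client_libraries_in_trace(trace))

    # Hypothetical job configuration, for illustration only.
    job_conf = {
        "spark.hadoop.fs.s3a.impl": "some.other.FileSystemClass",
        "spark.hadoop.fs.s3a.access.key": "AKIAEXAMPLEKEY",
    }
    print(find_s3a_impl_overrides(job_conf))
```

Any property the second helper reports is a candidate for removal before retrying the read.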