我在通过spark-redshift library加载Redshift数据时遇到java.io.IOException: s3n://bucket-name : 400 : Bad Request error
:
Redshift群集和s3桶均位于孟买地区。
这是完整的错误堆栈:
2017-01-13 13:14:22 WARN TaskSetManager:66 - Lost task 0.0 in stage 0.0 (TID 0, master): java.io.IOException: s3n://bucket-name : 400 : Bad Request
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:453)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:427)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.handleException(Jets3tNativeFileSystemStore.java:411)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:181)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at org.apache.hadoop.fs.s3native.$Proxy10.retrieveMetadata(Unknown Source)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:476)
at com.databricks.spark.redshift.RedshiftRecordReader.initialize(RedshiftInputFormat.scala:115)
at com.databricks.spark.redshift.RedshiftFileFormat$$anonfun$buildReader$1.apply(RedshiftFileFormat.scala:92)
at com.databricks.spark.redshift.RedshiftFileFormat$$anonfun$buildReader$1.apply(RedshiftFileFormat.scala:80)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(fileSourceInterfaces.scala:279)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(fileSourceInterfaces.scala:263)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.jets3t.service.impl.rest.HttpException: 400 Bad Request
at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:425)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:279)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRestHead(RestStorageService.java:1052)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectImpl(RestStorageService.java:2264)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectDetailsImpl(RestStorageService.java:2193)
at org.jets3t.service.StorageService.getObjectDetails(StorageService.java:1120)
at org.jets3t.service.StorageService.getObjectDetails(StorageService.java:575)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:174)
... 30 more
这是我的java代码:
SparkContext sparkContext = SparkSession.builder().appName("CreditModeling").getOrCreate().sparkContext();
sparkContext.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
sparkContext.hadoopConfiguration().set("fs.s3a.awsAccessKeyId", fs_s3a_awsAccessKeyId);
sparkContext.hadoopConfiguration().set("fs.s3a.awsSecretAccessKey", fs_s3a_awsSecretAccessKey);
sparkContext.hadoopConfiguration().set("fs.s3a.endpoint", "s3.ap-south-1.amazonaws.com");
SQLContext sqlContext=new SQLContext(sparkContext);
Dataset dataset= sqlContext
.read()
.format("com.databricks.spark.redshift")
.option("url", redshiftUrl)
.option("query", query)
.option("aws_iam_role", aws_iam_role)
.option("tempdir", "s3a://bucket-name/temp-dir")
.load();
通过执行以下更改(引用this),我能够解决火花本地模式上的问题:
1)我已将jets3t jar替换为0.9.4
2)更改了jets3t配置属性以支持aws4版本存储桶,如下所示:
Jets3tProperties myProperties = Jets3tProperties.getInstance(Constants.JETS3T_PROPERTIES_FILENAME);
myProperties.setProperty("s3service.s3-endpoint", "s3.ap-south-1.amazonaws.com");
myProperties.setProperty("storage-service.request-signature-version", "AWS4-HMAC-SHA256");
myProperties.setProperty("uploads.stream-retry-buffer-size", "2147483646");
但现在我正在尝试以群集模式运行 ( spark独立模式或使用资源管理器MESOS )并且错误再次出现:(
任何帮助将不胜感激!
答案 0 :(得分:3)
实际问题:
更新Jets3tProperties,以支持AWS s3签名版本4,在运行时工作在本地模式但不在集群模式下,因为属性仅在驱动程序JVM上更新,但不在任何执行程序JVM上更新。
解决方案:
我找到了一种解决方法,通过引用this链接来更新所有执行程序上的Jets3tProperties。
通过参考上面的链接,我添加了一个额外的代码片段来更新Jets3tProperties,在.foreachPartition()函数中,它将为任何执行器上创建的第一个分区运行它。
以下是代码:
Dataset dataset= sqlContext
.read()
.format("com.databricks.spark.redshift")
.option("url", redshiftUrl)
.option("query", query)
.option("aws_iam_role", aws_iam_role)
.option("tempdir", "s3a://bucket-name/temp-dir")
.load();
dataset.foreachPartition(rdd -> {
boolean first=true;
if(first){
Jets3tProperties myProperties =
Jets3tProperties.getInstance(Constants.JETS3T_PROPERTIES_FILENAME);
myProperties.setProperty("s3service.s3-endpoint", "s3.ap-south-1.amazonaws.com");
myProperties
.setProperty("storage-service.request-signature-version", "AWS4-HMAC-SHA256");
myProperties.setProperty("uploads.stream-retry-buffer-size", "2147483646");
first = false;
}
});
答案 1 :(得分:1)
该堆栈意味着您正在使用基于jets3t的旧s3n连接器。您正在设置仅适用于较新的S3a的权限。使用类似s3a://的URL来获取新条目。
鉴于您尝试使用V4 API,您还需要设置fs.s3a.endpoint。如果您尝试使用针对中心端点的v4进行身份验证,那么400 /错误请求响应就是您所看到的响应