Apache Spark writing to S3 fails to move Parquet files out of the temporary folder

Time: 2016-09-29 19:33:42

Tags: apache-spark amazon-s3 spark-dataframe parquet

I have an 8-hour job (Spark 2.0.0) that writes its results to Parquet using the standard approach:

processed_images_df.write.format("parquet").save(s3_output_path) 

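It runs 10000 tasks and writes the results to the _temporary folder. In the last step (after all tasks have completed) it copies the Parquet files out of the _temporary folder, but after copying about 2-3000 files it fails with the error below. At first I thought this was a transient S3 failure, but I reran the job 3 times and got the same error: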

org.apache.spark.SparkException: Job aborted. 
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelationCommand.scala:149) 
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115) 
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115) 
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) 
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:115) 
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60) 
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58) 
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) 
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) 
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) 
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) 
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) 
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) 
        at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) 
        at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86) 
        at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86) 
        at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:487) 
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211) 
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194) 
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
        at java.lang.reflect.Method.invoke(Method.java:606) 
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) 
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) 
        at py4j.Gateway.invoke(Gateway.java:280) 
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128) 
        at py4j.commands.CallCommand.execute(CallCommand.java:79) 
        at py4j.GatewayConnection.run(GatewayConnection.java:211) 
        at java.lang.Thread.run(Thread.java:745) 
Caused by: org.apache.http.NoHttpResponseException: s3-bucket.s3.amazonaws.com:443 failed to respond 
        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:143) 
        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57) 
        at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:261) 
        at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283) 
        at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:259) 
        at org.apache.http.impl.conn.AbstractClientConnAdapter.receiveResponseHeader(AbstractClientConnAdapter.java:232) 
        at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:272) 
        at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:124) 
        at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:686) 
        at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:488) 
        at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:884) 
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82) 
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55) 
        at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:326) 
        at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:277) 
        at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRestPut(RestStorageService.java:1143) 
        at org.jets3t.service.impl.rest.httpclient.RestStorageService.copyObjectImpl(RestStorageService.java:2117) 
        at org.jets3t.service.StorageService.copyObject(StorageService.java:898) 
        at org.jets3t.service.StorageService.copyObject(StorageService.java:943) 
        at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.copy(Jets3tNativeFileSystemStore.java:320) 
        at sun.reflect.GeneratedMethodAccessor40.invoke(Unknown Source) 
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
        at java.lang.reflect.Method.invoke(Method.java:606) 
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190) 
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103) 
        at org.apache.hadoop.fs.s3native.$Proxy20.copy(Unknown Source) 
        at org.apache.hadoop.fs.s3native.NativeS3FileSystem.rename(NativeS3FileSystem.java:645) 
        at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:345) 
        at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:362) 
        at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310) 
        at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:46) 
        at org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:222) 
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelationCommand.scala:144)
        ... 29 more

1 Answer:

Answer 0 (score: 10)

The solution I found for this problem was to upgrade Hadoop to 2.7 and set

spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
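For anyone unsure where that property goes, here is a minimal PySpark sketch of one way to apply it when the session is created. It assumes PySpark running against Hadoop 2.7 or later; the helper names `apply_conf` and `build_session` and the app name are made up for illustration and are not part of Spark.

```python
# Setting from the answer above. The "spark.hadoop." prefix makes Spark copy
# the property into the Hadoop Configuration used by the job.
COMMITTER_CONF = {
    "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2",
}


def apply_conf(builder, conf=COMMITTER_CONF):
    """Call builder.config(key, value) for every entry and return the builder."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


def build_session(app_name="parquet-to-s3"):  # hypothetical app name
    """Create (or reuse) a SparkSession with the committer setting applied."""
    # Imported here so the helpers above can be used without PySpark installed.
    from pyspark.sql import SparkSession

    return apply_conf(SparkSession.builder.appName(app_name)).getOrCreate()
```

With a session from `build_session()`, the write from the question stays unchanged: `processed_images_df.write.format("parquet").save(s3_output_path)`. The same property can also be passed on the command line as `--conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2` to `spark-submit`, which avoids touching the job code.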

In Spark 1.6 there was an alternative version of the file output committer that wrote directly to S3, but it was removed in Spark 2.0.0: https://issues.apache.org/jira/browse/SPARK-10063