I was running a Spark job on YARN, and after about 9 hours the job failed with:
org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:446)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: com.google.api.client.googleapis.json.GoogleJsonResponseException: 410 Gone
{
"code" : 503,
"errors" : [ {
"domain" : "global",
"message" : "Backend Error",
"reason" : "backendError"
} ],
"message" : "Backend Error"
}
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.waitForCompletionAndThrowIfUploadFailed(AbstractGoogleAsyncWriteChannel.java:432)
at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.close(AbstractGoogleAsyncWriteChannel.java:287)
at com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage$WritableByteChannelImpl.close(CacheSupplementedGoogleCloudStorage.java:68)
at java.nio.channels.Channels$1.close(Channels.java:178)
at java.io.FilterOutputStream.close(FilterOutputStream.java:159)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.close(GoogleHadoopOutputStream.java:126)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
at org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:400)
at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:117)
at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:112)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetFileFormat.scala:569)
at org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer$$anonfun$writeRows$4.apply$mcV$sp(WriterContainer.scala:422)
at org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer$$anonfun$writeRows$4.apply(WriterContainer.scala:416)
at org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer$$anonfun$writeRows$4.apply(WriterContainer.scala:416)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1348)
at org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:438)
... 8 more
Suppressed: java.lang.NullPointerException
at org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:147)
at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:113)
at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:112)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetFileFormat.scala:569)
at org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer$$anonfun$writeRows$5.apply$mcV$sp(WriterContainer.scala:440)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1357)
... 9 more
I have seen the solutions in Dataflow jobs fail after a few 410 errors (while writing to GCS) and How to recover from Cloud Dataflow job failed on com.google.api.client.googleapis.json.GoogleJsonResponseException: 410 Gone, but those suggest controlling the sharding on Dataflow (not Dataproc + YARN). I have also seen https://cloud.google.com/storage/docs/json_api/v1/status-codes#410_Gone, which says the broken resumable upload session is not something I can control.
I am using the following dependencies:
<dependency>
  <groupId>com.google.cloud</groupId>
  <artifactId>google-cloud-storage</artifactId>
  <version>1.1.0</version>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_${scala.version}</artifactId>
  <version>2.1.0</version>
</dependency>
<dependency>
  <groupId>com.google.cloud.bigdataoss</groupId>
  <artifactId>gcs-connector</artifactId>
  <version>1.6.0-hadoop2</version>
</dependency>
Is there any sharding, Spark, GCP, or YARN partitioning setting that could help avoid/prevent this exception?
Answer 0 (score: 0)
Fixing a smaller number of shards in Dataflow is equivalent to adding a step in your Spark job, before any output step, that calls myData.repartition(1000), or some other fixed number smaller than the number of partitions that would otherwise occur at that stage (see the sketch below). This really only becomes a problem when the number of partitions is very high (> roughly 10,000).
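A minimal sketch of that repartition-before-write pattern, assuming a DataFrame job that writes partitioned Parquet to GCS (the paths, partition column, and shard count below are placeholders, not taken from the original question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("gcs-write-sketch").getOrCreate()

// Placeholder input; substitute your real source.
val myData = spark.read.parquet("gs://my-bucket/input/")

myData
  .repartition(1000)               // fixed, moderate shard count: fewer concurrent GCS upload sessions
  .write
  .mode("overwrite")
  .partitionBy("date")             // assumed partition column
  .parquet("gs://my-bucket/output/")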
Similarly, to raise the number of task retries, you can add a job property at submit time:
gcloud dataproc jobs submit spark --properties spark.task.maxFailures=20 ...
Or, if you prefer to set it at cluster creation time:
gcloud dataproc clusters create --properties spark:spark.task.maxFailures=20 ...
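The same property can also be set programmatically when the SparkSession is built, assuming you control the application code rather than the submission command (the app name below is a placeholder):

import org.apache.spark.sql.SparkSession

// spark.task.maxFailures must be set before the SparkContext is created to take effect.
val spark = SparkSession.builder()
  .appName("gcs-write-job")                  // placeholder app name
  .config("spark.task.maxFailures", "20")    // allow more task retries on transient GCS errors
  .getOrCreate()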