I am writing a Spark application on AWS EMR to process about 7 TB of data provided by Common Crawl (essentially filtering out the web pages useful for further research) and then write the filtered pages to compressed text files (roughly 300 GB in total). The raw dataset consists of 56,000 sub-files, each about 100 MB compressed. I experimented with a small subset (about 500 sub-files) and it completed successfully. The problem is that whenever I run the application on the full dataset, it always fails with a "File already exists" error. By the way, I am using 30 c4.8xlarge machines (root EBS volume size 20 GB), one of which is the master.

I searched for this error online, and the explanation I found says:

Spark tasks may fail for other reasons. Only after retrying the original failure does it finally throw this "IOException: File already exists".

So I tried to find the root cause. There is an error like "Task failed while writing rows", which I suspect is the real cause, but I could not find any solution online. I have been struggling with this for several days, and blindly trying every possible fix would cost me a lot of money. I would really appreciate it if someone could help me figure out the problem.

The code is as follows:
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

import spark.implicits._ // for .toDF(); the SparkSession is named `spark`

val inputFilePath = "s3://commoncrawl/crawl-data/CC-MAIN-2019-18/segments/*/wet/*warc.wet.gz"

// cf is the Hadoop Configuration passed to the input format
val webPages = sc.newAPIHadoopFile(
    inputFilePath,
    classOf[TextInputFormat],
    classOf[LongWritable],
    classOf[Text],
    cf)
  .map(x => x._2.toString)
  .filter(/* some filters here ... */)

val dfWebPage = webPages.toDF()
dfWebPage.printSchema()
// it prints:
// root
//  |-- value: string (nullable = true)

dfWebPage.write
  .option("maxRecordsPerFile", 50000)
  .option("compression", "gzip")
  .format("text")
  .mode("overwrite")
  .save("s3://bucket-name/output/out")
Here is the log from steps/step-name/stderr.gz:
19/05/22 13:45:14 INFO Client:
client token: N/A
diagnostics: User class threw exception: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
at App$.main(App.scala:145)
at App.main(App.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:678)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1782 in stage 3.0 failed 4 times, most recent failure: Lost task 1782.3 in stage 3.0 (TID 2115, ip-172-31-65-55.ec2.internal, executor 25): org.apache.hadoop.fs.FileAlreadyExistsException: File already exists:s3://zhaoyin1/output/out/part-01782-807a6367-a6e8-4373-bb1a-4aebcc6b0601-c000.txt.gz
at com.amazon.ws.emr.hadoop.fs.s3.upload.plan.RegularUploadPlanner.checkExistenceIfNotOverwriting(RegularUploadPlanner.java:36)
at com.amazon.ws.emr.hadoop.fs.s3.upload.plan.RegularUploadPlanner.plan(RegularUploadPlanner.java:30)
at com.amazon.ws.emr.hadoop.fs.s3.upload.plan.UploadPlannerChain.plan(UploadPlannerChain.java:37)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.create(S3NativeFileSystem.java:601)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:932)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:913)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:810)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.create(EmrFileSystem.java:212)
at org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStream(CodecStreams.scala:81)
at org.apache.spark.sql.execution.datasources.text.TextOutputWriter.<init>(TextFileFormat.scala:151)
at org.apache.spark.sql.execution.datasources.text.TextFileFormat$$anon$1.newInstance(TextFileFormat.scala:84)
at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120)
at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:233)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:168)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Below is the log from containers/.../stderr.gz (master):
19/05/22 13:38:51 WARN TaskSetManager: Lost task 1782.0 in stage 3.0 (TID 1788, ip-172-31-69-181.ec2.internal, executor 13): org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:254)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:168)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.StackOverflowError
at java.lang.Character.getType(Character.java:6924)
at java.lang.Character$UnicodeScript.of(Character.java:4479)
at java.util.regex.Pattern$Script.isSatisfiedBy(Pattern.java:3881)
at java.util.regex.Pattern$CharProperty$1.isSatisfiedBy(Pattern.java:3773)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3778)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4252)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4265)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4265)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4265)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4265)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4265)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4265)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4265)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4265)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4265)
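If I read this trace correctly, the retried tasks are dying with a java.lang.StackOverflowError inside java.util.regex (the deep Pattern$Curly.match0 recursion), presumably from one of my filter regexes hitting a very long line, and the "File already exists" error is only the symptom of the retries. Would raising the executor thread stack size be a reasonable first experiment? A minimal sketch of what I have in mind (the -Xss value is a guess, and spark.executor.extraJavaOptions has to be set before the executors are launched):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: give executor task threads a larger stack so deeply
// recursive regex matching does not overflow. 16m is an arbitrary guess.
val conf = new SparkConf()
  .setAppName("cc-wet-filter") // placeholder app name
  .set("spark.executor.extraJavaOptions", "-Xss16m")
val sc = new SparkContext(conf)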
Here is the log from containers/.../stderr.gz (slave):
19/05/22 13:37:14 INFO FileOutputCommitter: File Output Committer Algorithm version is 2
19/05/22 13:37:14 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: true
19/05/22 13:37:14 INFO DirectFileOutputCommitter: Direct Write: ENABLED
19/05/22 13:37:14 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.DirectFileOutputCommitter
19/05/22 13:37:14 INFO S3NativeFileSystem: Opening 's3://commoncrawl/crawl-data/CC-MAIN-2019-18/segments/1555578517682.16/wet/CC-MAIN-20190418141430-20190418163430-00219.warc.wet.gz' for reading
19/05/22 13:37:14 INFO FileOutputCommitter: File Output Committer Algorithm version is 2
19/05/22 13:37:14 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: true
19/05/22 13:37:14 INFO DirectFileOutputCommitter: Direct Write: ENABLED
19/05/22 13:37:14 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.DirectFileOutputCommitter
19/05/22 13:39:04 INFO Executor: Executor is trying to kill task 1032.0 in stage 3.0 (TID 1038), reason: Stage cancelled
19/05/22 13:39:04 INFO Executor: Executor is trying to kill task 1015.0 in stage 3.0 (TID 1021), reason: Stage cancelled
19/05/22 13:39:04 INFO Executor: Executor is trying to kill task 1030.0 in stage 3.0 (TID 1036), reason: Stage cancelled
19/05/22 13:39:04 INFO Executor: Executor is trying to kill task 1028.0 in stage 3.0 (TID 1034), reason: Stage cancelled
19/05/22 13:39:04 INFO Executor: Executor is trying to kill task 1059.0 in stage 3.0 (TID 1065), reason: Stage cancelled
19/05/22 13:39:04 ERROR Utils: Aborting task
org.apache.spark.TaskKilledException
at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:149)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:619)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:241)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:239)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:245)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:168)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
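Another experiment I am considering, to rule out the S3 direct write shown in the executor log ("Direct Write: ENABLED"), is writing the filtered pages to HDFS first and then copying them to S3 with s3-dist-cp in a separate step, so that a retried task does not collide with a part file its failed predecessor already wrote to the final S3 path. A rough sketch (the HDFS path is a placeholder, and I am aware that with only 20 GB of EBS per node the HDFS capacity may be too small for ~300 GB of output):

// Sketch: write to HDFS so a retried task goes through the normal
// commit/rename path instead of overwriting a final S3 object directly.
dfWebPage.write
  .option("maxRecordsPerFile", 50000)
  .option("compression", "gzip")
  .format("text")
  .mode("overwrite")
  .save("hdfs:///tmp/cc-filter-output") // placeholder HDFS path

// Then copy the result to S3 in a separate EMR step, for example:
//   s3-dist-cp --src hdfs:///tmp/cc-filter-output --dest s3://bucket-name/output/out

Would either of these be the right direction, or am I misreading the logs?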