"File already exists" error when writing Spark output to S3 on AWS EMR

Time: 2019-05-23 15:57:42

Tags: apache-spark apache-spark-sql amazon-emr

I am writing a Spark application on AWS EMR to process (essentially, to filter out the web pages that are useful for further research) roughly 7 TB of data provided by Common Crawl, and then write the filtered pages to compressed text files (about 300 GB). The original dataset consists of 56,000 sub-files, each compressed sub-file being roughly 100 MB. I first experimented on a small portion of the data (about 500 sub-files) and everything looked fine (the job completed successfully). The problem is that whenever I run the application on the full dataset, it always fails with a "File already exists" error. By the way, I am using 30 c4.8xlarge machines (root EBS volume size 20 GB), one of which is the master.
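
(For reference, running on a subset just means pointing the input glob at a single crawl segment instead of segments/*; the segment id below is the one that also appears in the container logs further down and is purely illustrative.)

// Subset run: one crawl segment instead of "segments/*"
val testInputPath =
  "s3://commoncrawl/crawl-data/CC-MAIN-2019-18/segments/1555578517682.16/wet/*warc.wet.gz"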

I searched for this error online, and what I found says:

Spark tasks may fail for other reasons first. After retrying the originally failed task, it eventually throws this "IOException: File already exists".

So I tried to find the root cause. There is also an error along the lines of "Task failed while writing rows", which I suspect may be the real cause, but I could not find any solution for it online. I have been struggling with this for several days, and trying every possible fix on the full dataset would cost me a lot of money. I would really appreciate it if someone could help me figure this out.

The code is as follows:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val inputFilePath = "s3://commoncrawl/crawl-data/CC-MAIN-2019-18/segments/*/wet/*warc.wet.gz"
// `cf` is a Hadoop Configuration (org.apache.hadoop.conf.Configuration) built elsewhere in the app
val webPages = sc.newAPIHadoopFile(
                inputFilePath,
                classOf[TextInputFormat],
                classOf[LongWritable],
                classOf[Text], cf)
    .map(x => x._2.toString)           // keep only the page text
    .filter(/* some filters here ... */)

val dfWebPage = webPages.toDF()   // .toDF() needs import spark.implicits._ in scope

dfWebPage.printSchema()
// it prints: 
// root
//  |-- value: string (nullable = true)

dfWebPage.write.option("maxRecordsPerFile", 50000).format("text")
    .option("compression", "gzip")
    .mode("overwrite").save("s3://bucket-name/output/out")

Here is the log file from steps/step-name/stderr.gz:

19/05/22 13:45:14 INFO Client: 
     client token: N/A
     diagnostics: User class threw exception: org.apache.spark.SparkException: Job aborted.
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
    at App$.main(App.scala:145)
    at App.main(App.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:678)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1782 in stage 3.0 failed 4 times, most recent failure: Lost task 1782.3 in stage 3.0 (TID 2115, ip-172-31-65-55.ec2.internal, executor 25): org.apache.hadoop.fs.FileAlreadyExistsException: File already exists:s3://zhaoyin1/output/out/part-01782-807a6367-a6e8-4373-bb1a-4aebcc6b0601-c000.txt.gz
    at com.amazon.ws.emr.hadoop.fs.s3.upload.plan.RegularUploadPlanner.checkExistenceIfNotOverwriting(RegularUploadPlanner.java:36)
    at com.amazon.ws.emr.hadoop.fs.s3.upload.plan.RegularUploadPlanner.plan(RegularUploadPlanner.java:30)
    at com.amazon.ws.emr.hadoop.fs.s3.upload.plan.UploadPlannerChain.plan(UploadPlannerChain.java:37)
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.create(S3NativeFileSystem.java:601)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:932)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:913)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:810)
    at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.create(EmrFileSystem.java:212)
    at org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStream(CodecStreams.scala:81)
    at org.apache.spark.sql.execution.datasources.text.TextOutputWriter.<init>(TextFileFormat.scala:151)
    at org.apache.spark.sql.execution.datasources.text.TextFileFormat$$anon$1.newInstance(TextFileFormat.scala:84)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:233)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:168)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Here is the log from containers/.../stderr.gz (master):

19/05/22 13:38:51 WARN TaskSetManager: Lost task 1782.0 in stage 3.0 (TID 1788, ip-172-31-69-181.ec2.internal, executor 13): org.apache.spark.SparkException: Task failed while writing rows.
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:254)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:168)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.StackOverflowError
    at java.lang.Character.getType(Character.java:6924)
    at java.lang.Character$UnicodeScript.of(Character.java:4479)
    at java.util.regex.Pattern$Script.isSatisfiedBy(Pattern.java:3881)
    at java.util.regex.Pattern$CharProperty$1.isSatisfiedBy(Pattern.java:3773)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3778)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4252)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4265)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4265)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4265)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4265)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4265)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4265)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4265)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4265)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4265)
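
The Caused by above is a java.lang.StackOverflowError raised inside java.util.regex, i.e. the regex recursion on some input line overflows the executor thread's stack while the task is producing rows for the writer. That would also explain the FileAlreadyExistsException: with "Direct Write: ENABLED" (see the slave log below), the failed attempt has already created its part file directly on S3, so the retried attempt immediately hits "File already exists". If the stack overflow is really the root cause, one possible (untested) mitigation would be to launch the application with a larger thread stack on the executors, for example:

import org.apache.spark.sql.SparkSession

// Untested idea: raise the executor JVM's default thread stack size.
// The 16m value is a guess; extraJavaOptions only takes effect at launch time.
val spark = SparkSession.builder()
  .appName("CommonCrawlFilter")
  .config("spark.executor.extraJavaOptions", "-Xss16m")
  .getOrCreate()

Rewriting the offending regex to avoid deeply nested greedy quantifiers (the repeated Pattern$Curly.match0 frames above) would avoid the deep recursion altogether.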

Here is the log from containers/.../stderr.gz (slave):

19/05/22 13:37:14 INFO FileOutputCommitter: File Output Committer Algorithm version is 2
19/05/22 13:37:14 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: true
19/05/22 13:37:14 INFO DirectFileOutputCommitter: Direct Write: ENABLED
19/05/22 13:37:14 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.DirectFileOutputCommitter
19/05/22 13:37:14 INFO S3NativeFileSystem: Opening 's3://commoncrawl/crawl-data/CC-MAIN-2019-18/segments/1555578517682.16/wet/CC-MAIN-20190418141430-20190418163430-00219.warc.wet.gz' for reading
19/05/22 13:37:14 INFO FileOutputCommitter: File Output Committer Algorithm version is 2
19/05/22 13:37:14 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: true
19/05/22 13:37:14 INFO DirectFileOutputCommitter: Direct Write: ENABLED
19/05/22 13:37:14 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.DirectFileOutputCommitter
19/05/22 13:39:04 INFO Executor: Executor is trying to kill task 1032.0 in stage 3.0 (TID 1038), reason: Stage cancelled
19/05/22 13:39:04 INFO Executor: Executor is trying to kill task 1015.0 in stage 3.0 (TID 1021), reason: Stage cancelled
19/05/22 13:39:04 INFO Executor: Executor is trying to kill task 1030.0 in stage 3.0 (TID 1036), reason: Stage cancelled
19/05/22 13:39:04 INFO Executor: Executor is trying to kill task 1028.0 in stage 3.0 (TID 1034), reason: Stage cancelled
19/05/22 13:39:04 INFO Executor: Executor is trying to kill task 1059.0 in stage 3.0 (TID 1065), reason: Stage cancelled
19/05/22 13:39:04 ERROR Utils: Aborting task
org.apache.spark.TaskKilledException
    at org.apache.spark.TaskContextImpl.killTaskIfInterrupted(TaskContextImpl.scala:149)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:36)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:619)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:241)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:239)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:245)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:168)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

0 Answers