For S3

Time: 2017-11-24 06:14:05

Tags: amazon-web-services amazon-s3 pyspark spark-dataframe aws-glue

The Glue ETL job works fine for small S3 input files (~10 GB), but it fails for larger datasets (~200 GB).

Adding the relevant part of the ETL code.

# Converting Dynamic frame to dataframe
df = dropnullfields3.toDF()

# create new partition column
partitioned_dataframe = df.withColumn('part_date', df['timestamp_utc'].cast('date'))

# store the data in parquet format on s3 
partitioned_dataframe.write.partitionBy(['part_date']).format("parquet").save(output_lg_partitioned_dir, mode="append")

The job ran for 4 hours and then threw the following error.

  

文件" script_2017-11-23-15-07-32.py",第49行,in   partitioned_dataframe.write.partitionBy([' part_date'])格式。("拼花&#34)。保存(output_lg_partitioned_dir,   mode ="追加")文件   " /mnt/yarn/usercache/root/appcache/application_1511449472652_0001/container_1511449472652_0001_02_000001/pyspark.zip/pyspark/sql/readwriter.py" ;,   第550行,在保存文件中   " /mnt/yarn/usercache/root/appcache/application_1511449472652_0001/container_1511449472652_0001_02_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py" ;,   第1133行,在调用文件中   " /mnt/yarn/usercache/root/appcache/application_1511449472652_0001/container_1511449472652_0001_02_000001/pyspark.zip/pyspark/sql/utils.py" ;,   第63行,在deco文件中   " /mnt/yarn/usercache/root/appcache/application_1511449472652_0001/container_1511449472652_0001_02_000001/py4j-0.10.4-src.zip/py4j/protocol.py" ;,   第319行,在get_return_value py4j.protocol.Py4JJavaError:错误   在调用o172.save时发生。 :org.apache.spark.SparkException:   工作中止了。在   org.apache.spark.sql.execution.datasources.FileFormatWriter $$ anonfun $写$ 1.适用$ MCV $ SP(FileFormatWriter.scala:147)   在   org.apache.spark.sql.execution.datasources.FileFormatWriter $$ anonfun $写$ 1.适用(FileFormatWriter.scala:121)   在   org.apache.spark.sql.execution.datasources.FileFormatWriter $$ anonfun $写$ 1.适用(FileFormatWriter.scala:121)   在   org.apache.spark.sql.execution.SQLExecution $ .withNewExecutionId(SQLExecution.scala:57)   在   org.apache.spark.sql.execution.datasources.FileFormatWriter $ .WRITE(FileFormatWriter.scala:121)   在   org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:101)   在   org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult $ lzycompute(commands.scala:58)   在   org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)   在   org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)   在   org.apache.spark.sql.execution.SparkPlan $$ anonfun $执行$ 1.适用(SparkPlan.scala:114)   在   org.apache.spark.sql.execution.SparkPlan $$ anonfun $执行$ 1.适用(SparkPlan.scala:114)   在   org.apache.spark.sql.execution.SparkPlan $$ anonfun $ $的executeQuery 1.适用(SparkPlan.scala:135)   在   org.apache.spark.rdd.RDDOperationScope $ .withScope(RDDOperationScope.scala:151)   在   org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)   在   org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)   在   org.apache.spark.sql.execution.QueryExecution.toRdd $ lzycompute(QueryExecution.scala:87)   在   org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)   在   org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:492)   在   org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)   在   org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:198)   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)at   sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)   在   sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)   在java.lang.reflect.Method.invoke(Method.java:498)at   py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)at at   py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)at   py4j.Gateway.invoke(Gateway.java:280)at   py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)   在py4j.commands.CallCommand.execute(CallCommand.java:79)at   py4j.GatewayConnection.run(GatewayConnection.java:214)at   
java.lang.Thread.run(Thread.java:748)引起:   org.apache.spark.SparkException:作业因阶段失败而中止:   3385个任务(1024.1 MB)的序列化结果的总大小更大   而不是spark.driver.maxResultSize(1024.0 MB)at   org.apache.spark.scheduler.DAGScheduler.org $阿帕奇$火花$ $调度$$ DAGScheduler failJobAndIndependentStages(DAGScheduler.scala:1435)   在   org.apache.spark.scheduler.DAGScheduler $$ anonfun $ abortStage $ 1.适用(DAGScheduler.scala:1423)   在   org.apache.spark.scheduler.DAGScheduler $$ anonfun $ abortStage $ 1.适用(DAGScheduler.scala:1422)   在   scala.collection.mutable.ResizableArray $ class.foreach(ResizableArray.scala:59)   在scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)   在   org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)   在   org.apache.spark.scheduler.DAGScheduler $$ anonfun $ handleTaskSetFailed $ 1.适用(DAGScheduler.scala:802)   在   org.apache.spark.scheduler.DAGScheduler $$ anonfun $ handleTaskSetFailed $ 1.适用(DAGScheduler.scala:802)   在scala.Option.foreach(Option.scala:257)at   org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)   在   org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)   在   org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)   在   org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)   在org.apache.spark.util.EventLoop $$ anon $ 1.run(EventLoop.scala:48)at at   org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)   在org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)at   org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)at at   org.apache.spark.SparkContext.runJob(SparkContext.scala:1951)at at   org.apache.spark.sql.execution.datasources.FileFormatWriter $$ anonfun $写$ 1.适用$ MCV $ SP(FileFormatWriter.scala:127)   ......还有30多个

     

End of LogType: stdout

I would greatly appreciate any guidance on how to resolve this issue.

1 Answer:

Answer 0 (score: 2)

Configurable options such as maxResultSize can only be set when the context is instantiated, and Glue provides the context for you (from memory, you cannot instantiate a new one). I don't think you can change the value of this property.
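For reference, in a standalone PySpark job (not Glue, which hands you its own context) the limit would normally be configured before the session exists; a minimal sketch, purely for comparison:

from pyspark.sql import SparkSession

# Sketch only: where spark.driver.maxResultSize would normally be set in a
# plain PySpark application. A Glue job receives its context from the
# service, so this does not apply there. A value of "0" removes the limit.
spark = (SparkSession.builder
         .appName("example-job")
         .config("spark.driver.maxResultSize", "4g")
         .getOrCreate())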

You normally get this error when you collect results larger than the specified size back to the driver. You aren't doing that here, so the error is confusing.
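Purely as an illustration (again, your code does not do this), the usual way to hit that limit is a driver-side action such as collect():

# Illustration only: collect() serializes every row of the DataFrame back to
# the driver process; if that exceeds spark.driver.maxResultSize, Spark
# aborts the job with exactly this error message.
rows = partitioned_dataframe.collect()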

You appear to be generating 3385 tasks, which presumably correspond to the dates in your input file (3385 dates, ~9 years?). You could try writing this file in batches, e.g.

from pyspark.sql.functions import year

partitioned_dataframe = df.withColumn('part_date', df['timestamp_utc'].cast('date'))
for yr in range(2000, 2018):
    # Write one year at a time so each pass handles far fewer partitions/tasks
    year_dataframe = partitioned_dataframe.where(year('part_date') == yr)
    year_dataframe.write \
        .partitionBy('part_date') \
        .format("parquet") \
        .save(output_lg_partitioned_dir, mode="append")

I haven't checked this code; at the very least you need to import pyspark.sql.functions.year (as shown above) for it to work.
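If you want to confirm that the task count really does track the number of distinct dates before batching, a quick check along these lines (not part of the original answer, just a suggestion) would tell you:

# Hypothetical sanity check: count the distinct partition dates and compare
# against the ~3385 tasks reported in the error message.
num_dates = partitioned_dataframe.select('part_date').distinct().count()
print(num_dates)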

When I do data processing with Glue, I've found batching the work to be far more effective than trying to push a large dataset through in one go. The system is good, but hard to debug; stability with big data doesn't come easily.