AWS Glue fails to write Parquet, out of memory

Time: 2019-01-29 08:37:53

Tags: amazon-web-services pyspark apache-spark-sql aws-glue

I think AWS Glue is running out of memory and failing to write the Parquet output ...

  

An error occurred while calling o126.parquet. Job aborted due to stage failure: Task 82 in stage 9.0 failed 4 times, most recent failure: Lost task 82.3 in stage 9.0 (TID 17400, ip-172-31-8-70.ap-southeast-1.compute.internal, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.

There is a more complete log below.

  

Traceback (most recent call last):
  File "script_2019-01-29-06-53-53.py", line 71, in <module>
    .parquet("s3://.../flights2")
  File "/mnt/yarn/usercache/root/appcache/application_1548744646207_0001/container_1548744646207_0001_01_000001/pyspark.zip/pyspark/sql/readwriter.py", line 691, in parquet
  File "/mnt/yarn/usercache/root/appcache/application_1548744646207_0001/container_1548744646207_0001_01_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/mnt/yarn/usercache/root/appcache/application_1548744646207_0001/container_1548744646207_0001_01_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/mnt/yarn/usercache/root/appcache/application_1548744646207_0001/container_1548744646207_0001_01_000001/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o126.parquet.
: org.apache.spark.SparkException: Job aborted.
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:213)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:166)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:166)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:166)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:145)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
    at org.apache.spark.sql.execution.datasources.DataSource.writeInFileFormat(DataSource.scala:435)
    at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:471)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
    at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:508)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 82 in stage 9.0 failed 4 times, most recent failure: Lost task 82.3 in stage 9.0 (TID 17400, ip-172-31-8-70.ap-southeast-1.compute.internal, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:186)

The line that appears to fail is:

.parquet("s3://pinfare-glue/flights2")

My Glue job is below. How can I fix this? I was thinking of deleting some folders from S3 so that Glue processes the data in batches ... but that is not scalable ...

Another thought: maybe I should create a DataFrame for each date and write those smaller partitions in a loop ... but would that be slow?
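A rough sketch of what I have in mind (this assumes finalDf is the joined DataFrame built in the job below, and reuses the illustrative s3://.../flights2 output path):

# Hypothetical sketch: write one querydatetime slice per iteration so each write
# only materializes a fraction of the joined data.
dates = [r["querydatetime"] for r in finalDf.select("querydatetime").distinct().collect()]
for d in dates:
    finalDf \
        .filter(finalDf["querydatetime"] == d) \
        .write \
        .mode("append") \
        .partitionBy("countryName", "querydatetime") \
        .parquet("s3://.../flights2")

Unless finalDf is cached, every iteration re-reads and re-joins the source data, which is the main reason this tends to be slow.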

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.functions import regexp_replace, to_timestamp

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

print(">>> READING ...")
inputGDF = glueContext.create_dynamic_frame.from_catalog(database = "pinfare", table_name = "flights", transformation_ctx="inputGDF")
# inputGDF = glueContext.create_dynamic_frame_from_options(connection_type = "s3", connection_options = {"paths": ["s3://pinfare-actuary-storage-csv"], "recurse": True}, format = "csv", format_options = {"withHeader": True}, transformation_ctx="inputGDF")
print(">>> DONE READ ...")

flightsDf = inputGDF.toDF()
if bool(flightsDf.head(1)):
    df = flightsDf \
        .drop("createdat") \
        .drop("updatedat") \
        .withColumn("agent", flightsDf["agent"].cast("int")) \
        .withColumn("querydestinationplace", flightsDf["querydestinationplace"].cast("int")) \
        .withColumn("querydatetime", regexp_replace(flightsDf["querydatetime"], "-", "").cast("int")) \
        .withColumn("queryoutbounddate", regexp_replace(flightsDf["queryoutbounddate"], "-", "").cast("int")) \
        .withColumn("queryinbounddate", regexp_replace(flightsDf["queryinbounddate"], "-", "").cast("int")) \
        .withColumn("outdeparture", to_timestamp(flightsDf["outdeparture"], "yyyy-MM-ddTHH:mm:ss")) \
        .withColumn("outarrival", to_timestamp(flightsDf["outarrival"], "yyyy-MM-ddTHH:mm:ss")) \
        .withColumn("indeparture", to_timestamp(flightsDf["indeparture"], "yyyy-MM-ddTHH:mm:ss")) \
        .withColumn("inarrival", to_timestamp(flightsDf["inarrival"], "yyyy-MM-ddTHH:mm:ss")) \

    df.createOrReplaceTempView("flights")

    airportsGDF = glueContext.create_dynamic_frame.from_catalog(database = "pinfare", table_name = "airports")
    airportsDF = airportsGDF.toDF()
    airportsDF.createOrReplaceTempView("airports")

    agentsGDF = glueContext.create_dynamic_frame.from_catalog(database = "pinfare", table_name = "agents")
    agentsRawDF = agentsGDF.toDF()
    agentsRawDF.createOrReplaceTempView("agents_raw")

    agentsDF = spark.sql("""
        SELECT id, name, type FROM agents_raw
        WHERE type IN ('Airline', 'TravelAgent')
    """) 
    agentsDF.createOrReplaceTempView("agents")

    finalDf = spark.sql("""
            SELECT /*+ BROADCAST(agents) */ /*+ BROADCAST(airports) */
                f.*, countryName, cityName, airportName, a.name AS agentName,
                CONCAT(f.outboundlegid, '-', f.inboundlegid, '-', f.agent) AS key
            FROM flights f
            LEFT JOIN agents a
            ON f.agent = a.id
            LEFT JOIN airports p
            ON f.querydestinationplace = p.airportId
        """)
    print(">>> DONE PROCESS FLIGHTS")

    print("Writing ...")
    finalDf \
      .write \
      .mode("append") \
      .partitionBy(["countryName", "querydatetime"]) \
      .parquet("s3://.../flights2")
else:
    print("Nothing to write ...")

job.commit()

import boto3
glue_client = boto3.client('glue', region_name='ap-southeast-1')
glue_client.start_crawler(Name='...')

1 Answer:

Answer 0 (score: 1)

If your LEFT JOIN has a 1:N mapping, it will blow up the number of rows in the DF, which can lead to OOM. In Glue there is no provision to set your own infrastructure configuration, such as 64 GB of memory per vCPU. If that is the case, first try the spark.yarn.executor.memoryOverhead option and/or increase the DPUs. Otherwise, partition the stored data with push-down predicates and then process it in a loop over all of it.
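A rough sketch of the push-down-predicate approach (this assumes the flights table is partitioned in the Data Catalog on querydatetime and that month prefixes like the ones below exist; adjust the column and values to the real partition layout):

# Hypothetical sketch: read and process one catalog slice per pass instead of the
# whole table, then append each result to the same Parquet output.
for month in ["2019-01", "2019-02"]:  # assumed partition values
    slicedGDF = glueContext.create_dynamic_frame.from_catalog(
        database="pinfare",
        table_name="flights",
        push_down_predicate="querydatetime LIKE '{}%'".format(month),
        transformation_ctx="inputGDF_" + month)
    # ... apply the same casts and joins as in the question to slicedGDF.toDF(),
    # then .write.mode("append").partitionBy(...).parquet("s3://.../flights2")

Together with a higher spark.yarn.executor.memoryOverhead or more DPUs, this keeps each pass's working set small enough for the executors.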