I think AWS Glue ran out of memory after it failed to write the Parquet output...
An error occurred while calling o126.parquet. Job aborted due to stage failure: Task 82 in stage 9.0 failed 4 times, most recent failure: Lost task 82.3 in stage 9.0 (TID 17400, ip-172-31-8-70.ap-southeast-1.compute.internal, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
A more complete log is below:
Traceback (most recent call last):
  File "script_2019-01-29-06-53-53.py", line 71, in <module>
    .parquet("s3://.../flights2")
  File "/mnt/yarn/usercache/root/appcache/application_1548744646207_0001/container_1548744646207_0001_01_000001/pyspark.zip/pyspark/sql/readwriter.py", line 691, in parquet
  File "/mnt/yarn/usercache/root/appcache/application_1548744646207_0001/container_1548744646207_0001_01_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/mnt/yarn/usercache/root/appcache/application_1548744646207_0001/container_1548744646207_0001_01_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/mnt/yarn/usercache/root/appcache/application_1548744646207_0001/container_1548744646207_0001_01_000001/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o126.parquet.
: org.apache.spark.SparkException: Job aborted.
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:213)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:166)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:166)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:166)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:145)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
    at org.apache.spark.sql.execution.datasources.DataSource.writeInFileFormat(DataSource.scala:435)
    at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:471)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
    at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:508)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 82 in stage 9.0 failed 4 times, most recent failure: Lost task 82.3 in stage 9.0 (TID 17400, ip-172-31-8-70.ap-southeast-1.compute.internal, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:186)
It looks like the failing line is:
.parquet("s3://pinfare-glue/flights2")
My Glue job is shown below. How can I fix this? I'm considering deleting some folders from S3 so that Glue processes the data in smaller batches... but that isn't scalable...
Another thought: maybe I could create a DataFrame for each date and write those smaller partitions in a loop (see the sketch after the script below)... but would that be slow?
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.functions import regexp_replace, to_timestamp
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
print(">>> READING ...")
inputGDF = glueContext.create_dynamic_frame.from_catalog(database = "pinfare", table_name = "flights", transformation_ctx="inputGDF")
# inputGDF = glueContext.create_dynamic_frame_from_options(connection_type = "s3", connection_options = {"paths": ["s3://pinfare-actuary-storage-csv"], "recurse": True}, format = "csv", format_options = {"withHeader": True}, transformation_ctx="inputGDF")
print(">>> DONE READ ...")
flightsDf = inputGDF.toDF()
if bool(flightsDf.head(1)):
    df = flightsDf \
        .drop("createdat") \
        .drop("updatedat") \
        .withColumn("agent", flightsDf["agent"].cast("int")) \
        .withColumn("querydestinationplace", flightsDf["querydestinationplace"].cast("int")) \
        .withColumn("querydatetime", regexp_replace(flightsDf["querydatetime"], "-", "").cast("int")) \
        .withColumn("queryoutbounddate", regexp_replace(flightsDf["queryoutbounddate"], "-", "").cast("int")) \
        .withColumn("queryinbounddate", regexp_replace(flightsDf["queryinbounddate"], "-", "").cast("int")) \
        .withColumn("outdeparture", to_timestamp(flightsDf["outdeparture"], "yyyy-MM-ddTHH:mm:ss")) \
        .withColumn("outarrival", to_timestamp(flightsDf["outarrival"], "yyyy-MM-ddTHH:mm:ss")) \
        .withColumn("indeparture", to_timestamp(flightsDf["indeparture"], "yyyy-MM-ddTHH:mm:ss")) \
        .withColumn("inarrival", to_timestamp(flightsDf["inarrival"], "yyyy-MM-ddTHH:mm:ss"))

    df.createOrReplaceTempView("flights")
    airportsGDF = glueContext.create_dynamic_frame.from_catalog(database = "pinfare", table_name = "airports")
    airportsDF = airportsGDF.toDF()
    airportsDF.createOrReplaceTempView("airports")

    agentsGDF = glueContext.create_dynamic_frame.from_catalog(database = "pinfare", table_name = "agents")
    agentsRawDF = agentsGDF.toDF()
    agentsRawDF.createOrReplaceTempView("agents_raw")

    agentsDF = spark.sql("""
        SELECT id, name, type FROM agents_raw
        WHERE type IN ('Airline', 'TravelAgent')
    """)
    agentsDF.createOrReplaceTempView("agents")

    finalDf = spark.sql("""
        SELECT /*+ BROADCAST(agents) */ /*+ BROADCAST(airports) */
            f.*, countryName, cityName, airportName, a.name AS agentName,
            CONCAT(f.outboundlegid, '-', f.inboundlegid, '-', f.agent) AS key
        FROM flights f
        LEFT JOIN agents a
        ON f.agent = a.id
        LEFT JOIN airports p
        ON f.querydestinationplace = p.airportId
    """)
    print(">>> DONE PROCESS FLIGHTS")

    print("Writing ...")
    finalDf \
        .write \
        .mode("append") \
        .partitionBy(["countryName", "querydatetime"]) \
        .parquet("s3://.../flights2")
else:
    print("Nothing to write ...")

job.commit()
import boto3
glue_client = boto3.client('glue', region_name='ap-southeast-1')
glue_client.start_crawler(Name='...')
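For reference, the per-date loop I mention above could look roughly like the sketch below. This is only an illustration, not something I have tested on this dataset; it assumes finalDf from the script above, that querydatetime is the column to split on, and that the list of distinct dates is small enough to collect to the driver. Each iteration writes one date's slice, which keeps each write small but turns the job into many sequential Spark jobs, so it would probably be slower overall.

from pyspark.sql.functions import col

# Sketch: write one querydatetime slice at a time instead of the whole finalDf at once.
dates = [r["querydatetime"] for r in finalDf.select("querydatetime").distinct().collect()]
for d in dates:
    finalDf \
        .where(col("querydatetime") == d) \
        .write \
        .mode("append") \
        .partitionBy("countryName", "querydatetime") \
        .parquet("s3://.../flights2")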
Answer 0 (score: 1)
If your LEFT JOIN has a 1:N mapping, it will multiply the number of rows in the DF, which can cause the OOM. In Glue there is no provision to set your own infrastructure configuration, such as 64 GB of memory per vCPU. If that is the case, first try the spark.yarn.executor.memoryOverhead option and/or increasing the DPUs. Otherwise, you have to slice the data with a pushdown predicate and then process all of it in a loop.
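Two rough sketches of those suggestions. First, to check whether the LEFT JOINs really have a 1:N mapping that multiplies rows, count duplicate keys in the dimension tables (column names taken from the question's script):

# Any id/airportId appearing more than once will duplicate flight rows after the joins.
agentsRawDF.groupBy("id").count().where("count > 1").show()
airportsDF.groupBy("airportId").count().where("count > 1").show()

Second, a minimal sketch of the pushdown-predicate loop, assuming the flights table in the Glue catalog is partitioned by a date-like column such as querydatetime (that partition column and the sample values below are assumptions about your catalog, not taken from the question):

# Sketch: read and process the catalog table one partition slice at a time, so each
# pass only loads a fraction of the data instead of the whole table.
for d in ["2019-01-01", "2019-01-02"]:  # assumed partition values; enumerate them in practice
    batchGDF = glueContext.create_dynamic_frame.from_catalog(
        database = "pinfare",
        table_name = "flights",
        push_down_predicate = "querydatetime == '{}'".format(d),
        transformation_ctx = "inputGDF_" + d)
    batchDF = batchGDF.toDF()
    # ... apply the same transforms, joins and write as in the script above, on this smaller batch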