The error occurs erratically: "Container killed by YARN for exceeding memory limits."

Time: 2019-10-11 14:01:19

Tags: apache-spark pyspark aws-glue

ErrorMessage': 'An error occurred while calling o103.pyWriteDynamicFrame. 
Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most
recent failure: Lost task 0.3 in stage 5.0 
(TID 131, ip-1-2-3-4.eu-central-1.compute.internal, executor 20): 
ExecutorLostFailure (executor 20 exited caused by one of the running tasks) 
Reason: Container killed by YARN for exceeding memory limits.  5.5 GB of 
5.5 GB physical memory used. Consider boosting 
spark.yarn.executor.memoryOverhead or disabling 
yarn.nodemanager.vmem-check-enabled because of YARN-4714.
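
The overhead setting named in the message is a Spark-on-YARN property. As a point of reference, here is a minimal sketch of raising it on a self-managed Spark 2.x/YARN cluster; whether Glue honors a custom SparkConf like this is unclear to me, since Glue provisions its executors itself (the knobs I use are MaxCapacity and the worker type):

from pyspark import SparkConf
from pyspark.context import SparkContext

# Assumption: request 1 GiB of off-heap headroom per executor container.
# The value is in MiB for this Spark 2.x property named in the error message.
conf = SparkConf().set("spark.yarn.executor.memoryOverhead", "1024")
sc = SparkContext(conf=conf)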

The job does the following (pseudocode):

  1. Read the CSV into a DynamicFrame dynf
  2. `dynf.toDF().repartition(100)`
  3. Map.apply(dyndf, tf) # tf being a function applied to every row
  4. `dynf.toDF().coalesce(10)`
  5. dyndf written to S3

This job has executed successfully dozens of times with the same Glue setup (MaxCapacity of 10.0, Standard workers), and a failed CSV can usually be re-run successfully without any adjustment. Meaning: it works. Not only that, the job even completes successfully on CSVs much larger than the ones that fail.

That is what I mean by erratic: I cannot see a pattern such as "for CSVs larger than X I need more workers" or anything of that sort.

Does anyone have an idea why this error might occur seemingly at random?


The relevant part of the code:

import sys

from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame

# s3://bucket/path/object
args = getResolvedOptions(sys.argv, [
    'JOB_NAME',
    'SOURCE_BUCKET', # "bucket"
    'SOURCE_PATH',   # "path/"
    'OBJECT_NAME',   # "object"
    'TARGET_BUCKET', # "bucket"
    'TARGET_PATH',   # "path/"
    'PARTS_LOAD',
    'PARTS_SAVE'
])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

data_DYN = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    format="csv",
    connection_options={
        "paths":[
            "s3://{sb}/{sp}{on}".format(
                sb=args['SOURCE_BUCKET'],
                sp=args['SOURCE_PATH'],
                on=args['OBJECT_NAME']
            )
        ]
    },
    format_options={
        "withHeader": True,
        "separator": ","
    }
)
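
# Rebalance the raw CSV across PARTS_LOAD partitions before the row-level transform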

data_DF = data_DYN.toDF().repartition(int(args["PARTS_LOAD"]))
data_DYN = DynamicFrame.fromDF(data_DF, glueContext, "data_DYN")

def tf(rec):
    # functions applied to elements of rec
    return rec

data_DYN_2 = Map.apply(data_DYN, tf)

cols = [
    'col1', 'col2', ...
]

data_DYN_3 = SelectFields.apply(data_DYN_2, cols)

data_DF_3 = data_DYN_3.toDF().cache()
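
# coalesce() merges partitions down to PARTS_SAVE without a shuffle,
# so the resulting output partitions may end up unevenly sized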

data_DF_4 = data_DF_3.coalesce(int(args["PARTS_SAVE"]))
data_DYN_4 = DynamicFrame.fromDF(data_DF_4, glueContext, "data_DYN_4")

datasink = glueContext.write_dynamic_frame.from_options(
    frame = data_DYN_4, 
    connection_type = "s3", 
    connection_options = {
        "path": "s3://{tb}/{tp}".format(tb=args['TARGET_BUCKET'],tp=args['TARGET_PATH']),
        "partitionKeys": ["col_x","col_y"]
    }, 
    format = "parquet",
    transformation_ctx = "datasink"
)

job.commit()

1 answer:

Answer 0 (score: 1)

I would suspect the `.coalesce(10)` to be the culprit, as it reduces the number of partitions from 100 to 10 without rebalancing the data between them. Doing `.repartition(10)` instead might fix this, at the cost of an extra shuffle.
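
Applied to the code in the question, that is a one-line change (a sketch only; PARTS_SAVE and the surrounding names are taken from the question's script):

# Full shuffle into PARTS_SAVE evenly sized partitions instead of merging
# existing partitions in place with coalesce():
data_DF_4 = data_DF_3.repartition(int(args["PARTS_SAVE"]))
data_DYN_4 = DynamicFrame.fromDF(data_DF_4, glueContext, "data_DYN_4")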