AWS Glue job fails with an OOM exception when changing column names

Time: 2019-06-07 19:32:46

Tags: pyspark etl aws-glue

I have an ETL job in which I load some data from S3 into a DynamicFrame, relationalize it, and then iterate over the DynamicFrames that come back. I want to query the results in Athena later, so I want to rename any column whose name contains '.' to use '_' instead, and lowercase the names. To do this transformation I convert the DynamicFrame into a Spark DataFrame and have been going about it that way. I also saw in another SO question that the AWS Glue rename-field transform has reported problems, so I've stayed away from it.
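
To be concrete, here is a minimal sketch of the kind of rename I mean (an illustrative helper, not my actual job; the real script is further down):

# Sketch only: convert the DynamicFrame to a Spark DataFrame, replace '.'
# with '_' and lowercase every column name in a single toDF() pass, then
# convert back to a DynamicFrame.
from awsglue.dynamicframe import DynamicFrame

def rename_for_athena(dynamic_frame, glue_ctx):
    data_frame = dynamic_frame.toDF()
    new_names = [c.replace('.', '_').lower() for c in data_frame.schema.names]
    data_frame = data_frame.toDF(*new_names)  # rename all columns at once
    return DynamicFrame.fromDF(data_frame, glue_ctx, 'renamed')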

I've tried several things, including bumping the load size limit up to 50 MB, using both dataframe.schema.names and dataframe.columns, using reduce instead of a loop, repartitioning the DataFrame, and using Spark SQL to make the changes, but none of it has worked. I'm fairly certain it's this transformation that's failing, because I've put in some print statements and the print I have for right after the transformation completes never shows up. I used a UDF at one point, but that failed as well. I've tried the actual transformation with both df.toDF(new_column_names) and df.withColumnRenamed(), but it never gets that far, since I haven't seen it get past retrieving the column names. Here's the code I've been using. As mentioned above, I've been changing the actual name transformation, but the rest of it has stayed pretty much the same.

I've seen people try using spark.executor.memory, spark.driver.memory, spark.executor.memoryOverhead, and spark.driver.memoryOverhead. I've used all of these, setting them as high as AWS Glue will let you, but to no avail.
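
For completeness, this is roughly how I understand those properties would be set in plain PySpark (just a sketch with made-up values; Glue provisions memory per worker type, so I'm not certain settings made from inside the script actually take effect):

# Sketch only: memory-related properties set on a SparkConf before the
# context is created. In Glue these are capped by the worker type and may
# be ignored when set from the script itself.
from pyspark.conf import SparkConf
from pyspark.context import SparkContext

conf = (SparkConf()
        .set('spark.executor.memory', '5g')
        .set('spark.driver.memory', '5g')
        .set('spark.executor.memoryOverhead', '1g')
        .set('spark.driver.memoryOverhead', '1g'))
sc = SparkContext(conf=conf)

The full job script follows.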

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import explode, col, lower, trim, regexp_replace

import copy
import json
import boto3
import botocore
import time

# ========================================================
#                   UTILITY FUNCTIONS
# ========================================================
def lower_and_pythonize(s=None):
    if s is not None:
        return s.replace('.', '_').lower()

    else:
        return None

# pyspark implementation of renaming
# exprs = [
#     regexp_replace(lower(trim(col(c))),'\.' , '_').alias(c) if t == "string" else col(c) 
#     for (c, t) in data_frame.dtypes
# ]
# ========================================================
#                  END UTILITY FUNCTIONS
# ========================================================

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

job = Job(glueContext)
job.init(args['JOB_NAME'], args)

#my params
bucket_name = '<my-s3-bucket>'                                                            # name of the bucket. do not include 's3://' thats added later
output_key = '<my-output-path>'                                                             # key where all of the output is saved
input_keys = ['<my-root-directory>']                                        # highest level key that holds all of the desired data
s3_exclusions =  "[\"*.orc\"]"                                                       # list  of strings to exclude. Documentation: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-s3
s3_exclusions = s3_exclusions.replace('\n', '')
dfc_root_table_name = 'root'                                                # name of the root table generated in the relationalize process
input_paths = ['s3://' + bucket_name + '/' + x for x in input_keys]         # turn input keys into s3 paths
output_connection_opts = {"path": "s3://" + bucket_name + "/" + output_key} # dict of options. Documentation link found above the write_dynamic_frame.from_options line
s3_client = boto3.client('s3', 'us-east-1')                                              # s3 client used for writing to s3
s3_resource = boto3.resource('s3', 'us-east-1')                                          # s3 resource used for checking if key exists
group_mb = 50                                                               # NOTE: 75 has proven to be too much when running on all of the april data
group_size  = str(group_mb * 1024 * 1024)
input_connection_opts = {'paths': input_paths, 
                         'groupFiles': 'inPartition', 
                         'groupSize': group_size,
                         'recurse': True, 
                         'exclusions': s3_exclusions}                       # dict of options. Documentation link found above the create_dynamic_frame_from_options line    

print(sc._conf.get('spark.executor.cores'))       
num_partitions = int(sc._conf.get('spark.executor.cores')) * 4


print('Loading all json files into DynamicFrame...')
loading_time = time.time()
df = glueContext.create_dynamic_frame_from_options(connection_type='s3', connection_options=input_connection_opts, format='json')
print('Done. Time to complete: {}s'.format(time.time() - loading_time))

# using the list of known null fields (at least on small sample size) remove them
#df = df.drop_fields(drop_paths)    
# drop any remaining null fields. The above covers known problems that this step doesn't fix
print('Dropping null fields...')
dropping_time =  time.time()
df_without_null = DropNullFields.apply(frame=df, transformation_ctx='df_without_null')
print('Done. Time to complete: {}s'.format(time.time() - dropping_time))

df = None
print('Relationalizing dynamic frame...')
relationalizing_time = time.time()
dfc = Relationalize.apply(frame=df_without_null, name=dfc_root_table_name, info="RELATIONALIZE", transformation_ctx='dfc', stageThreshold=3)
print('Done. Time to complete: {}s'.format(time.time() - relationalizing_time))

keys = dfc.keys()
keys.sort(key=lambda s: len(s))

print('Writting all dynamic frames to s3...')
writting_time = time.time()
for key in keys:
    good_key = lower_and_pythonize(s=key)
    data_frame = dfc.select(key).toDF()

    # lowercase all the names and remove '.'
    print('Removing . and _ from names for {} frame...'.format(key))
    df_fix_names_time = time.time()
    print('Repartitioning data frame...')
    data_frame = data_frame.repartition(num_partitions)
    print('Done.')

    # 
    print('Changing names...')
    for old_name in data_frame.schema.names:
        data_frame = data_frame.withColumnRenamed(old_name, old_name.replace('.','_').lower())
    print('Done.')
    #

    df_now = DynamicFrame.fromDF(dataframe=data_frame, glue_ctx=glueContext, name='df_now')
    print('Done. Time to complete: {}'.format(time.time() - df_fix_names_time))

    # if a conflict of types appears, make it 2 columns
    # https://docs.aws.amazon.com/glue/latest/dg/built-in-transforms.html
    print('Fixing any type conflicts for {} frame...'.format(key))
    df_resolve_time = time.time()
    resolved = ResolveChoice.apply(frame = df_now, choice = 'make_cols', transformation_ctx = 'resolved')
    print('Done. Time to complete: {}'.format(time.time() - df_resolve_time))

    # check if key exists in s3. if not make one
    out_connect = copy.deepcopy(output_connection_opts)
    out_connect['path'] = out_connect['path'] + '/' + str(good_key)
    try: 
        s3_resource.Object(bucket_name, output_key + '/' + good_key + '/').load()
    except botocore.exceptions.ClientError as e:
        if e.response['Error']['Code'] == '404' or 'NoSuchKey' in e.response['Error']['Code']:
            # object doesn't exist
            s3_client.put_object(Bucket=bucket_name, Key=output_key+'/'+good_key + '/')
        else:
            print(e)

    ## https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-glue-context.html
    print('Writing {} frame to S3...'.format(key))
    df_writing_time = time.time()
    datasink4 = glueContext.write_dynamic_frame.from_options(frame = df_now, connection_type = "s3", connection_options = out_connect, format = "orc", transformation_ctx = "datasink4")
    out_connect = None
    datasink4 = None
    print('Done. Time to complete: {}'.format(time.time() - df_writing_time))

print('Done. Time to complete: {}s'.format(time.time() - writting_time))
job.commit()

Here's the error I'm getting:

19/06/07 16:33:36 DEBUG Client: 
client token: N/A
diagnostics: Application application_1559921043869_0001 failed 1 times due to AM Container for appattempt_1559921043869_0001_000001 exited with exitCode: -104
For more detailed output, check application tracking page:http://ip-172-32-9-38.ec2.internal:8088/cluster/app/application_1559921043869_0001Then, click on links to logs of each attempt.
Diagnostics: Container [pid=9630,containerID=container_1559921043869_0001_01_000001] is running beyond physical memory limits. Current usage: 5.6 GB of 5.5 GB physical memory used; 8.8 GB of 27.5 GB virtual memory used. Killing container.
Dump of the process-tree for container_1559921043869_0001_01_000001 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 9630 9628 9630 9630 (bash) 0 0 115822592 675 /bin/bash -c LD_LIBRARY_PATH=/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native /usr/lib/jvm/java-openjdk/bin/java -server -Xmx5120m -Djava.io.tmpdir=/mnt/yarn/usercache/root/appcache/application_1559921043869_0001/container_1559921043869_0001_01_000001/tmp '-XX:+UseConcMarkSweepGC' '-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' '-Djavax.net.ssl.trustStore=ExternalAndAWSTrustStore.jks' '-Djavax.net.ssl.trustStoreType=JKS' '-Djavax.net.ssl.trustStorePassword=amazon' '-DRDS_ROOT_CERT_PATH=rds-combined-ca-bundle.pem' '-DREDSHIFT_ROOT_CERT_PATH=redshift-ssl-ca-cert.pem' '-DRDS_TRUSTSTORE_URL=file:RDSTrustStore.jks' -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1559921043869_0001/container_1559921043869_0001_01_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class 'org.apache.spark.deploy.PythonRunner' --primary-py-file runscript.py --arg 'script_2019-06-07-15-29-50.py' --arg '--JOB_NAME' --arg 'tss-json-to-orc' --arg '--JOB_ID' --arg 'j_f9f7363e5d8afa20784bc83d7821493f481a78352641ad2165f8f68b88c8e5fe' --arg '--JOB_RUN_ID' --arg 'jr_a77087792dd74231be1f68c1eda2ed33200126b8952c5b1420cb6684759cf233' --arg '--job-bookmark-option' --arg 'job-bookmark-disable' --arg '--TempDir' --arg 's3://aws-glue-temporary-059866946490-us-east-1/zmcgrath' --properties-file /mnt/yarn/usercache/root/appcache/application_1559921043869_0001/container_1559921043869_0001_01_000001/__spark_conf__/__spark_conf__.properties 1> /var/log/hadoop-yarn/containers/application_1559921043869_0001/container_1559921043869_0001_01_000001/stdout 2> /var/log/hadoop-yarn/containers/application_1559921043869_0001/container_1559921043869_0001_01_000001/stderr 
|- 9677 9648 9630 9630 (python) 12352 2628 1418354688 261364 python runscript.py script_2019-06-07-15-29-50.py --JOB_NAME tss-json-to-orc --JOB_ID j_f9f7363e5d8afa20784bc83d7821493f481a78352641ad2165f8f68b88c8e5fe --JOB_RUN_ID jr_a77087792dd74231be1f68c1eda2ed33200126b8952c5b1420cb6684759cf233 --job-bookmark-option job-bookmark-disable --TempDir s3://aws-glue-temporary-059866946490-us-east-1/zmcgrath 
|- 9648 9630 9630 9630 (java) 265906 3083 7916974080 1207439 /usr/lib/jvm/java-openjdk/bin/java -server -Xmx5120m -Djava.io.tmpdir=/mnt/yarn/usercache/root/appcache/application_1559921043869_0001/container_1559921043869_0001_01_000001/tmp -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError=kill -9 %p -Djavax.net.ssl.trustStore=ExternalAndAWSTrustStore.jks -Djavax.net.ssl.trustStoreType=JKS -Djavax.net.ssl.trustStorePassword=amazon -DRDS_ROOT_CERT_PATH=rds-combined-ca-bundle.pem -DREDSHIFT_ROOT_CERT_PATH=redshift-ssl-ca-cert.pem -DRDS_TRUSTSTORE_URL=file:RDSTrustStore.jks -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1559921043869_0001/container_1559921043869_0001_01_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class org.apache.spark.deploy.PythonRunner --primary-py-file runscript.py --arg script_2019-06-07-15-29-50.py --arg --JOB_NAME --arg tss-json-to-orc --arg --JOB_ID --arg j_f9f7363e5d8afa20784bc83d7821493f481a78352641ad2165f8f68b88c8e5fe --arg --JOB_RUN_ID --arg jr_a77087792dd74231be1f68c1eda2ed33200126b8952c5b1420cb6684759cf233 --arg --job-bookmark-option --arg job-bookmark-disable --arg --TempDir --arg s3://aws-glue-temporary-059866946490-us-east-1/zmcgrath --properties-file /mnt/yarn/usercache/root/appcache/application_1559921043869_0001/container_1559921043869_0001_01_000001/__spark_conf__/__spark_conf__.properties 

Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Failing this attempt. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1559921462650
final status: FAILED
tracking URL: http://ip-172-32-9-38.ec2.internal:8088/cluster/app/application_1559921043869_0001
user: root

And here are the log contents from the job:

LogType:stdout
Log Upload Time:Fri Jun 07 16:33:36 +0000 2019
LogLength:487
Log Contents:
4
Loading all json files into DynamicFrame...
Done. Time to complete: 59.5056920052s
Dropping null fields...
null_fields [<some fields that were dropped>]
Done. Time to complete: 529.95293808s
Relationalizing dynamic frame...
Done. Time to complete: 2773.11689401s
Writting all dynamic frames to s3...
Removing . and _ from names for root frame...
Repartitioning data frame...
Done.
Changing names...
End of LogType:stdout

As I said before, the Done. print after changing the names never shows up in the logs. I've seen a lot of people getting the same error I am, and I've tried plenty of their suggestions without any luck. Any help you can provide would be greatly appreciated. Let me know if you need more information. Thanks.

Edit

Prabhakar's comment reminded me that I have already tried the memory worker type in AWS Glue, and it still failed. As stated above, I tried raising the memoryOverhead from 5 to 12, but to no avail. Neither of these got the job to complete successfully.

Update

I put in the following code in place of the code above to change the column names, to make debugging easier:

print('Changing names...')
name_counter = 0
for old_name in data_frame.schema.names:
    print('Name number {}. name being changed: {}'.format(name_counter, old_name))
    data_frame = data_frame.withColumnRenamed(old_name, old_name.replace('.','_').lower())
    name_counter += 1
print('Done.')

And I got the following output:

Removing . and _ from names for root frame...
Repartitioning data frame...
Done.
Changing names...
End of LogType:stdout

So there is definitely something wrong with the data_frame.schema.names part. Could this be related to my loop over all of the DynamicFrames? Am I iterating correctly over the DynamicFrames returned by the relationalize transformation?

更新2 胶水最近添加了更多详细的日志,我发现了这个

ERROR YarnClusterScheduler: Lost executor 396 on ip-172-32-78-221.ec2.internal: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.

This isn't happening on just that one executor; it looks like it's happening on nearly all of them.
I can try increasing the executor memory overhead, but I would like to know why getting the column names results in an OOM error. Why would something I'd consider trivial take up that much memory?

Update

I tried running the job with both spark.driver.memoryOverhead=7g and spark.yarn.executor.memoryOverhead=7g, and I got an OOM error again.

0 Answers:

There are no answers yet.