Glue job to partition a large number of files in a single S3 folder into multiple folders fails

Posted: 2019-09-06 18:27:18

Tags: pyspark aws-glue

I have a large number of JSON files in a single folder in S3, roughly 2 TB in total. I want to organize these files into partitions computed from a timestamp inside each JSON object, and to convert the data to Parquet format. However, while my job script works on a test set of the files, it never gets through the full dataset; it seems to run into trouble during the shuffle.
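
For example, a record with a Unix timestamp of 1567794438 (a made-up value) should land under year=2019/month=9/day=6/hour=18. A minimal sketch of the intended computation:

from datetime import datetime as dt

ts = 1567794438  # hypothetical epoch timestamp taken from a record
d = dt.fromtimestamp(ts)  # note: fromtimestamp() converts in the machine's local timezone
print(d.year, d.month, d.day, d.hour)  # 2019 9 6 18 on a UTC machine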

I have tried:

  • Adding more DPUs: 10, 20, 50

  • Using large-memory DPUs

  • Using glueContext.create_dynamic_frame.from_options() with groupSize and groupFiles

The job fails with either a "Failed to parse XML document" error or "Unable to delete key: /{destination folder path}/_temporary". The former occurs when I try to read the data with glueContext.create_dynamic_frame.from_catalog(), and the latter when I use glueContext.create_dynamic_frame.from_options() with groupSize='1048576'.

The script is as follows:

import sys
from awsglue.transforms import ApplyMapping, Map
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

import time
import re
from datetime import datetime as dt

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# datasource0 = glueContext.create_dynamic_frame.from_catalog(
#     database="features",
#     table_name="raw_dump",
#     transformation_ctx="datasource0")
datasource0 = glueContext.create_dynamic_frame.from_options(
    "s3",
    {'paths': ["s3://<bucket>/production/raw_dump/"],
     "recurse": True,
     'groupFiles': 'inPartition',
     'groupSize': '1048576'},
    format='json'
    )

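# Each mapping tuple is (source field path, source type, target field, target type);
# the ".s"/".n" paths pull the values nested one level down in the raw JSON records.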
mappings = [
    ("feature_id.s", "string", "feature_id", "string"),
    ("env.s", "string", "env", "string"),
    ("value.s", "string", "value", "string"),
    ("timestamp.n", "string", "timestamp", "long")]

applymapping1 = ApplyMapping.apply(
    frame=datasource0,
    mappings=mappings,
    transformation_ctx="applymapping1")

def map_function(rec):

    # Assuming the processing timestamp is the same as the timestamp.
    date_object = dt.fromtimestamp(int(rec['timestamp']))
    rec['year'] = date_object.year
    rec['month'] = date_object.month
    rec['day'] = date_object.day
    rec['hour'] = date_object.hour

    return rec

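# Map.apply runs map_function over every record to add the partition columns.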
map2 = Map.apply(
    frame=applymapping1,
    f=map_function,
    transformation_ctx="map2")

connection_options = {
    "path": "s3://<bucket>/production/firehose/",
    "partitionKeys":  ["year", "month", "day", "hour"]
    }

datasink2 = glueContext.write_dynamic_frame.from_options(
    frame=map2,
    connection_type="s3",
    connection_options=connection_options,
    format="parquet",
    transformation_ctx="datasink2")

job.commit()

I expect the files to end up in folders such as s3://<bucket>/production/firehose/year={year}/month={month}/day={day}/hour={hour}/.
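
For reference, the equivalent write in plain PySpark (a sketch, assuming map2 is first converted to a Spark DataFrame) would produce the same layout:

df = map2.toDF()
(df.write
   .partitionBy("year", "month", "day", "hour")
   .mode("append")
   .parquet("s3://<bucket>/production/firehose/"))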

Any ideas?

0 Answers:

No answers yet