I have a large number of JSON files in a single folder in S3, about 2 TB in total. I want to organize these files into partitions computed from a timestamp inside each JSON object, and convert the data to Parquet. However, while my job script works on a test set of the files, it never makes it through the full dataset. It seems to run into trouble during the shuffle.
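For context, each raw JSON object has the nested shape that the ApplyMapping step in the script below refers to, roughly like this (values are placeholders):

record = {
    "feature_id": {"s": "<string>"},
    "env": {"s": "<string>"},
    "value": {"s": "<string>"},
    "timestamp": {"n": "<epoch seconds, as a string>"},
}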
I have tried:
Adding more DPUs: 10, 20, 50
Using high-memory DPUs (a rough sketch of both capacity settings follows this list)
Using glueContext.create_dynamic_frame.from_options() with groupSize and groupFiles
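In case the configuration matters, the capacity changes above were made roughly like this (the job name and exact worker type are placeholders, and only one of MaxCapacity or WorkerType/NumberOfWorkers is set per run):

import boto3

glue = boto3.client("glue")

# Standard DPUs: separate runs with 10, 20 and 50.
glue.start_job_run(JobName="<job-name>", MaxCapacity=50.0)

# Higher-memory workers instead of standard DPUs.
glue.start_job_run(JobName="<job-name>", WorkerType="G.2X", NumberOfWorkers=20)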
The job fails with either an "Unable to parse XML document" error or "Unable to delete key: /{destination folder path}/_temporary". The former happens when I try to read the data with glueContext.create_dynamic_frame.from_catalog(), the latter when I use glueContext.create_dynamic_frame.from_options() with groupSize = '1048576'.
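For completeness, I would expect the same grouping options to be passable on the catalog read as well, via additional_options, along these lines (a sketch only; database and table names as in the commented-out block in the script):

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="features",
    table_name="raw_dump",
    additional_options={'groupFiles': 'inPartition', 'groupSize': '1048576'},
    transformation_ctx="datasource0")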
Here is the script:
import sys
from awsglue.transforms import ApplyMapping, Map
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import time
import re
from datetime import datetime as dt
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# datasource0 = glueContext.create_dynamic_frame.from_catalog(
#     database="features",
#     table_name="raw_dump",
#     transformation_ctx="datasource0")
datasource0 = glueContext.create_dynamic_frame.from_options(
    "s3",
    {'paths': ["s3://<bucket>/production/raw_dump/"],
     "recurse": True,
     'groupFiles': 'inPartition',
     'groupSize': '1048576'},
    format='json'
)
mappings = [
    ("feature_id.s", "string", "feature_id", "string"),
    ("env.s", "string", "env", "string"),
    ("value.s", "string", "value", "string"),
    ("timestamp.n", "string", "timestamp", "long")]
applymapping1 = ApplyMapping.apply(
    frame=datasource0,
    mappings=mappings,
    transformation_ctx="applymapping1")


def map_function(rec):
    # Assuming the processing timestamp is the same as the timestamp.
    date_object = dt.fromtimestamp(int(rec['timestamp']))
    rec['year'] = date_object.year
    rec['month'] = date_object.month
    rec['day'] = date_object.day
    rec['hour'] = date_object.hour
    return rec


map2 = Map.apply(
    frame=applymapping1,
    f=map_function,
    transformation_ctx="map2")
connection_options = {
    "path": "s3://<bucket>/production/firehose/",
    "partitionKeys": ["year", "month", "day", "hour"]
}
datasink2 = glueContext.write_dynamic_frame.from_options(
    frame=map2,
    connection_type="s3",
    connection_options=connection_options,
    format="parquet",
    transformation_ctx="datasink2")
job.commit()
I expect the files to end up in folders like s3://<bucket>/production/firehose/year={year}/month={month}/day={day}/hour={hour}/
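For reference, that is the same layout a plain Spark DataFrame write would produce, i.e. something along these lines (a sketch only, not what the job currently does):

df = map2.toDF()
(df.write
    .mode("append")
    .partitionBy("year", "month", "day", "hour")
    .parquet("s3://<bucket>/production/firehose/"))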
Any ideas?