My job code is as follows:
```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)

s3_paths = ['01', '02', '03']  # these sub-paths sit in the same folder and are partitioned under the source path
s3_source_path = 'bucket_name/'

for sub_path in s3_paths:
    s3_path = s3_source_path + sub_path

    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)

    # get data from the s3 path
    job_DyF = glueContext.create_dynamic_frame.from_options('s3', {"paths": [s3_path], "recurse": True}, "json", format_options={"jsonPath": "$[*]"}, transformation_ctx="job_DyF")

    # write the dataset to s3 as avro; df_verify_filtered is the filtered frame derived from job_DyF (filtering step not shown)
    data_sink = glueContext.write_dynamic_frame.from_options(frame=df_verify_filtered, connection_type="s3", connection_options={"path": "s3://target", "partitionKeys": ["partition_0", "partition_1", "partition_2"]}, format="avro", transformation_ctx="data_sink")

job.commit()
```
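To spell out the loop: each iteration reads a different prefix but reuses the same transformation_ctx strings for the read and the write. A minimal illustration of the code above (the bucket name is just a placeholder):

```python
# what each iteration of the loop above resolves to
s3_source_path = 'bucket_name/'
for sub_path in ['01', '02', '03']:
    print(s3_source_path + sub_path)  # bucket_name/01, bucket_name/02, bucket_name/03
    # the read always uses transformation_ctx="job_DyF" and the write always uses "data_sink"
```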
After the job succeeded, records from some of the sub-paths were missing. When I try to run the job again, it reports "no new file detected".
So I tried running the code for a specific sub_path without the for sub_path in s3_paths loop, and strangely the problem showed up when the job ran for sub-path #2: it says "no new file detected" for sub-path '02', even though the job had previously only been run against the first sub-path '01', and only data from that first sub-path was ever ingested into the S3 Avro output.
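Roughly, that single sub-path test looked like this (loop removed and the path hard-coded; same placeholder names as above, write step and the filtering that produces df_verify_filtered omitted):

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# read only sub-path '02' (placeholder bucket name)
job_DyF = glueContext.create_dynamic_frame.from_options('s3', {"paths": ['bucket_name/02'], "recurse": True}, "json", format_options={"jsonPath": "$[*]"}, transformation_ctx="job_DyF")
# this is the run that reports "no new file detected", even though '02' was never written to the Avro output
job.commit()
```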
I can't figure out what is wrong with the way I've set up this bookmark, so your insights would be much appreciated! Thanks in advance.