My job code is as follows:
```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)

s3_paths = ['01', '02', '03']  # these sub-paths sit in the same folder and are partitioned under the source path
s3_source_path = 'bucket_name/'

for sub_path in s3_paths:
    s3_path = s3_source_path + sub_path

    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)

    # get data from the s3 path
    job_DyF = glueContext.create_dynamic_frame.from_options('s3', {"paths": [s3_path], "recurse": True}, "json", format_options={"jsonPath": "$[*]"}, transformation_ctx="job_DyF")

    # write the dataset to s3 as avro; df_verify_filtered is the filtered frame derived from job_DyF (filtering step not shown)
    data_sink = glueContext.write_dynamic_frame.from_options(frame=df_verify_filtered, connection_type="s3", connection_options={"path": "s3://target", "partitionKeys": ["partition_0", "partition_1", "partition_2"]}, format="avro", transformation_ctx="data_sink")

job.commit()
```
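To spell out the loop: each iteration reads a different prefix but reuses the same transformation_ctx strings for the read and the write. A minimal illustration of the code above (the bucket name is just a placeholder):

```python
# what each iteration of the loop above resolves to
s3_source_path = 'bucket_name/'
for sub_path in ['01', '02', '03']:
    print(s3_source_path + sub_path)  # bucket_name/01, bucket_name/02, bucket_name/03
    # the read always uses transformation_ctx="job_DyF" and the write always uses "data_sink"
```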
After the job succeeded, records from some of the sub-paths were missing. When I try to run the job again, it reports "no new file detected".
So I tried running the code for a specific sub_path without the for sub_path in s3_paths loop, and strangely the problem showed up when the job ran for sub-path #2: it says "no new file detected" for sub-path '02', even though the job had previously only been run against the first sub-path '01', and only data from that first sub-path was ever ingested into the S3 Avro output.
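Roughly, that single sub-path test looked like this (loop removed and the path hard-coded; same placeholder names as above, write step and the filtering that produces df_verify_filtered omitted):

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# read only sub-path '02' (placeholder bucket name)
job_DyF = glueContext.create_dynamic_frame.from_options('s3', {"paths": ['bucket_name/02'], "recurse": True}, "json", format_options={"jsonPath": "$[*]"}, transformation_ctx="job_DyF")
# this is the run that reports "no new file detected", even though '02' was never written to the Avro output
job.commit()
```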
I can't figure out what is wrong with the way I've set up this bookmark, so your insights would be much appreciated! Thanks in advance.