Question

According to the AWS Glue documentation, we can use exclusions to exclude files when the connection type is s3:

https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html

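"exclusions": (Optional) A string containing a JSON list of Unix-style glob patterns to exclude. For example, "[\"**.pdf\"]" excludes all PDF files. For more information about the glob syntax that AWS Glue supports, see Include and Exclude Patterns.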
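
As a small illustration of the format quoted above (a JSON list serialized into a string), Python's `json.dumps` can produce the value without hand-written escapes. This is only a sketch; the variable names are illustrative and not taken from the documentation.

```python
import json

# Glob patterns to exclude; the PDF pattern is the one from the quoted documentation.
patterns = ["**.pdf"]

# The option expects a string containing a JSON list, not a Python list.
exclusions = json.dumps(patterns)

print(exclusions)  # ["**.pdf"]
```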

My S3 bucket looks like the following, and I want to exclude the test1 folder.

/mykkkkkk-test
   test1/
      testfolder/
         11.json
         22.json
   test2/
      1.json
   test3/
      2.json
   test4/
      3.json
   test5/
      4.json

I used the following code to exclude the test1 folder, but it does not work: the job still processes the files under test1.

datasource0 = glueContext.create_dynamic_frame_from_options("s3",
    {'paths': ["s3://mykkkkkk-test/"],
    'exclusions': "[\"test1/**\"]",
    'recurse':True,
    'groupFiles': 'inPartition',
    'groupSize': '1048576'}, 
    format="json",
    transformation_ctx = "datasource0")

Can exclusions really be used in an ETL PySpark script? I also tried the following variants, but none of them had any effect:

'exclusions': "[\"test1/**\"]",
'exclusions': ["test1/**"],
'exclusions': "[\"test1\"]",

Answer 1

Try using the full path in the exclusion.
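
A minimal sketch of what that might look like for the bucket in the question, assuming the patterns are compared against the full S3 URI of each object. The helper `build_s3_options` is hypothetical, and the `create_dynamic_frame_from_options` call is left commented out because it needs a Glue job environment with a `glueContext`.

```python
import json

def build_s3_options(bucket_uri, excluded_prefixes):
    """Build connection options whose exclusions use full S3 paths (hypothetical helper)."""
    base = bucket_uri.rstrip("/")
    # Prefix each folder with the bucket URI, then serialize to a JSON-list string.
    patterns = [f"{base}/{prefix.strip('/')}/**" for prefix in excluded_prefixes]
    return {
        "paths": [base + "/"],
        "exclusions": json.dumps(patterns),
        "recurse": True,
        "groupFiles": "inPartition",
        "groupSize": "1048576",
    }

options = build_s3_options("s3://mykkkkkk-test/", ["test1"])
print(options["exclusions"])  # ["s3://mykkkkkk-test/test1/**"]

# Inside a Glue job, the options would be passed as in the question:
# datasource0 = glueContext.create_dynamic_frame_from_options(
#     "s3", options, format="json", transformation_ctx="datasource0")
```

Whether this resolves the problem depends on how Glue matches the patterns, so it is worth confirming against a small test run before relying on it.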

