Question

I have a large bucket containing more than 6 million files. I am hitting the error Failed to sanitize XML document destined for handler class, and I believe this is the cause: https://github.com/lbroudoux/es-amazon-s3-river/issues/16

Is there a way to limit the number of files read on the first run?

This is what I have: DataSource0 = glueContext.create_dynamic_frame.from_catalog(database = "s3-sat-dth-prd", table_name = "datahub_meraki_user_data", transformation_ctx = "DataSource0"). Can I tell it to read only one folder in the bucket? The folders are named like this: partition=13/, partition=14/, partition=n/, and so on.

How can I solve this?

Thanks in advance.

Answer 1

There are three main ways (that I know of) to avoid this.

1. Load from a prefix

To load files from a specific path in AWS Glue, you can use the following syntax.

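from awsglue.context import GlueContext
from pyspark.context import SparkContext

# Build the GlueContext that the call below refers to as `context`.
context = GlueContext(SparkContext.getOrCreate())

dynamic_frame = context.create_dynamic_frame_from_options(
    "s3",
    {
        'paths': ['s3://my_bucket_1/my_prefix_1'],  # only objects under this prefix are read
        'recurse': True,                            # descend into sub-folders
        'groupFiles': 'inPartition',                # group many small files into fewer read tasks
        'groupSize': '1073741824'                   # target group size in bytes (1 GiB)
    },
    format='json',
    transformation_ctx='DataSource0'
)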

You can put several paths in paths, and Glue will load from all of them.
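As a rough sketch of how this applies to the partition=13/, partition=14/ folders from the question, the paths list can be built from the partition numbers you want. The helper names partition_paths and s3_connection_options are made up for this illustration and are not part of the Glue API; the option keys are the ones used above.

```python
def partition_paths(bucket, partitions, prefix=""):
    """Return one s3:// URI per partition folder, e.g. s3://bucket/partition=13/."""
    clean_prefix = prefix.strip("/")
    base = f"s3://{bucket}/" + (f"{clean_prefix}/" if clean_prefix else "")
    return [f"{base}partition={p}/" for p in partitions]


def s3_connection_options(paths, group_size_bytes=1024 ** 3):
    """Build the options dict passed as the second argument in the example above."""
    return {
        "paths": list(paths),
        "recurse": True,
        "groupFiles": "inPartition",
        "groupSize": str(group_size_bytes),  # passed as a string, as in the example above
    }


# Hypothetical bucket name; only partitions 13 and 14 would be read.
options = s3_connection_options(partition_paths("my_bucket_1", [13, 14]))
print(options["paths"])
```

The resulting dict can then be passed to create_dynamic_frame_from_options in place of the literal options shown above.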

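2. Use Glue bookmarks

When your bucket holds millions of files and you only want to load the new ones (between Glue job runs), you can enable Glue bookmarks. Glue keeps track of the files it has already read in an internal index (which we cannot access). You enable bookmarks by passing a parameter when you define the job.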


  MyJob:
    Type: AWS::Glue::Job
    Properties:
      ...
      GlueVersion: 2.0
      Command:
        Name: glueetl
        PythonVersion: 3
        ...
      DefaultArguments: {
        "--job-bookmark-option": job-bookmark-enable,
        ...

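This enables bookmarks, which are keyed by the name you pass as transformation_ctx when loading data. Yes, it is confusing that AWS uses the same parameter for several purposes!

It is just as important not to forget job.commit() at the end of your Glue script, where job is your instance of Job from from awsglue.job import Job.

Then, when you call the same context.create_dynamic_frame_from_options() function with the root prefix and the same transformation_ctx, it loads only the new files anywhere in the hierarchy under that prefix. This saved us a lot of trouble hunting for new files. Read the docs to learn more about bookmarks.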
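Putting those pieces together, a bookmark-aware script might look roughly like the sketch below. It only runs inside the Glue runtime, which is why the Glue imports sit inside the function. The function name run is an assumption for this example, the path is the placeholder from the first example, and the job must also have --job-bookmark-option set to job-bookmark-enable.

```python
import sys


def run(argv):
    # These modules are only available inside the AWS Glue runtime.
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())

    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)  # initialise the job before reading

    # The bookmark is tracked under this transformation_ctx name.
    frame = glue_context.create_dynamic_frame_from_options(
        "s3",
        {"paths": ["s3://my_bucket_1/my_prefix_1"], "recurse": True},
        format="json",
        transformation_ctx="DataSource0",
    )

    # ... transform and write the frame here ...
    record_count = frame.count()

    job.commit()  # without this, the bookmark state is not saved
    return record_count
```

In an actual Glue script you would call run(sys.argv) at the bottom of the file.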

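3. Avoid small files

If your files are very small, AWS Glue takes a long time to load them. So if you control the file size, make each file at least 100 MB. For example, we write to S3 from a Firehose stream, and we tuned the buffer size to avoid producing small files. This considerably reduced the load time of our Glue jobs.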
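To get a feel for the difference, here is a back-of-the-envelope calculation. The 50 KiB average file size is purely an assumption for illustration (the question only gives the file count), and files_needed is a made-up helper.

```python
def files_needed(total_bytes, target_file_bytes):
    """Number of files required to hold total_bytes at the target size (ceiling division)."""
    return -(-total_bytes // target_file_bytes)


KIB = 1024
MIB = 1024 ** 2

# Assumed: 6 million files averaging 50 KiB each.
total_bytes = 6_000_000 * 50 * KIB

# The same data stored as files of about 100 MiB each.
print(files_needed(total_bytes, 100 * MIB))
```

Under that assumption the same data would fit in a few thousand files instead of six million, which means far fewer objects for Glue to list and open.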

I hope these tips help. Feel free to ask if you need anything clarified.

Answer 2

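There is a way to control the number of files, called bounded execution (BoundedExecution). It is documented here: https://docs.aws.amazon.com/glue/latest/dg/bounded-execution.html

In the following examples you load 200 files at a time. Note that Glue bookmarks must be enabled for this to work.

If you use from_options, it looks like this: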

    DataSource0 = glueContext.create_dynamic_frame.from_options(
        format_options={"withHeader": True, "separator": separator, "quoteChar": quoteChar},
        connection_type="s3",
        format="csv",
        connection_options={"paths": inputFilePath,
                            "boundedFiles": "200", "recurse": True},
        transformation_ctx="DataSource0"
    )

If you use from_catalog, it looks like this:

    DataSource0 = glueContext.create_dynamic_frame.from_catalog(
        database = "database-name",
        table_name= "table-name",
        additional_options={"boundedFiles": "200"},
        transformation_ctx="DataSource0"
    )
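Because each run picks up only the next batch, it helps to estimate how many runs a full pass over the bucket would take. The sketch below uses made-up helpers (bounded_options, runs_needed) and is not part of the Glue API. It assumes the option values are passed as strings, as in the examples above, and includes boundedSize, the byte-based counterpart described on the same documentation page.

```python
def bounded_options(max_files=None, max_bytes=None):
    """Build the bounded-execution entries for connection_options or additional_options."""
    options = {}
    if max_files is not None:
        options["boundedFiles"] = str(max_files)
    if max_bytes is not None:
        options["boundedSize"] = str(max_bytes)
    return options


def runs_needed(total_files, files_per_run):
    """Job runs required to work through total_files (ceiling division)."""
    return -(-total_files // files_per_run)


print(bounded_options(max_files=200))
print(runs_needed(6_000_000, 200))
```

With roughly 6 million files, 200 files per run means tens of thousands of runs, so a larger boundedFiles value is probably more practical for the initial backfill.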

AWS Glue: limiting the data read from an S3 bucket
