AWS Glue: Unable to infer schema during incremental loads with job bookmarks

Time: 2020-02-11 10:08:00

Tags: amazon-s3 parquet aws-glue aws-glue-data-catalog

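I am working on an AWS Glue job that reads partitioned data (Parquet files) from S3 and uses job bookmarks. I ran into a problem when trying to use the job bookmark feature for daily incremental loads. This is how I read the data: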

val push: String = "p_date > '" + start + "' and (attribute=='x' or attribute=='y')"
logger.info("Using pushdown predicate: " + push)
val source = glueContext
  .getCatalogSource(database = "testbase", tableName = "testtable",
    pushDownPredicate = push, transformationContext = "source")
  .getDynamicFrame()

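This is the Input-files.json generated by AWS Glue. It was created by the job bookmark logic on the run immediately after the initial full load. No new data should be processed, and the empty "files" sections appear to reflect that correctly.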

[
  {
    "path": "s3://path/to/bucket/attribute=x",
    "files": []
  },
  {
    "path": "s3://path/to/bucket/attribute=y",
    "files": []
  }
]

However, instead of logging that the files were skipped, this is what happens:

After final job bookmarks filter, processing 0.00% of 0 files in partition DynamicFramePartition(com.amazonaws.services.glue.DynamicRecord@7d679e8a,s3://path/to/bucket/attribute=x,1578972694000). 
After final job bookmarks filter, processing 0.00% of 0 files in partition DynamicFramePartition(com.amazonaws.services.glue.DynamicRecord@7d679e8a,s3://path/to/bucket/attribute=y,1578972694000).

My guess is that Glue then tries to create an empty DynamicFrame, and it fails with the following message:

ERROR ApplicationMaster: User class threw exception: org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
    at org.apache.spark.sql.wrapper.SparkSqlDecoratorDataSource$$anonfun$3.apply(SparkSqlDecoratorDataSource.scala:38)
    at org.apache.spark.sql.wrapper.SparkSqlDecoratorDataSource$$anonfun$3.apply(SparkSqlDecoratorDataSource.scala:38)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.wrapper.SparkSqlDecoratorDataSource.getOrInferFileFormatSchema(SparkSqlDecoratorDataSource.scala:37)
    at org.apache.spark.sql.wrapper.SparkSqlDecoratorDataSource.resolveRelation(SparkSqlDecoratorDataSource.scala:53)
    at com.amazonaws.services.glue.SparkSQLDataSource$$anonfun$getDynamicFrame$8.apply(DataSource.scala:640)
    at com.amazonaws.services.glue.SparkSQLDataSource$$anonfun$getDynamicFrame$8.apply(DataSource.scala:604)
    at com.amazonaws.services.glue.util.FileSchemeWrapper$$anonfun$executeWithQualifiedScheme$1.apply(FileSchemeWrapper.scala:63)
    at com.amazonaws.services.glue.util.FileSchemeWrapper$$anonfun$executeWithQualifiedScheme$1.apply(FileSchemeWrapper.scala:63)
    at com.amazonaws.services.glue.util.FileSchemeWrapper.executeWith(FileSchemeWrapper.scala:57)
    at com.amazonaws.services.glue.util.FileSchemeWrapper.executeWithQualifiedScheme(FileSchemeWrapper.scala:63)
    at com.amazonaws.services.glue.SparkSQLDataSource.getDynamicFrame(DataSource.scala:603)

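Has anyone seen similar behavior with AWS Glue before? I am considering implementing an "empty check" for the DynamicFrame that is about to be created, to stop the job from failing. Or is there an AWS-native solution that makes job bookmarks work properly in this case?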
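
For illustration, here is a minimal sketch of what such an "empty check" could look like. According to the stack trace, the exception is thrown inside getDynamicFrame() itself, so the check probably has to wrap the read call rather than inspect the frame afterwards. The object EmptyIncrementGuard and the helper readOrSkip are hypothetical names, not part of the Glue API, and matching on the message text "Unable to infer schema" is an assumption based solely on the error shown above. This is a workaround sketch, not a confirmed AWS-recommended fix.

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical helper, not part of AWS Glue or Spark.
object EmptyIncrementGuard {
  // Message fragment copied from the AnalysisException in the stack trace above.
  val NoSchemaMessage = "Unable to infer schema"

  /** Evaluates `read` and returns None when it fails with the
    * "Unable to infer schema" error (assumed here to mean that the
    * bookmark filter left zero files to read). Any other failure is rethrown.
    */
  def readOrSkip[A](read: => A): Option[A] =
    Try(read) match {
      case Success(frame) => Some(frame)
      case Failure(e) if Option(e.getMessage).exists(_.contains(NoSchemaMessage)) => None
      case Failure(e) => throw e
    }
}

// Possible usage inside the job (glueContext, push and logger as in the question):
//
//   EmptyIncrementGuard.readOrSkip(
//     glueContext
//       .getCatalogSource(database = "testbase", tableName = "testtable",
//         pushDownPredicate = push, transformationContext = "source")
//       .getDynamicFrame()
//   ) match {
//     case Some(source) => // transform and write as usual
//     case None         => logger.info("No new files since the last bookmark, skipping this run")
//   }
```

Matching on the message keeps the helper free of Spark imports. Inside a real job it may be cleaner to match on org.apache.spark.sql.AnalysisException directly, so that unrelated errors with similar wording are not swallowed.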

0 answers:

No answers