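I am currently building a data lake in which AWS Glue jobs run daily to copy data from a database and make it queryable through AWS Athena. Because the schema of the data I ingest changes frequently, I crawl it regularly with a Glue crawler. Unfortunately, when I run the crawler on two consecutive days and the schema has changed in between, I get an error about incompatible schemas: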
HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://***/raw/itemstore/parquet_flattened/v1/type=articles/year=2019/month=12/day=12/part-00012-13fc8243-cd4e-47b8-8763-56b15ea46e84-c000.snappy.parquet (offset=0, length=32745292): Schema mismatch, metastore schema for row column item__timeline.element has 10 fields but parquet schema has 9 fields
This query ran against the "***" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: ***
Here is the CloudFormation code for our crawler:
ItemStoreCrawler:
  Type: AWS::Glue::Crawler
  Properties:
    Name: <A STRING>
    DatabaseName: !Ref DatabaseName
    Configuration: "{\"Version\": 1.0, \"CrawlerOutput\": {\"Partitions\": {\"AddOrUpdateBehavior\": \"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}}}"
    Role: !GetAtt CrawlerRole.Arn
    TablePrefix: String
    Tags:
      Platform: !Ref Platform
      Maintainer: !Ref Maintainer
      ServerType: !Ref ServerType
      ServiceName: !Sub ${ProjectName}
      Environment: !Ref Environment
    Targets:
      S3Targets:
        - Path: String
My guess is that the schema-merge behavior of my crawler is set incorrectly in the line starting with Configuration, but I cannot find a fix.
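For readability, the escaped Configuration string above decodes to the structure below. This is only a minimal Python sketch that rebuilds the same JSON with json.dumps; it makes no AWS calls, and the variable names are illustrative rather than part of the template.

```python
import json

# Same keys and values as the escaped Configuration string in the template.
crawler_configuration = {
    "Version": 1.0,
    "CrawlerOutput": {
        # Partitions take their schema from the table definition.
        "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"},
        # New columns found by the crawler are merged into the table.
        "Tables": {"AddOrUpdateBehavior": "MergeNewColumns"},
    },
}

# Serialize to the JSON string form that the Configuration property expects.
config_json = json.dumps(crawler_configuration)
print(config_json)
```

Generating the string this way avoids hand-escaping the quotes, which makes it easier to see which behavior is set for tables and which for partitions.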
Answer 0 (score: 0)
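This has to do with column order being ignored. I strongly recommend against using Glue crawlers; instead, use Glue as the Hive metastore and write the table directly into Athena to avoid this.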
https://docs.aws.amazon.com/athena/latest/ug/handling-schema-updates-chapter.html#summary-of-updates
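As a rough sketch of what defining the table yourself could look like, the Python below builds a table definition in the shape that boto3's `glue.create_table` accepts as `TableInput`. The table name, the column names and the struct fields are hypothetical placeholders, and the bucket is left elided; only the partition keys (type, year, month, day) and the path layout are taken from the error message in the question. The AWS call itself is left as a comment so the snippet runs without credentials.

```python
def build_table_input(name, location, columns, partition_keys):
    """Build a Parquet table definition shaped like Glue's TableInput."""
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        "PartitionKeys": [{"Name": n, "Type": t} for n, t in partition_keys],
        "StorageDescriptor": {
            # The column list is written out explicitly instead of inferred.
            "Columns": [{"Name": n, "Type": t} for n, t in columns],
            "Location": location,
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    }


table_input = build_table_input(
    name="itemstore",  # hypothetical table name
    location="s3://<bucket>/raw/itemstore/parquet_flattened/v1/",  # bucket elided
    columns=[
        ("item__id", "string"),  # hypothetical column
        # Hypothetical struct fields; the real struct has 9 or 10 fields.
        ("item__timeline", "array<struct<event:string,ts:timestamp>>"),
    ],
    partition_keys=[
        ("type", "string"),
        ("year", "string"),
        ("month", "string"),
        ("day", "string"),
    ],
)

# With AWS credentials configured, the definition could be registered with:
#   import boto3
#   boto3.client("glue").create_table(DatabaseName="<database>", TableInput=table_input)
```

With a hand-maintained definition like this, schema changes become explicit edits to the column list rather than whatever the crawler infers on a given day. Partitions would still have to be registered separately, for example with `MSCK REPAIR TABLE` in Athena.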