AWS Athena: Delete partitions between date range

Time: 2019-02-18 11:49:02

Tags: amazon-web-services amazon-athena presto

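I have an Athena table with date-based partitions, like this:

20190218

I want to delete all the partitions that were created last year.

I tried the following queries, but neither of them worked.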

ALTER TABLE tblname DROP PARTITION (partition1 < '20181231');

ALTER TABLE tblname DROP PARTITION (partition1 > '20181010'), Partition (partition1 < '20181231');

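3 Answers:

Answer 0: (score: 1)

Although Athena SQL may not support it at this time, the Glue API call GetPartitions (which Athena uses under the hood for queries) supports complex filter expressions, similar to what you can write in a SQL WHERE expression.

Instead of deleting the partitions through Athena, you can call GetPartitions followed by BatchDeletePartition using the Glue API.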
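A minimal sketch of the GetPartitions plus BatchDeletePartition approach, not a definitive implementation. The function names `find_partitions` and `drop_partitions` are hypothetical, and the `glue` argument is assumed to be a boto3 Glue client (for example `boto3.client('glue')`). The partition key `partition1` and the table name `tblname` are taken from the question.

```python
from typing import Any, Dict, List

# BatchDeletePartition accepts at most 25 partitions per request.
BATCH_LIMIT = 25


def find_partitions(glue: Any, database: str, table: str,
                    expression: str) -> List[Dict[str, List[str]]]:
    """Return the value lists of all partitions matching a Glue filter expression."""
    paginator = glue.get_paginator("get_partitions")
    pages = paginator.paginate(
        DatabaseName=database, TableName=table, Expression=expression)
    return [{"Values": p["Values"]} for page in pages for p in page["Partitions"]]


def drop_partitions(glue: Any, database: str, table: str, expression: str) -> int:
    """Delete the matching partitions from the catalog; return how many were requested."""
    # Collect everything first so that deleting does not interfere with pagination.
    to_delete = find_partitions(glue, database, table, expression)
    for start in range(0, len(to_delete), BATCH_LIMIT):
        batch = to_delete[start:start + BATCH_LIMIT]
        response = glue.batch_delete_partition(
            DatabaseName=database, TableName=table, PartitionsToDelete=batch)
        # Per-partition failures are reported in the response rather than raised.
        errors = response.get("Errors", [])
        if errors:
            raise RuntimeError(f"Failed to delete some partitions: {errors}")
    return len(to_delete)
```

With a real client, a call such as `drop_partitions(glue, 'database_name', 'tblname', "partition1 >= '20180101' AND partition1 <= '20181231'")` would target the 2018 partitions. Note that this only removes the catalog entries; the underlying files in S3 are left in place.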

Answer 1: (score: 0)

According to https://docs.aws.amazon.com/athena/latest/ug/alter-table-drop-partition.html, ALTER TABLE tblname DROP PARTITION takes a partition spec, so ranges are not allowed.

In Presto you could run DELETE FROM tblname WHERE ..., but Athena does not support DELETE either.
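Because a partition spec names exact values rather than a range, one possible workaround is to expand the range on the client side into one spec per day and drop them all in a single statement. The sketch below only builds the SQL string. The helper name `drop_partition_statement` is hypothetical, and it assumes one partition per calendar day in the `YYYYMMDD` format shown in the question, and that the resulting statement stays within Athena's query length limit.

```python
from datetime import date, timedelta


def drop_partition_statement(table: str, key: str, first: date, last: date) -> str:
    """Build an ALTER TABLE ... DROP PARTITION statement covering first..last inclusive."""
    if last < first:
        raise ValueError("last must not be earlier than first")
    specs = []
    for offset in range((last - first).days + 1):
        day = first + timedelta(days=offset)
        # One equality spec per day, formatted like the question's partition values.
        specs.append(f"PARTITION ({key} = '{day:%Y%m%d}')")
    # IF EXISTS keeps the statement from failing on days that have no partition.
    return f"ALTER TABLE {table} DROP IF EXISTS " + ", ".join(specs) + ";"
```

For example, `drop_partition_statement('tblname', 'partition1', date(2018, 1, 1), date(2018, 12, 31))` produces a statement with 365 specs that could then be submitted to Athena.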

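For these reasons, you need to rely on some external solution.

For example:

  1. List the files, as described in https://stackoverflow.com/a/48824373/65458
  2. Delete the files and the directories that contain them
  3. Update the partition information (https://docs.aws.amazon.com/athena/latest/ug/msck-repair-table.html should be helpful)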
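The listing and deleting steps above could be sketched as follows. This is an illustration under assumptions, not the method from the linked answer: it lists objects by S3 prefix, the helper names `list_keys` and `delete_prefix` are hypothetical, the `s3` argument is assumed to be a boto3 S3 client (for example `boto3.client('s3')`), and the data is assumed to live under a Hive-style prefix such as `data/partition1=20181231/`.

```python
from typing import Any, List

# DeleteObjects accepts at most 1000 keys per request.
DELETE_LIMIT = 1000


def list_keys(s3: Any, bucket: str, prefix: str) -> List[str]:
    """Return every object key stored under the given prefix."""
    paginator = s3.get_paginator("list_objects_v2")
    keys: List[str] = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page holds no objects.
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


def delete_prefix(s3: Any, bucket: str, prefix: str) -> int:
    """Delete every object under the prefix; return how many keys were requested."""
    keys = list_keys(s3, bucket, prefix)
    for start in range(0, len(keys), DELETE_LIMIT):
        chunk = keys[start:start + DELETE_LIMIT]
        response = s3.delete_objects(
            Bucket=bucket,
            Delete={"Objects": [{"Key": key} for key in chunk], "Quiet": True})
        # Failed deletions are reported in the response rather than raised.
        errors = response.get("Errors", [])
        if errors:
            raise RuntimeError(f"Failed to delete some objects: {errors}")
    return len(keys)
```

Calling `delete_prefix(s3, 'my-bucket', 'data/partition1=20181231/')` once per partition prefix would remove the files for that day (the bucket and prefix here are placeholders). The partition metadata still has to be updated separately afterwards.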

Answer 2: (score: 0)

Here is a script that does what Theo recommends.

import json
import logging

import awswrangler as wr
import boto3
from botocore.exceptions import ClientError

logging.basicConfig(level=logging.INFO, format=logging.BASIC_FORMAT)
logger = logging.getLogger()


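def delete_partitions(database_name: str, table_name: str):
  """Delete every partition of the table from the Glue catalog (no date filter is applied)."""
  client = boto3.client('glue')
  paginator = client.get_paginator('get_partitions')
  page_count = 0
  partition_count = 0
  # Fetch 20 partitions per page; BatchDeletePartition accepts at most 25 per call.
  for page in paginator.paginate(DatabaseName=database_name, TableName=table_name,
      PaginationConfig={'PageSize': 20}):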
    page_count = page_count + 1
    partitions = page['Partitions']
    partitions_to_delete = []
    for partition in partitions:
      partition_count = partition_count + 1
      partitions_to_delete.append({'Values': partition['Values']})
      logger.info(f"Found partition {partition['Values']}")
    if partitions_to_delete:
      response = client.batch_delete_partition(DatabaseName=database_name, TableName=table_name,
        PartitionsToDelete=partitions_to_delete)
      logger.info(f'Deleted partitions with response: {response}')
    else:
      logger.info('Done with all partitions')


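def repair_table(database_name: str, table_name: str):
  """Run MSCK REPAIR TABLE, which re-registers Hive-style partition directories still present in S3."""
  client = boto3.client('athena')
  try:
    response = client.start_query_execution(QueryString='MSCK REPAIR TABLE ' + table_name + ';',
      QueryExecutionContext={'Database': database_name}, )
  except ClientError as err:
    logger.error(err.response['Error']['Message'])
  else:
    # wait_query blocks until the query finishes and raises if it failed or was cancelled.
    res = wr.athena.wait_query(query_execution_id=response['QueryExecutionId'])
    # default=str is needed because the result contains datetime values.
    logger.info(f"Query succeeded: {json.dumps(res, indent=2, default=str)}")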


if __name__ == '__main__':
  table = 'table_name'
  database = 'database_name'
  delete_partitions(database_name=database, table_name=table)
  repair_table(database_name=database, table_name=table)