Question

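I can't figure out what is wrong with my CTAS query: it splits the data into small files inside each partition, even though I haven't specified any bucketing columns. Is there a way to avoid these small files and store one file per partition, since files smaller than 128 MB cause additional overhead?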

    CREATE TABLE sampledb.yellow_trip_data_parquet
    WITH (
        format = 'PARQUET',
        parquet_compression = 'GZIP',
        external_location = 's3://mybucket/Athena/tables/parquet/',
        partitioned_by = ARRAY['year', 'month']
    )
    AS
    SELECT
        VendorID,
        tpep_pickup_datetime,
        tpep_dropoff_datetime,
        passenger_count,
        trip_distance,
        RatecodeID,
        store_and_fwd_flag,
        PULocationID,
        DOLocationID,
        payment_type,
        fare_amount,
        extra,
        mta_tax,
        tip_amount,
        tolls_amount,
        improvement_surcharge,
        total_amount,
        -- partition columns, derived from the pickup timestamp, come last
        date_format(date_parse(tpep_pickup_datetime, '%Y-%c-%d %k:%i:%s'), '%Y') AS year,
        date_format(date_parse(tpep_pickup_datetime, '%Y-%c-%d %k:%i:%s'), '%c') AS month
    FROM sampleDB.yellow_trip_data_raw;

[Image: the files in one of my partitions]

Answer 1

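Athena is a distributed system, and it scales the execution of your query through a mechanism you cannot observe. It looks like it decided to use 5 workers for your CTAS query, which results in 5 files in each partition.

You could try explicitly setting the bucket count to 1, but if I remember correctly you may still end up with multiple files.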
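
As a rough sketch of what that suggestion could look like, the CTAS below adds `bucketed_by` and `bucket_count = 1` to the table properties of the query from the question. The table name `yellow_trip_data_parquet_1bucket`, the S3 prefix, and the choice of `vendorid` as the bucketing column are assumptions made for illustration, and only a few columns are selected to keep it short. As noted above, this is not guaranteed to yield exactly one file per partition.

```sql
-- Sketch only: table name, S3 prefix, and bucketing column are assumptions.
CREATE TABLE sampledb.yellow_trip_data_parquet_1bucket
WITH (
    format = 'PARQUET',
    parquet_compression = 'GZIP',
    external_location = 's3://mybucket/Athena/tables/parquet_1bucket/',
    partitioned_by = ARRAY['year', 'month'],
    bucketed_by = ARRAY['vendorid'],  -- must not be one of the partition columns
    bucket_count = 1
)
AS
SELECT
    VendorID,
    tpep_pickup_datetime,
    total_amount,
    -- partition columns come last in the SELECT list
    date_format(date_parse(tpep_pickup_datetime, '%Y-%c-%d %k:%i:%s'), '%Y') AS year,
    date_format(date_parse(tpep_pickup_datetime, '%Y-%c-%d %k:%i:%s'), '%c') AS month
FROM sampledb.yellow_trip_data_raw;
```

The `external_location` prefix has to be empty before the query runs, so a fresh prefix is used here rather than the one from the question.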

Answer 2

I was able to solve this by creating a bucketing column (month_a). Here is the code:

    CREATE TABLE sampledb.yellow_trip_data_avro
    WITH (
        format = 'AVRO',
        external_location = 's3://a4189e1npss3001/Athena/internal_tables/avro/',
        partitioned_by = ARRAY['year', 'month'],
        bucketed_by = ARRAY['month_a'],
        bucket_count = 12
    )
    AS
    SELECT
        VendorID,
        tpep_pickup_datetime,
        tpep_dropoff_datetime,
        passenger_count,
        trip_distance,
        RatecodeID,
        store_and_fwd_flag,
        PULocationID,
        DOLocationID,
        payment_type,
        fare_amount,
        extra,
        mta_tax,
        tip_amount,
        tolls_amount,
        improvement_surcharge,
        total_amount,
        -- month_a duplicates month so it can serve as the bucketing column
        date_format(date_parse(tpep_pickup_datetime, '%Y-%c-%d %k:%i:%s'), '%c') AS month_a,
        date_format(date_parse(tpep_pickup_datetime, '%Y-%c-%d %k:%i:%s'), '%Y') AS year,
        date_format(date_parse(tpep_pickup_datetime, '%Y-%c-%d %k:%i:%s'), '%c') AS month
    FROM sampleDB.yellow_trip_data_raw;
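
Presumably this works because every row in a given year/month partition has the same `month_a` value, so all of them hash into the same bucket and only one bucket file is written per partition. One way to check the result, sketched here under the assumption that the table above was created successfully, is to count the distinct files per partition with Athena's `"$path"` pseudo-column:

```sql
-- Count the distinct data files Athena reads for each partition.
-- "$path" is a pseudo-column holding the S3 object each row came from.
SELECT
    year,
    month,
    count(DISTINCT "$path") AS file_count
FROM sampledb.yellow_trip_data_avro
GROUP BY year, month
ORDER BY year, month;
```

If the bucketing behaves as hoped, `file_count` should be 1 for every partition.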

How can I prevent an AWS Athena CTAS query from creating small files?
