AWS Athena shifts values into the wrong columns after partitions are loaded

Date: 2018-07-11 11:00:48

Tags: hadoop amazon-s3 hive partitioning amazon-athena

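I have run into the following problem:

  1. I created a non-partitioned Hive table on HDFS in an EMR cluster and loaded data into it.
  2. I created a second table based on the table from step 1, but with partition columns derived from the datetime: PARTITIONED BY (year STRING, month STRING, day STRING).
  3. I loaded the data from the non-partitioned table into the partitioned table and got valid results.
  4. I created an Athena database and tables with the same structure as the Hive tables.
  5. I copied the partitioned files from HDFS to the local file system and transferred them all to an empty S3 bucket with aws s3 sync. All files transferred without errors, in the same directory layout as the Hive directory on HDFS.
  6. I loaded the partitions with MSCK REPAIR TABLE and got no errors in the output.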

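After that, I found that many values are shifted: for example, values that belong in the "ip" column show up in the "operating_sys" column, and so on.

My scripts are: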

-- Hive tables

SET hive.exec.dynamic.partition = true;  
SET hive.exec.dynamic.partition.mode = nonstrict; 

CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs_page_part 
    ( 
        log_DATE STRING,  
        user_id STRING, 
        page_path STRING, 
        referer STRING,
        tracking_referer STRING,
        medium STRING,
        campaign STRING,
        source STRING,
        visitor_id STRING,
        ip STRING,
        session_id STRING,
        operating_sys STRING,
        ad_id STRING,
        keyword STRING,
        user_agent STRING
    )
PARTITIONED BY
(
        `year` STRING,
        `month` STRING,
        `day` STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/admin/events_partitioned';

CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs_event_part
    ( 
        log_DATE STRING, 
        user_id STRING, 
        category STRING, 
        action STRING, 
        label STRING, 
        value STRING,
        visitor_id STRING,
        ip STRING,
        session_id STRING,
        operating_sys STRING,
        extra_data_json STRING
    )
PARTITIONED BY
(
        `year` STRING,
        `month` STRING,
        `day` STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' 
STORED AS TEXTFILE
LOCATION '/user/admin/pages_partitioned';

INSERT INTO TABLE cloudfront_logs_page_part
PARTITION 
(
    `year`,
    `month`,
    `day`
)
SELECT
    log_DATE,
    user_id,
    page_path,
    referer,
    tracking_referer,
    medium, 
    campaign, 
    source,
    visitor_id,
    ip,
    session_id,
    operating_sys,
    ad_id,
    keyword,
    user_agent,
    year(log_DATE) as `year`,
    month(log_DATE) as `month`,
    day(log_DATE) as `day`
FROM
    cloudfront_logs_page;

INSERT INTO TABLE cloudfront_logs_event_part
PARTITION 
(
    `year`,
    `month`,
    `day`
)
SELECT
    log_DATE,
    user_id,
    category,
    action,
    label,
    value,
    visitor_id,
    ip,
    session_id,
    operating_sys,
    extra_data_json,
    year(log_DATE) as `year`,
    month(log_DATE) as `month`,
    day(log_DATE) as `day`
FROM
    cloudfront_logs_event;

-- Athena tables

CREATE DATABASE IF NOT EXISTS test
LOCATION 's3://...';

DROP TABLE IF EXISTS test.cloudfront_logs_page_ath;

CREATE EXTERNAL TABLE IF NOT EXISTS powtoon_hive.cloudfront_logs_page_ath ( 
    log_DATE STRING,  
    user_id STRING, 
    page_path STRING, 
    referer STRING,
    tracking_referer STRING,
    medium STRING,
    campaign STRING,
    source STRING,
    visitor_id STRING,
    ip STRING,
    session_id STRING,
    operating_sys STRING,
    ad_id STRING,
    keyword STRING,
    user_agent STRING
)
PARTITIONED BY (`year` STRING,`month` STRING, `day` STRING)
ROW FORMAT DELIMITED
FIELDS   TERMINATED BY ','
LOCATION 's3://.../';

DROP TABLE IF EXISTS test.cloudfront_logs_event_ath;

CREATE EXTERNAL TABLE IF NOT EXISTS test.cloudfront_logs_event_ath 
    ( 
        log_DATE STRING, 
        user_id STRING, 
        category STRING, 
        action STRING, 
        label STRING, 
        value STRING,
        visitor_id STRING,
        ip STRING,
        session_id STRING,
        operating_sys STRING,
        extra_data_json STRING
    )
PARTITIONED BY (`year` STRING,`month` STRING, `day` STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://.../';

What is the problem? The table structure? The Athena metadata?

1 Answer:

Answer 0: (score: 1)

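The simplest approach is to convert the raw files directly into partitioned Parquet, a columnar format. That gives you partitioning, columnar storage, predicate pushdown, and all the other fancy buzzwords.

See: Converting to Columnar Formats - Amazon Athena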
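
As a rough illustration of that conversion, here is a minimal HiveQL sketch for the event table from the question. The table name cloudfront_logs_event_parquet and the HDFS path are hypothetical; the column list and the source table cloudfront_logs_event are taken from the question's scripts. If the shifted values come from commas inside field values (an assumption, the question does not confirm it), Parquet would sidestep the issue, because it stores each column separately and does not rely on a field delimiter.

```sql
-- Sketch only: the table name and LOCATION below are assumptions, not from the question.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs_event_parquet
    (
        log_DATE STRING,
        user_id STRING,
        category STRING,
        action STRING,
        label STRING,
        value STRING,
        visitor_id STRING,
        ip STRING,
        session_id STRING,
        operating_sys STRING,
        extra_data_json STRING
    )
PARTITIONED BY
(
        `year` STRING,
        `month` STRING,
        `day` STRING
)
STORED AS PARQUET                               -- replaces ROW FORMAT DELIMITED / TEXTFILE
LOCATION '/user/admin/events_parquet';          -- hypothetical path

-- Same dynamic-partition insert as in the question, now writing Parquet files.
INSERT INTO TABLE cloudfront_logs_event_parquet
PARTITION
(
    `year`,
    `month`,
    `day`
)
SELECT
    log_DATE,
    user_id,
    category,
    action,
    label,
    value,
    visitor_id,
    ip,
    session_id,
    operating_sys,
    extra_data_json,
    year(log_DATE) as `year`,
    month(log_DATE) as `month`,
    day(log_DATE) as `day`
FROM
    cloudfront_logs_event;
```

On the Athena side, the matching table would presumably declare the same columns and partitions with STORED AS PARQUET instead of ROW FORMAT DELIMITED, point LOCATION at the S3 prefix the files were synced to, and then load the partitions with MSCK REPAIR TABLE as before.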