How do I create a table over partitioned data in Hive?

Time: 2018-03-29 11:44:14

Tags: apache hive create-table hive-partitions

    drwxr-xr-x - mgupta supergroup 0 2018-03-26 22:16 /kylin/retailer/qi_basket_brand_bucket_fact/_impala_insert_staging
    drwxr-xr-x - mgupta supergroup 0 2018-03-26 22:18 /kylin/retailer/qi_basket_brand_bucket_fact/product_hierarchy_type=CI
    [mgupta@sjc-dev-binn01 ~]$ hadoop fs -ls /kylin/retailer/qi_basket_brand_bucket_fact/product_hierarchy_type=CI
    Found 27 items
    drwxr-xr-x - mgupta supergroup 0 2018-03-26 22:16 /kylin/retailer/qi_basket_brand_bucket_fact/product_hierarchy_type=CI/month_id=201601
    drwxr-xr-x - mgupta supergroup 0 2018-03-26 22:16 /kylin/retailer/qi_basket_brand_bucket_fact/product_hierarchy_type=CI/month_id=201602
    drwxr-xr-x - mgupta supergroup 0 2018-03-26 22:16 /kylin/retailer/qi_basket_brand_bucket_fact/product_hierarchy_type=CI/month_id=201603
    drwxr-xr-x - mgupta supergroup 0 2018-03-26 22:16 /kylin/retailer/qi_basket_brand_bucket_fact/product_hierarchy_type=CI/month_id=201604
    drwxr-xr-x - mgupta supergroup 0 2018-03-26 22:16 /kylin/retailer/qi_basket_brand_bucket_fact/product_hierarchy_type=CI/month_id=201605
    drwxr-xr-x - mgupta supergroup 0 2018-03-26 22:16 /kylin/retailer/qi_basket_brand_bucket_fact/product_hierarchy_type=CI/month_id=201606
    drwxr-xr-x - mgupta supergroup 0 2018-03-26 22:16 /kylin/retailer/qi_basket_brand_bucket_fact/product_hierarchy_type=CI/month_id=201607
    drwxr-xr-x - mgupta supergroup 0 2018-03-26 22:16 /kylin/retailer/qi_basket_brand_bucket_fact/product_hierarchy_type=CI/month_id=201608
    drwxr-xr-x - mgupta supergroup 0 2018-03-26 22:17 /kylin/retailer/qi_basket_brand_bucket_fact/product_hierarchy_type=CI/month_id=201609
    drwxr-xr-x - mgupta supergroup 0 2018-03-26 22:17 /kylin/retailer/qi_basket_brand_bucket_fact/product_hierarchy_type=CI/month_id=201610
    drwxr-xr-x - mgupta supergroup 0 2018-03-26 22:17 /kylin/retailer/qi_basket_brand_bucket_fact/product_hierarchy_type=CI/month_id=201611
    drwxr-xr-x - mgupta supergroup 0 2018-03-26 22:17 /kylin/retailer/qi_basket_brand_bucket_fact/product_hierarchy_type=CI/month_id=201612
    drwxr-xr-x - mgupta supergroup 0 2018-03-26 22:17 /kylin/retailer/qi_basket_brand_bucket_fact/product_hierarchy_type=CI/month_id=201701
    drwxr-xr-x - mgupta supergroup 0 2018-03-26 22:17 /kylin/retailer/qi_basket_brand_bucket_fact/product_hierarchy_type=CI/month_id=201702
    drwxr-xr-x - mgupta supergroup 0 2018-03-26 22:17 /kylin/retailer/qi_basket_brand_bucket_fact/product_hierarchy_type=CI/month_id=201703
    drwxr-xr-x - mgupta supergroup 0 2018-03-26 22:17 /kylin/retailer/qi_basket_brand_bucket_fact/product_hierarchy_type=CI/month_id=201704
    drwxr-xr-x - mgupta supergroup 0 2018-03-26 22:17 /kylin/retailer/qi_basket_brand_bucket_fact/product_hierarchy_type=CI/month_id=201705
    drwxr-xr-x - mgupta supergroup 0 2018-03-26 22:17 /kylin/retailer/qi_basket_brand_bucket_fact/product_hierarchy_type=CI/month_id=201706
    drwxr-xr-x - mgupta supergroup 0 2018-03-26 22:17 /kylin/retailer/qi_basket_brand_bucket_fact/product_hierarchy_type=CI/month_id=201707
    drwxr-xr-x - mgupta supergroup 0 2018-03-26 22:18 /kylin/retailer/qi_basket_brand_bucket_fact/product_hierarchy_type=CI/month_id=201708
    drwxr-xr-x - mgupta supergroup 0 2018-03-26 22:18 /kylin/retailer/qi_basket_brand_bucket_fact/product_hierarchy_type=CI/month_id=201709
    drwxr-xr-x - mgupta supergroup 0 2018-03-26 22:18 /kylin/retailer/qi_basket_brand_bucket_fact/product_hierarchy_type=CI/month_id=201710
    drwxr-xr-x - mgupta supergroup 0 2018-03-26 22:18 /kylin/retailer/qi_basket_brand_bucket_fact/product_hierarchy_type=CI/month_id=201711
    drwxr-xr-x - mgupta supergroup 0 2018-03-26 22:18 /kylin/retailer/qi_basket_brand_bucket_fact/product_hierarchy_type=CI/month_id=201712
    drwxr-xr-x - mgupta supergroup 0 2018-03-26 22:18 /kylin/retailer/qi_basket_brand_bucket_fact/product_hierarchy_type=CI/month_id=201801
    drwxr-xr-x - mgupta supergroup 0 2018-03-26 22:18 /kylin/retailer/qi_basket_brand_bucket_fact/product_hierarchy_type=CI/month_id=201802
    drwxr-xr-x - mgupta supergroup 0 2018-03-26 22:18 /kylin/retailer/qi_basket_brand_bucket_fact/product_hierarchy_type=CI/month_id=201803
    [mgupta@sjc-dev-binn01 ~]$ hadoop fs -ls /kylin/retailer/qi_basket_brand_bucket_fact/product_hierarchy_type=CI/month_id=201601
    Found 3 items
    drwxr-xr-x - mgupta supergroup 0 2018-03-26 22:16 /kylin/retailer/qi_basket_brand_bucket_fact/product_hierarchy_type=CI/month_id=201601/company_sid=0
    drwxr-xr-x - mgupta supergroup 0 2018-03-26 22:16 /kylin/retailer/qi_basket_brand_bucket_fact/product_hierarchy_type=CI/month_id=201601/company_sid=38527
    drwxr-xr-x - mgupta supergroup 0 2018-03-26 22:16 /kylin/retailer/qi_basket_brand_bucket_fact/product_hierarchy_type=CI/month_id=201601/company_sid=__HIVE_DEFAULT_PARTITION__
    [mgupta@sjc-dev-binn01 ~]$ hadoop fs -ls /kylin/retailer/qi_basket_brand_bucket_fact/product_hierarchy_type=CI/month_id=201601/company_sid=0
    Found 1 items
    -rw-r--r-- 3 mgupta supergroup 2069014 2018-03-26 22:16 /kylin/retailer/qi_basket_brand_bucket_fact/product_hierarchy_type=CI/month_id=201601/company_sid=0/f9466a0068b906cf-6ace7f8500000049_294515768_data.0.parq
    [mgupta@sjc-dev-binn01 ~]$

2 answers:

Answer 0 (score: 1)

You can try either of the following approaches.

Method 1

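  1. Determine the schema (column names and types, including the partition columns).
  2. Create a partitioned Hive table (make sure to add the partition column and the delimiter information).
  3. Load the data into the partitioned table.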

    create table <table_name> (col1 data_type1, col2 data_type2..)
    partitioned by (part_col data_type3)
    row format delimited
    fields terminated by '<field_delimiter_in_your_data>';
    
    -- load data inpath moves the source file into the table,
    -- so each partition needs its own file
    load data inpath '/hdfs/loc/file1' into table <table_name>
    partition (<part_col>='201601');
    
    load data inpath '/hdfs/loc/file2' into table <table_name>
    partition (<part_col>='201602');
    
    load data inpath '/hdfs/loc/file3' into table <table_name>
    partition (<part_col>='201603');
    
    

    And so on for the remaining partitions.
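Because every statement in Method 1 has the same shape, the list could be generated rather than typed by hand. The following is only a minimal Python sketch: the function name `load_statements`, the table name `my_table`, and the mapping from partition value to HDFS path are all hypothetical placeholders, and the script only builds HiveQL text; it does not run anything against Hive.

```python
def load_statements(table, part_col, files_by_partition):
    """Build one LOAD DATA statement per (partition value, HDFS path) pair.

    All arguments are illustrative placeholders, not names from the question.
    """
    statements = []
    for part_value, hdfs_path in sorted(files_by_partition.items()):
        statements.append(
            f"LOAD DATA INPATH '{hdfs_path}' INTO TABLE {table} "
            f"PARTITION ({part_col}='{part_value}');"
        )
    return statements


if __name__ == "__main__":
    # Hypothetical mapping: one source file per month partition.
    files = {
        "201601": "/hdfs/loc/file1",
        "201602": "/hdfs/loc/file2",
        "201603": "/hdfs/loc/file3",
    }
    for stmt in load_statements("my_table", "month_id", files):
        print(stmt)
```

The printed statements can then be saved to a script file and executed with whatever Hive client is in use.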

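Method 2

  1. Create a staging table with the same schema as the main table, but without any partitions.
  2. Load all of your data into this table (make sure the 'partition column' is present as a field in these files).
  3. Load the data from the staging table into the main table using a dynamic-partition insert.

    -- the staging table holds part_col as an ordinary column
    create table <staging_table> (col1 data_type1, col2 data_type2.., part_col data_type3)
    row format delimited
    fields terminated by '<field_delimiter_in_your_data>';
    
    create table <main_table> (col1 data_type1, col2 data_type2..)
    partitioned by (part_col data_type3);
    
    load data inpath '/hdfs/loc/directory/' into table <staging_table>;
    
    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;
    
    insert into table <main_table>
    partition (part_col)
    select col1, col2, ..., part_col from <staging_table>;
    
The key points of Method 2 are:

  • 'part_col' must be available as a field in the loaded files.
  • In the final insert statement, select 'part_col' as the last field of the select clause.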
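The ordering rule is easy to get wrong when there are several partition columns, as in the question's layout (`product_hierarchy_type`, `month_id`, `company_sid`). Here is a small Python sketch that assembles the statements with the partition columns always placed last. The function name, table names, and data column names are hypothetical; only the two `SET` options and the partition column names come from the text above.

```python
def dynamic_partition_insert(main_table, staging_table, data_cols, part_cols):
    """Return the HiveQL statements for a dynamic-partition insert.

    Partition columns are appended after the data columns, in the same
    order as they appear in the PARTITION clause.
    """
    select_list = ", ".join(list(data_cols) + list(part_cols))
    partition_spec = ", ".join(part_cols)
    return [
        "SET hive.exec.dynamic.partition=true;",
        "SET hive.exec.dynamic.partition.mode=nonstrict;",
        f"INSERT INTO TABLE {main_table} PARTITION ({partition_spec}) "
        f"SELECT {select_list} FROM {staging_table};",
    ]


if __name__ == "__main__":
    statements = dynamic_partition_insert(
        main_table="main_table",        # hypothetical name
        staging_table="staging_table",  # hypothetical name
        data_cols=["col1", "col2"],     # hypothetical data columns
        part_cols=["product_hierarchy_type", "month_id", "company_sid"],
    )
    print("\n".join(statements))
```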

Answer 1 (score: 0)

Let's create a table that is partitioned by year and month and contains a timestamp column:

    CREATE TABLE `mypart_p`(
       `id` bigint,
       `open_ts` string
    )
    PARTITIONED BY (YEAR INT, MONTH INT);

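Now I have to alter the table to add a partition:

    ALTER TABLE mypart_p ADD PARTITION (YEAR=2020, MONTH=1);

I have to do this for every year and month, which I do in a loop in Python. Now let's fill the table with data and specify which partition that data belongs to: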

    INSERT INTO mypart_p PARTITION (YEAR=2020, MONTH=1)
    SELECT id,
           open_ts
    FROM some_other_table
    -- substring positions are 1-based; this assumes open_ts looks like 'YYYY-MM-...'
    WHERE substring(open_ts, 1, 4) = '2020'
      AND substring(open_ts, 6, 2) = '01';
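
The Python loop mentioned above is not shown in the answer. A minimal sketch of what it might look like follows; the function name and the year range are assumptions, and `IF NOT EXISTS` is an addition of this sketch (not part of the original statement) so that re-running the script does not fail on partitions that already exist. The script only generates HiveQL text.

```python
def add_partition_statements(table, years, months=range(1, 13)):
    """Build one ALTER TABLE ... ADD PARTITION statement per (year, month)."""
    return [
        f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION (YEAR={year}, MONTH={month});"
        for year in years
        for month in months
    ]


if __name__ == "__main__":
    # Hypothetical range of years; adjust to the data actually present.
    for stmt in add_partition_statements("mypart_p", years=[2019, 2020]):
        print(stmt)
```

The same loop could drive the `INSERT` statement as well, substituting the year and month into both the `PARTITION` clause and the `WHERE` filter.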