Hive - Create an internal table from three external tables

Time: 2020-04-24 11:05:11

Tags: hive hdfs hiveql hive-table

I have three external tables in Hive:

Table 1:

CREATE EXTERNAL TABLE IF NOT EXISTS table_1(
unique_key_column_1 VARCHAR,
column_needed_1 DATE,   
redundant_column_1 VARCHAR,
redundant_column_2 VARCHAR,
redundant_column_3 VARCHAR,
column_needed_2 TIMESTAMP,
redundant_column_4 VARCHAR,
redundant_column_5 VARCHAR,
column_needed_3 INT,
redundant_column_6 VARCHAR,
redundant_column_7 VARCHAR)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE location '/user/<username>/visdata';

Table 2:

CREATE EXTERNAL TABLE IF NOT EXISTS table_2(
unique_key_column_1 VARCHAR,
column_needed_4 VARCHAR,
column_needed_5 VARCHAR,
unique_key_column_2 VARCHAR,
redundant_column_1 VARCHAR,
redundant_column_2 VARCHAR,
redundant_column_3 VARCHAR,
column_needed_6 TINYINT,
redundant_column_4 VARCHAR,
redundant_column_5 VARCHAR,
column_needed_7 DATE,
redundant_column_6 VARCHAR,
redundant_column_7 VARCHAR)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE location '/user/<username>/visdata';

Table 3:

CREATE EXTERNAL TABLE IF NOT EXISTS table_3(
unique_key_column_2 VARCHAR,
redundant_column_1 VARCHAR,
redundant_column_2 VARCHAR,
redundant_column_3 VARCHAR,
redundant_column_4 VARCHAR,
redundant_column_5 VARCHAR,
column_needed_8 VARCHAR,
column_needed_9 TINYINT,
redundant_column_6 VARCHAR,
redundant_column_7 VARCHAR,
column_needed_10 TIMESTAMP)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE location '/user/<username>/visdata';

I now want to create a managed table by left outer joining the tables above on my two unique key columns, giving a result like this:

unique_key_column_1 column_needed_1 column_needed_2 column_needed_3 column_needed_4 column_needed_5 column_needed_6 column_needed_7 unique_key_column_2 column_needed_8 column_needed_9 column_needed_10
key_entry_1_1 entry_1_1 entry_1_2 entry_1_3 entry_1_4 entry_1_5 entry_1_6 entry_1_7 key_entry_1_2 entry_1_8 entry_1_9 entry_1_10
key_entry_2_2 entry_2_1 entry_2_2 entry_2_3 entry_2_4 entry_2_5 entry_2_6 entry_2_7 key_entry_2_2 entry_2_8 entry_2_9 entry_2_10

How can I do this?

Edit 1:
This is what I came up with to join two of the tables. I still don't know how to combine all three tables into one:

> create table combined_table;
> insert into combined_table SELECT r.unique_key_column_1, r.column_needed_1, r.column_needed_2, r.column_needed_3, o.column_needed_4, o.column_needed_5, o.column_needed_6, o.column_needed_7 FROM table_1 r LEFT OUTER JOIN table_2 o ON (r.unique_key_column_1 = o.unique_key_column_1);

Edit 2:
I just realized that joins are expensive. So, is there some way I can use partitions instead?

2 answers:

Answer 0: (score: 1)

@NaveenKumar The solution here is to write the schema for the combinedTable you need, then insert the result of joining the 3 tables into that final table.

INSERT INTO combinedTable [SELECT JOIN QUERY HERE]
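
As a rough sketch of that approach, the schema and insert could look like the following. It assumes the needed columns keep the types declared in the three external tables, and it uses STRING in place of the unsized VARCHAR columns (an assumption on my part, since Hive's VARCHAR expects a length). The aliases a, b and c are only illustrative.

```sql
-- Managed (internal) table: no EXTERNAL keyword and no LOCATION clause,
-- so Hive stores and owns the data in its warehouse directory.
CREATE TABLE IF NOT EXISTS combinedTable (
  unique_key_column_1 STRING,    -- assumption: STRING instead of unsized VARCHAR
  column_needed_1     DATE,
  column_needed_2     TIMESTAMP,
  column_needed_3     INT,
  column_needed_4     STRING,
  column_needed_5     STRING,
  column_needed_6     TINYINT,
  column_needed_7     DATE,
  unique_key_column_2 STRING,
  column_needed_8     STRING,
  column_needed_9     TINYINT,
  column_needed_10    TIMESTAMP
);

-- Fill the table from the three external tables.
-- table_1 drives the join, so every table_1 row is kept even without a match.
INSERT INTO TABLE combinedTable
SELECT
  a.unique_key_column_1,
  a.column_needed_1,
  a.column_needed_2,
  a.column_needed_3,
  b.column_needed_4,
  b.column_needed_5,
  b.column_needed_6,
  b.column_needed_7,
  b.unique_key_column_2,   -- taken from table_2, so it is set whenever table_2 matched
  c.column_needed_8,
  c.column_needed_9,
  c.column_needed_10
FROM table_1 a
LEFT JOIN table_2 b
  ON a.unique_key_column_1 = b.unique_key_column_1
LEFT JOIN table_3 c
  ON b.unique_key_column_2 = c.unique_key_column_2;
```

The SELECT list has to match the column order of the table definition, because INSERT ... SELECT in Hive maps columns by position rather than by name.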

Answer 1: (score: 1)

You can create the combined table by left joining all three tables. Check the query below.

Create the table and insert the data:

CREATE TABLE IF NOT EXISTS COMBINED_TABLE AS 
SELECT
   TBLA.UNIQUE_KEY_COLUMN_1,
   TBLA.COLUMN_NEEDED_1,
   TBLA.COLUMN_NEEDED_2,
   TBLA.COLUMN_NEEDED_3,
   TBLB.COLUMN_NEEDED_4,
   TBLB.COLUMN_NEEDED_5,
   TBLB.COLUMN_NEEDED_6,
   TBLB.COLUMN_NEEDED_7,
   TBLC.UNIQUE_KEY_COLUMN_2,
   TBLC.COLUMN_NEEDED_8,
   TBLC.COLUMN_NEEDED_9,
   TBLC.COLUMN_NEEDED_10
FROM
   TABLE_1 TBLA 
   LEFT JOIN
      TABLE_2 TBLB 
      ON TBLA.UNIQUE_KEY_COLUMN_1 = TBLB.UNIQUE_KEY_COLUMN_1 
   LEFT JOIN
      TABLE_3 TBLC 
      ON TBLB.UNIQUE_KEY_COLUMN_2 = TBLC.UNIQUE_KEY_COLUMN_2;

If the target table has already been created, insert the data into it:

INSERT INTO COMBINED_TABLE
SELECT
   TBLA.UNIQUE_KEY_COLUMN_1,
   TBLA.COLUMN_NEEDED_1,
   TBLA.COLUMN_NEEDED_2,
   TBLA.COLUMN_NEEDED_3,
   TBLB.COLUMN_NEEDED_4,
   TBLB.COLUMN_NEEDED_5,
   TBLB.COLUMN_NEEDED_6,
   TBLB.COLUMN_NEEDED_7,
   TBLC.UNIQUE_KEY_COLUMN_2,
   TBLC.COLUMN_NEEDED_8,
   TBLC.COLUMN_NEEDED_9,
   TBLC.COLUMN_NEEDED_10
FROM
   TABLE_1 TBLA 
   LEFT JOIN
      TABLE_2 TBLB 
      ON TBLA.UNIQUE_KEY_COLUMN_1 = TBLB.UNIQUE_KEY_COLUMN_1 
   LEFT JOIN
      TABLE_3 TBLC 
      ON TBLB.UNIQUE_KEY_COLUMN_2 = TBLC.UNIQUE_KEY_COLUMN_2;
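
A quick way to sanity-check the result, sketched under the assumption that the key columns really are unique in each source table (as their names suggest) and that COMBINED_TABLE was populated only once: a chain of left joins starting from TABLE_1 should then produce exactly one row per TABLE_1 row.

```sql
-- Row count of the driving table.
SELECT COUNT(*) AS TABLE_1_ROWS FROM TABLE_1;

-- Row count of the combined table; expected to match the count above
-- if the join keys are unique in TABLE_2 and TABLE_3.
SELECT COUNT(*) AS COMBINED_ROWS FROM COMBINED_TABLE;

-- Keys that occur more than once in the combined table.
-- Any rows returned here point to duplicate keys in a source table
-- or to the data having been inserted more than once.
SELECT
   UNIQUE_KEY_COLUMN_1,
   COUNT(*) AS N
FROM
   COMBINED_TABLE
GROUP BY
   UNIQUE_KEY_COLUMN_1
HAVING
   COUNT(*) > 1;
```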