Question

I am new to Hive, and I would be grateful if someone could help me with my Hive query.

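There are two tables, A and B, with exactly the same schema, and the data in each is partitioned on 4 columns. I need to merge these two tables into one table with (4 + 1 = 5) partition columns. The added partition column tells which table the data came from. For example, suppose the new partition column is named "source". If the data comes from table A, source equals "from_A"; if the data comes from table B, source equals "from_B".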

hive> desc A;
OK
col1 string,
col2 string,
DD   string,                                    
EE   string,                                    
FF   string,                                    
GG   string 

# Partition Information      
# col_name              data_type               

DD              string                                      
EE              string                                      
FF                  string                                      
GG              string

and

hive> desc B;
OK
col1 string,
col2 string,
DD   string,                                    
EE   string,                                    
FF   string,                                    
GG   string

# Partition Information      
# col_name              data_type               comment             

DD              string                                      
EE              string                                      
FF                  string                                      
GG              string

Answer 1

Create the new partitioned table:

Create table C (
col1 string,
col2 string
)
partitioned by (
source string,
DD   string,                                    
EE   string,                                    
FF   string,                                    
GG   string
);

Then load the data from table A into the new table:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table C partition(source,DD,EE,FF,GG)
select col1, col2, 
       --partitions
      'from_A' source, DD, EE, FF, GG 
  from A
distribute by DD, EE, FF, GG;

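Load the data from table B in parallel. It writes only to the source='from_B' partitions, so it does not overwrite the data loaded from A: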

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table C partition(source,DD,EE,FF,GG)
select col1, col2, 
      --partitions
      'from_B' source, DD, EE, FF, GG 
 from B
distribute by DD, EE, FF, GG;
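
If running two separate statements is not required, the same result can probably be produced in one statement with UNION ALL. This is only a sketch, not part of the original answer: the subquery alias `u` is an arbitrary name, and the sketch assumes the table C created above. The dynamic partition columns come last in the select list, in the same order as in the partition clause.

```sql
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- Sketch: load both tables in a single statement.
-- The alias "u" is hypothetical; any name works.
insert overwrite table C partition(source, DD, EE, FF, GG)
select col1, col2,
       --partitions
       source, DD, EE, FF, GG
  from (
        select col1, col2, 'from_A' as source, DD, EE, FF, GG from A
        union all
        select col1, col2, 'from_B' as source, DD, EE, FF, GG from B
       ) u
distribute by source, DD, EE, FF, GG;
```

Whichever variant is used, the load can be checked afterwards by listing the partitions and counting rows per source:

```sql
-- Each partition name should begin with source=from_A or source=from_B.
show partitions C;

-- Row counts per source should match count(*) on A and B respectively.
select source, count(*) as row_count
  from C
 group by source;
```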
