Question

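I have a temporary external table and have loaded data from HDFS into it. I am now inserting the same data into a partitioned main external table. The insert succeeds, but when I query the main table by column name, the columns return different values.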

I loaded the data from a CSV file into the temporary table, which has four fields.

col1=id
col2=visitDate
col3=comment
col4=age

The queries and their results are below:

Temporary table:

create external table IF NOT EXISTS incr_dummy1(id string,visitDate string,comment string, age string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
;

MAIN Table:

create external table IF NOT EXISTS  dummy1(id string,comment string)
PARTITIONED BY (visitDate string, age string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS ORC
;


Result:

Temporary table:

select * from incr_dummy1;

1       11      a       20
2       12      b       3
1       13      c       34
4       14      d       23
5       15      e       45
6       16      f       65
7       17      g       78
8       18      h       9
9       19      i       12
10      20      j       34

select visitDate,age from incr_dummy1;

11      20
12      3
13      34
14      23
15      45
16      65
17      78
18      9
19      12
20      34


Main Table:

select * from dummy1;

1       11      a       20
2       12      b       3
1       13      c       34
4       14      d       23
5       15      e       45
6       16      f       65
7       17      g       78
8       18      h       9
9       19      i       12
10      20      j       34

select visitDate,age from dummy1;

a       20
b       3
c       34
d       23
e       45
f       65
g       78
h       9
i       12
j       34

So in the main external table above, when I query the "visitDate" column, the values of the "comment" column come back instead.

Please let me know what mistake I am making here.

Answer 1

I can see that the column order is not the same in the temporary and final tables.

When inserting data from the temporary table into the final table, check that the column order in your select statement is correct (the partition columns need to be at the end of the select list).

hive> insert into dummy1 partition(visitDate,age) select id,comment,visitDate,age from incr_dummy1;
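
Hive matches the select list to the target table by position, not by column name, which is why the order matters here. A minimal sketch of the complete insert follows. It assumes dynamic partitioning is not already enabled in your session; the two `set` lines are standard Hive settings, but whether you need them depends on your configuration.

```sql
-- Assumption: dynamic partitioning is not yet enabled in this session.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- Non-partition columns first (id, comment), then the partition columns
-- in the order declared in PARTITIONED BY (visitDate, age).
insert into table dummy1 partition (visitDate, age)
select id, comment, visitDate, age
from incr_dummy1;

-- Inspect the result by column name instead of relying on select *.
select id, comment, visitDate, age from dummy1 limit 10;
```

Rows that were loaded earlier with the wrong column order stay in their own partitions until they are removed, so the check may still show old rows alongside the new ones.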

If you still have the problem after that, it is worth checking the following.

With an external partitioned table, dropping the table does not delete its data on HDFS, so check the HDFS directory for leftover files that were never deleted.
Then drop the table, delete the HDFS directory, create the table again, and re-run your job.
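
The cleanup steps could look roughly like this in the Hive CLI. This is a sketch under assumptions: the path `/path/to/dummy1` is hypothetical and must be replaced with the table's real location, and the `dfs` command requires that your client is allowed to run HDFS commands.

```sql
-- Look up the table's real HDFS location before dropping it.
describe formatted dummy1;

-- Dropping an external table removes only the metastore entry.
drop table if exists dummy1;

-- Delete the leftover files. The path below is hypothetical; use the
-- Location value reported by "describe formatted" above.
dfs -rm -r /path/to/dummy1;

-- Recreate the table with the DDL from the question, then re-run the insert.
create external table if not exists dummy1 (id string, comment string)
partitioned by (visitDate string, age string)
row format delimited
fields terminated by ','
stored as orc;
```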

Update:

Option1:

Is it possible to make the column order of the temporary table match the final table? If so, change the order of the columns.
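
If the CSV layout itself cannot change, one way to get a matching order without touching the files is a view over the temporary table. This is only a sketch and not part of the original suggestion; the view name `incr_dummy1_ordered` is hypothetical, and the insert assumes dynamic partitioning in nonstrict mode is enabled.

```sql
-- Hypothetical view that exposes the temporary table's columns in the
-- order the final table expects: data columns, then partition columns.
create view if not exists incr_dummy1_ordered as
select id, comment, visitDate, age
from incr_dummy1;

-- With the order fixed in the view, select * maps correctly by position.
insert into table dummy1 partition (visitDate, age)
select * from incr_dummy1_ordered;
```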

Option2:

Use a subquery with a quoted-identifier regular expression to exclude the original columns, and add only the aliased columns to the final select query.

hive> set hive.support.quoted.identifiers=none;
hive> insert into dummy1 partition(visitDate,age)
      select `(visitDate|age)?+.+` from --exclude visitDate,age columns.
     (select *,visitDate vis_dat,age age_n from incr_dummy1)t;
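
To make the regular expression less opaque: the subquery exposes id, visitDate, comment, age, vis_dat and age_n, and the pattern keeps every column whose name is not exactly visitDate or age. Under that reading, the statement should be equivalent to the explicit form sketched below (same aliases as above, written out by hand).

```sql
-- Explicit equivalent of the regex select: the original visitDate and age
-- are dropped, and their aliased copies land last, in partition order.
insert into dummy1 partition (visitDate, age)
select id, comment, vis_dat, age_n
from (select *, visitDate vis_dat, age age_n from incr_dummy1) t;
```

The regex form only pays off when the table has many non-partition columns; with two, listing them by name is easier to read.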
