Spark SQL (Hive query through HiveContext): INSERT OVERWRITE does not overwrite existing data if the Hive table has multiple partitions

Date: 2017-11-22 14:43:45

Tags: apache-spark hive apache-spark-sql

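We are trying to INSERT OVERWRITE the test5 table, which has multiple partitions. According to the documentation (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML), INSERT OVERWRITE will overwrite any existing data in the table or partition. However, after the INSERT OVERWRITE query has run, we still get some of the old data. Below is a sample execution and its output, run on Hive 1.2.1000.2.6.1.0-129.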

We get the same result when running the statements through HiveContext in Spark 2.1.1.

CREATE TABLE dbtest.test5 (emp_id INT) PARTITIONED BY (depart_id INT,depart_name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION 'externalpath'; 

INSERT INTO TABLE dbtest.test5  PARTITION (depart_id,depart_name) SELECT emp_id,depart_id,depart_name from dbtest.tempTableHive1; 

4       123     Dev 
5       123     Dev 
6       123     Test 
7       567     Test 

INSERT INTO TABLE dbtest.test5  PARTITION (depart_id,depart_name) SELECT emp_id,depart_id,depart_name from dbtest.tempTableHive2; 
4       123     Dev 
5       123     Dev 
1       123     Dev 
2       123     Dev 
6       123     Test 
3       123     Test 
7       567     Test 

INSERT OVERWRITE TABLE dbtest.test5  PARTITION (depart_id,depart_name) SELECT emp_id,depart_id,depart_name from dbtest.tempTableHive3; 

8       123     Dev 
9       123     Dev 
10      123     Dev 
6       123     Test 
3       123     Test 
7       567     Test 

Is there something wrong with the code, or is this an Apache Hive issue?

1 answer:

Answer 0 (score: 0)

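When you specify INSERT OVERWRITE, Hive overwrites the partition. With dynamic partitioning, only the partitions that receive rows from the SELECT are replaced; partitions the query does not write to keep their existing data. See the output below from the Cloudera QuickStart VM.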

hive> SELECT * FROM tempTableHive1;
OK
4   123 Dev
5   567 Test
Time taken: 0.048 seconds, Fetched: 2 row(s)
hive> INSERT INTO TABLE test5  PARTITION (depart_id,depart_name) SELECT emp_id,depart_id,depart_name from tempTableHive1; 

hive> SELECT * FROM test5;
OK
4   123 Dev
5   567 Test
Time taken: 0.065 seconds, Fetched: 2 row(s)

hive> SELECT * FROM tempTableHive2;
OK
4   123 Dev
6   123 Dev
Time taken: 0.047 seconds, Fetched: 2 row(s)

hive> INSERT INTO TABLE test5  PARTITION (depart_id,depart_name) 
    > SELECT emp_id,depart_id,depart_name from tempTableHive2; 

hive> SELECT * FROM test5;
OK
4   123 Dev
4   123 Dev
6   123 Dev
5   567 Test
Time taken: 0.057 seconds, Fetched: 4 row(s)

hive> SELECT * FROM tempTableHive3;
OK
100 123 Dev
101 123 Dev

hive> INSERT OVERWRITE TABLE test5  PARTITION (depart_id,depart_name) 
    > SELECT emp_id,depart_id,depart_name from tempTableHive3;

hive> SELECT * FROM test5;
OK
100 123 Dev
101 123 Dev
5   567 Test
Time taken: 0.072 seconds, Fetched: 3 row(s)
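
The partition-level behaviour in the session above can be modelled with a small in-memory sketch. This is only a toy illustration, not how Hive is implemented: the class PartitionedTable and its methods are hypothetical names, and the row order of a real Hive SELECT is not guaranteed the way it is here.

```python
from collections import OrderedDict


class PartitionedTable:
    """Toy model of a table partitioned by (depart_id, depart_name)."""

    def __init__(self):
        # partition key -> list of emp_id values stored in that partition
        self.partitions = OrderedDict()

    def insert_into(self, rows):
        # INSERT INTO appends each row to the partition it maps to.
        for emp_id, depart_id, depart_name in rows:
            self.partitions.setdefault((depart_id, depart_name), []).append(emp_id)

    def insert_overwrite(self, rows):
        # Dynamic-partition INSERT OVERWRITE replaces only the partitions
        # that appear in the incoming rows; every other partition is kept.
        incoming = OrderedDict()
        for emp_id, depart_id, depart_name in rows:
            incoming.setdefault((depart_id, depart_name), []).append(emp_id)
        for key, emp_ids in incoming.items():
            self.partitions[key] = emp_ids

    def select_all(self):
        # Flatten the partitions back into (emp_id, depart_id, depart_name) rows.
        return [(emp_id, depart_id, depart_name)
                for (depart_id, depart_name), emp_ids in self.partitions.items()
                for emp_id in emp_ids]


table = PartitionedTable()
table.insert_into([(4, 123, "Dev"), (5, 567, "Test")])            # tempTableHive1
table.insert_into([(4, 123, "Dev"), (6, 123, "Dev")])             # tempTableHive2
table.insert_overwrite([(100, 123, "Dev"), (101, 123, "Dev")])    # tempTableHive3
result = table.select_all()
for row in result:
    print(*row)
```

Only the (123, Dev) partition is rewritten, so the row in (567, Test) survives, as in the final SELECT of the session. The same model would explain the output in the question if tempTableHive3 there contains only (123, Dev) rows, which is an assumption because its contents are not shown.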

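If you still have a problem, the best way to debug it is to check the HDFS files. There should be one directory of data files for each depart_id/depart_name combination, for example /user/hive/warehouse/test5/depart_id=123/depart_name=Dev. Because they are text files, you can quickly "cat" them to see their contents. Let us know how you get on.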
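
As a rough aid for that check, the sketch below builds the partition directory names and the matching "hdfs dfs" commands to run by hand. The helper names partition_dir and cat_command are hypothetical, and the root path is simply the one from the example above; for the table in the question it would be whatever the LOCATION clause points to. The sketch only prints the commands, it does not execute them.

```python
def partition_dir(table_root, partition_values):
    """Build a Hive-style partition directory such as .../depart_id=123/depart_name=Dev.

    Assumes plain values; Hive escapes some special characters in
    partition values, which this helper does not attempt to do.
    """
    parts = [f"{column}={value}" for column, value in partition_values]
    return "/".join([table_root.rstrip("/")] + parts)


def cat_command(table_root, partition_values):
    # 'hdfs dfs -cat <dir>/*' prints every data file in the partition directory.
    return ["hdfs", "dfs", "-cat", partition_dir(table_root, partition_values) + "/*"]


root = "/user/hive/warehouse/test5"  # path taken from the example; adjust to your table

# List every partition directory and file under the table root.
print(" ".join(["hdfs", "dfs", "-ls", "-R", root]))

# Print one cat command per partition you want to inspect.
for values in [[("depart_id", 123), ("depart_name", "Dev")],
               [("depart_id", 567), ("depart_name", "Test")]]:
    print(" ".join(cat_command(root, values)))
```

If a partition directory still holds files that predate the INSERT OVERWRITE, that partition was not rewritten by the query.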