Hive multi-insert with a DISTINCT select statement gives wrong results

Date: 2013-03-02 10:50:43

Tags: hadoop hive hiveql

I read this code in "Hadoop: The Definitive Guide":

FROM (
SELECT a.ad_id, a.campaign_id, a.account_id, b.user_id
FROM dim_ads a JOIN impression_logs b ON (b.ad_id = a.ad_id)
WHERE b.dateid = '2008-12-01') x
INSERT OVERWRITE DIRECTORY 'results_gby_adid'
SELECT x.ad_id, count(1), count(DISTINCT x.user_id) GROUP BY x.ad_id
INSERT OVERWRITE DIRECTORY 'results_gby_campaignid'
SELECT x.campaign_id, count(1), count(DISTINCT x.user_id) GROUP BY x.campaign_id
INSERT OVERWRITE DIRECTORY 'results_gby_accountid'
SELECT x.account_id, count(1), count(DISTINCT x.user_id) GROUP BY x.account_id;

But in my own test, using multiple DISTINCTs does not give the correct results.

My HiveQL is as follows:

CREATE TABLE IF NOT EXISTS a (logindate int, id int);

Then I load a local file into this table...

CREATE TABLE IF NOT EXISTS user (id INT) PARTITIONED BY (logindate INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;

Then, if I insert into the table with separate statements:

INSERT OVERWRITE TABLE user PARTITION(logindate=20130120) SELECT DISTINCT(id) FROM a WHERE logindate=20130120;
INSERT OVERWRITE TABLE user PARTITION(logindate=20130121) SELECT DISTINCT(id) FROM a WHERE logindate=20130121;

the result is correct.

But if I use the following multi-insert HQL instead:

FROM a
INSERT OVERWRITE TABLE user PARTITION(logindate=20130120) SELECT DISTINCT(id) WHERE logindate=20130120
INSERT OVERWRITE TABLE user PARTITION(logindate=20130121) SELECT DISTINCT(id) WHERE logindate=20130121;

the results are not correct: both partitions end up with the same number of records, as if each insert had run SELECT DISTINCT(id) WHERE logindate=20130120 OR logindate=20130121.

Is this a bug, or did I get the syntax wrong?

1 Answer:

Answer 0: (score: 1)

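DISTINCT has a somewhat odd history in the Hive code as an alias for GROUP BY. If this is a bug, the Hive version you are using matters a great deal, because bugs get fixed from one release to the next.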
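
As a minimal sketch of that equivalence, assuming the table `a` from the question: on a plain single-table query the two statements should return the same set of ids, so swapping one form for the other does not change the intended result.

```sql
-- Both queries return each id that logged in on 20130120 exactly once.
SELECT DISTINCT id FROM a WHERE logindate = 20130120;

SELECT id FROM a WHERE logindate = 20130120 GROUP BY id;
```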

This might work:

FROM a
INSERT OVERWRITE TABLE user PARTITION(logindate=20130120) SELECT id WHERE logindate=20130120 GROUP BY id
INSERT OVERWRITE TABLE user PARTITION(logindate=20130121) SELECT id WHERE logindate=20130121 GROUP BY id;

If that does not work, this one definitely will... even though it is not the approach you were trying to use...

FROM (select distinct id, logindate from a where logindate in (20130120, 20130121)) subq_a
INSERT OVERWRITE TABLE user PARTITION(logindate=20130120) SELECT id WHERE logindate=20130120
INSERT OVERWRITE TABLE user PARTITION(logindate=20130121) SELECT id WHERE logindate=20130121;
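
To check whichever variant you end up using, one option is to compare the row count of each partition against the distinct count computed directly from `a`. This is only a sketch using the table and column names from the question; the aliases `expected_ids` and `loaded_ids` are made up for illustration.

```sql
-- Expected: distinct ids per day, computed straight from the source table.
SELECT logindate, count(DISTINCT id) AS expected_ids
FROM a
WHERE logindate IN (20130120, 20130121)
GROUP BY logindate;

-- Actual: rows written to each partition of the target table.
SELECT logindate, count(1) AS loaded_ids
FROM user
WHERE logindate IN (20130120, 20130121)
GROUP BY logindate;
```

If the two counts differ for either day, that insert did not deduplicate per partition.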