Hive: multiple counts (with and without DISTINCT) produce wrong output

Time: 2017-04-14 14:06:27

Tags: hive hiveql

I tried this Hive query:

Select id,count(distinct CASE WHEN unix_timestamp(m_date) BETWEEN unix_timestamp(cast(date_sub(cast('2017-02-01' as date),60) as date)) AND unix_timestamp(cast('2017-02-01' as date)) THEN m_date ELSE 0 END) 
         ,count(CASE WHEN unix_timestamp(m_date) BETWEEN unix_timestamp(cast(date_sub(cast('2017-02-01' as date),60) as date)) AND unix_timestamp(cast('2017-02-01' as date)) THEN m_date ELSE 0 END) 
From DB.TABLE2 GROUP BY id limit 10;

It gave me something like:

111007001007633 1       1
111007001029793 1       1
111007001000521 1       11
111007001000794 1       1
111007001000273 3       13
111007001001032 1       1
111007001025874 1       4
111007001001792 1       7
111007001029181 1       1
111007001000141 16      96

But when I add the other counts:

 Select id,count(distinct CASE WHEN unix_timestamp(m_date) BETWEEN unix_timestamp(cast(date_sub(cast('2017-02-01' as date),60) as date)) AND unix_timestamp(cast('2017-02-01' as date)) THEN m_date ELSE 0 END) 
         ,count(CASE WHEN unix_timestamp(m_date) BETWEEN unix_timestamp(cast(date_sub(cast('2017-02-01' as date),60) as date)) AND unix_timestamp(cast('2017-02-01' as date)) THEN m_date ELSE 0 END) 
         ,count(distinct CASE WHEN unix_timestamp(m_date) BETWEEN unix_timestamp(cast(date_sub(cast('2017-02-01' as date),15) as date)) AND unix_timestamp(cast('2017-02-01' as date)) THEN m_date ELSE 0 END) 
         ,count(CASE WHEN unix_timestamp(m_date) BETWEEN unix_timestamp(cast(date_sub(cast('2017-02-01' as date),15) as date)) AND unix_timestamp(cast('2017-02-01' as date)) THEN m_date ELSE 0 END) 
 From DB.TABLE2 GROUP BY id limit 10;

It returns something like this:

 111007001010439 0       0       1       0
 111007001026963 0       0       1       0
 111007001028001 0       0       1       0
 111007001032987 0       0       1       0
 111007001048710 0       0       1       0
 111007001052415 0       0       1       0
 111007002008374 0       0       1       0
 111007003000644 0       0       1       0
 111007003002210 0       0       1       0

I am working on a Hadoop cluster, and I don't know whether this is caused by MapReduce.

Thanks

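[EDIT]

As I answered in reply to @pashaz's comment, the first problem is the result of two identical expressions (one with DISTINCT, one without), where the DISTINCT count is 1 and the non-DISTINCT count is 0.

The second problem is the results of the two DISTINCT counts compared with each other, and likewise the two non-DISTINCT counts. If you check the timestamps, you will see that the first window contains the second: the first two expressions count occurrences between '2017-02-01' and 60 days before it, while the last two count occurrences between '2017-02-01' and 15 days before it. The 60-day counts should therefore never be smaller than the 15-day counts.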
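To make the two windows concrete, here is a small sketch that only evaluates the window boundaries used in the queries above. It assumes a Hive version that allows SELECT without a FROM clause (0.13 or later); the column aliases are illustrative and do not appear in the original query.

```sql
-- Start dates of the two windows that both end on 2017-02-01.
-- The aliases window_60_start and window_15_start are hypothetical names.
SELECT date_sub('2017-02-01', 60) AS window_60_start,  -- 2016-12-03
       date_sub('2017-02-01', 15) AS window_15_start;  -- 2017-01-17
```

Since 2016-12-03 is earlier than 2017-01-17, every row that falls in the 15-day window also falls in the 60-day window.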

[UPDATE]

If I put a WHERE clause in it:

 WHERE id="111007001007633" OR id="271011604404359" OR id="122213250512607" OR id="111007001033217"


111007001033217 0       0       0       0       0       0
122213250512607 1       3       8       14      0       0
271011604404359 12      21      26      42      5       9
111007001007633 14      19      24      34      5       5

The LIMIT clause seems to be the problem.

1 answer:

Answer 0: (score: 1)

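There is nothing wrong with the results shown. Both queries end with "limit 10", and without an ORDER BY there is no guarantee that the two queries return the same ids.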

For example, "111007001007633" appears in the result of the first query but not in the result of the second.
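One possible way to make the two runs comparable is to sort before limiting, so that both queries return the same ten ids. This is a minimal sketch, assuming that ordering by id is acceptable for this table. The ORDER BY clause is the only addition; the count expressions are copied unchanged from the question.

```sql
-- Same four counts as in the question, with a deterministic row selection.
SELECT id,
       count(DISTINCT CASE WHEN unix_timestamp(m_date) BETWEEN unix_timestamp(cast(date_sub(cast('2017-02-01' AS date), 60) AS date)) AND unix_timestamp(cast('2017-02-01' AS date)) THEN m_date ELSE 0 END),
       count(CASE WHEN unix_timestamp(m_date) BETWEEN unix_timestamp(cast(date_sub(cast('2017-02-01' AS date), 60) AS date)) AND unix_timestamp(cast('2017-02-01' AS date)) THEN m_date ELSE 0 END),
       count(DISTINCT CASE WHEN unix_timestamp(m_date) BETWEEN unix_timestamp(cast(date_sub(cast('2017-02-01' AS date), 15) AS date)) AND unix_timestamp(cast('2017-02-01' AS date)) THEN m_date ELSE 0 END),
       count(CASE WHEN unix_timestamp(m_date) BETWEEN unix_timestamp(cast(date_sub(cast('2017-02-01' AS date), 15) AS date)) AND unix_timestamp(cast('2017-02-01' AS date)) THEN m_date ELSE 0 END)
FROM DB.TABLE2
GROUP BY id
ORDER BY id   -- added: fixes which ids survive the LIMIT
LIMIT 10;
```

Filtering on a fixed list of ids with a WHERE clause, as in the update to the question, achieves the same thing without the cost of a global sort.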