Hive query with HAVING and DISTINCT

Time: 2017-05-02 04:21:14

Tags: hadoop hive

I have the following Hive table:

accountNum  date  status  action  qty  time
----------  ----  ------  ------  ---  -----
1234        2017  filled  B        10  11:20
1234        2017  filled  S        10  11:20
2345        2017  filled  B        20  12:00
2345        2017  filled  B        10  12:00
4444        2017  filled  B         5  01:00
4444        2017  filled  S         5  02:00

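Here I want to compare two rows: one with action 'B' and then one with action 'S'. When a B row is found and an S row is then found among those records, I have to check that accountNum, date, time, and status are the same on both.

So, based on the test data above, I should get only the first two rows: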

accountNum  date  status  action  qty  time
----------  ----  ------  ------  ---  -----
1234        2017  filled  B        10  11:20
1234        2017  filled  S        10  11:20

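What kind of query should I write for this?

I have the following MySQL query, but Hive does not support HAVING / DISTINCT / COUNT, so it does not work in Hive. Is there any way to use HAVING, or to write the query with a JOIN instead?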

select  t1.*
from    yourTable t1
join    (
            select  accountNum, date, status, time
            from    yourTable
            where   action in ('B', 'S')
            group by accountNum, date, status, time
            having  count(distinct action) = 2
        ) t2
on      t1.accountNum = t2.accountNum and
        t1.date = t2.date and
        t1.status = t2.status and
        t1.time = t2.time

1 answer:

Answer 0: (score: 0)

1.

`date` is a reserved word, so it has to be quoted with backticks.

2.

There seems to be a limitation on expressions in the HAVING clause when they do not also appear in the SELECT clause.

This query (based on your original query) works:

select  t1.*
from    yourTable t1
join    (
            select  accountNum, `date`, status, time, count(distinct action)
            from    yourTable
            where   action in ('B', 'S')
            group by accountNum, `date`, status, time
            having  count(distinct action) = 2
        ) t2
on      t1.accountNum = t2.accountNum and
        t1.`date` = t2.`date` and
        t1.status = t2.status and
        t1.time = t2.time
;

+------------+------+--------+--------+-----+-------+
| accountnum | date | status | action | qty | time  |
+------------+------+--------+--------+-----+-------+
|       1234 | 2017 | filled | B      |  10 | 11:20 |
|       1234 | 2017 | filled | S      |  10 | 11:20 |
+------------+------+--------+--------+-----+-------+

Here is another solution, based on window functions:

select  accountnum,`date`,status,action,qty,time
from   (select  *
               ,max(case when action = 'B' then 1 end) over w as b_flag
               ,max(case when action = 'S' then 1 end) over w as s_flag
        from    yourTable
        where   action in ('B', 'S')
        window  w as (partition by  accountNum, `date`, status, time)
        ) t
where   b_flag = 1
    and s_flag = 1
;

+------------+------+--------+--------+-----+-------+
| accountnum | date | status | action | qty | time  |
+------------+------+--------+--------+-----+-------+
|       1234 | 2017 | filled | B      |  10 | 11:20 |
|       1234 | 2017 | filled | S      |  10 | 11:20 |
+------------+------+--------+--------+-----+-------+
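Since the question also asks for a way to do this with only a JOIN, here is a minimal sketch of a variant that uses neither HAVING nor DISTINCT: the subquery computes one flag per action with conditional aggregation, and the outer WHERE filters on those flags. The aliases has_b and has_s and the helper matched_rows are names made up for this illustration. The query is run here through Python's sqlite3 module against the sample rows from the question, purely to check the logic; SQLite is a stand-in, and the query has not been run on Hive.

```python
import sqlite3

# Sample rows copied from the table in the question.
ROWS = [
    (1234, 2017, "filled", "B", 10, "11:20"),
    (1234, 2017, "filled", "S", 10, "11:20"),
    (2345, 2017, "filled", "B", 20, "12:00"),
    (2345, 2017, "filled", "B", 10, "12:00"),
    (4444, 2017, "filled", "B", 5, "01:00"),
    (4444, 2017, "filled", "S", 5, "02:00"),
]

# JOIN-only variant: no HAVING, no DISTINCT.
# has_b / has_s are hypothetical aliases: 1 if the group contains that action.
QUERY = """
select  t1.*
from    yourTable t1
join    (
            select  accountNum, `date`, status, `time`
                   ,max(case when action = 'B' then 1 else 0 end) as has_b
                   ,max(case when action = 'S' then 1 else 0 end) as has_s
            from    yourTable
            group by accountNum, `date`, status, `time`
        ) t2
on      t1.accountNum = t2.accountNum and
        t1.`date` = t2.`date` and
        t1.status = t2.status and
        t1.`time` = t2.`time`
where   t2.has_b = 1
    and t2.has_s = 1
    and t1.action in ('B', 'S')
order by t1.accountNum, t1.action
"""


def matched_rows():
    """Load the sample rows into an in-memory SQLite table and run QUERY."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table yourTable (accountNum integer, `date` integer, "
        "status text, action text, qty integer, `time` text)"
    )
    conn.executemany("insert into yourTable values (?, ?, ?, ?, ?, ?)", ROWS)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in matched_rows():
        print(row)
```

On the sample data only the two rows for account 1234 come back. The 2345 group has two B rows and no S, and the two 4444 rows differ in time, so they fall into separate groups.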