Help with a Hive query to select only one record from among different dates

Date: 2017-03-20 07:02:28

Tags: hive

Store number    allocwgt    item            date         day
88006           0.14        40000349094     1/6/2013     Sunday
10374           0.14        40000349094     1/6/2013     Sunday
88010           0.14        40000349094     3/19/2017    Sunday
9388            0           40000349094     1/7/2013     Monday
9300            0           40000349094     3/20/2017    Monday
9300            0           40000349094     3/27/2017    Monday
1139            0           40000349094     3/16/2015    Monday

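For a given item, I only need to select one record per day of the week (for example, one Sunday record), because the allocwgt value is the same across all dates.

An item can have multiple records for each day of the week, created on different dates, but I only need 7 records: one per day of the week, i.e. Sunday, Monday, Tuesday, and so on.

Note: it would be good if the selected record is the most recent one.

Can someone help me write this as a Hive query?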

Expected output should be:

Store number    allocwgt    item            date            day 
88006           0.14        40000349094     2017-03-19      Sunday
09300           0.00        40000349094     2017-03-27      Monday


2 answers:

Answer 0 (score: 0)

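Use row_number(). The query below selects one record per item and date, the one with the smallest store_number. Write the appropriate order by inside over() to change this behavior, or simply remove the order by if any single record per item and date will do. I have replaced the date column with store_date, because date is a reserved word in Hive.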

select Store_number, allocwgt, item, store_date, day
from
(
select Store_number, allocwgt, item, store_date, day, 
       row_number() over(partition by item, store_date order by store_number) rn
from table_name
) s
where rn=1
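
The question asks for one row per weekday, preferably the most recent one, whereas the query above keeps one row per item and date. A minimal sketch of that variant follows. It partitions by item and day and orders by date, newest first. It assumes store_date is a string in M/d/yyyy form, as in the question's sample data, and it reuses the placeholder name table_name from the query above. If store_date is already a DATE column, ordering by store_date desc directly should be enough.

```sql
-- Sketch only: keep the most recent row per (item, weekday).
-- Assumption: store_date is a STRING such as '3/19/2017' (M/d/yyyy),
-- so it is converted to a Unix timestamp before sorting.
select store_number, allocwgt, item, store_date, day
from
(
select store_number, allocwgt, item, store_date, day,
       row_number() over(partition by item, day
                         order by unix_timestamp(store_date, 'M/d/yyyy') desc) rn
from table_name
) s
where rn = 1;
```

Sorting the raw string would compare text rather than dates (for example, '3/27/2017' versus '1/7/2013' would be decided by the first character), which is why the sketch converts the value first.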

Answer 1 (score: 0)

Thanks, the query gives the expected result. However, I had assumed the allocwgt value would always be the same, and I have since found that it can differ.

Now, when I ran the query below:

create table temp_cso_2 as
select *
from
(
select b.loc,
       a.allocwgt,
       b.item,
       date_add('1970-01-01', cast((a.Eff/1440) as int)) as date_from_minutes,
       date_format(date_add('1970-01-01', cast((a.Eff/1440) as int)), 'EEEE') as day_of_date,
       row_number() over(partition by item, date_add('1970-01-01', cast((a.Eff/1440) as int)) order by b.loc) rn
from scm.CALDATA a left outer join scm.SKUDEMANDPARAM b
on a.cal = b.alloccal
where a.repeat = 0 and b.run_date = to_date('2017-03-02') and b.item between 40000000000 and 40000999999
) s
where rn=1

This query gives me the result below:

    +-----------------+----------------------+------------------+-------------------------------+-------------------------+----------------+--+
    | temp_cso_2.loc  | temp_cso_2.allocwgt  | temp_cso_2.item  | temp_cso_2.date_from_minutes  | temp_cso_2.day_of_date  | temp_cso_2.rn  |
    +-----------------+----------------------+------------------+-------------------------------+-------------------------+----------------+--+
    | 00074           | 0.15                 | 40000110552      | 2013-01-10                    | Thursday                | 1              |
    | 00074           | 0.17                 | 40000110552      | 2013-01-11                    | Friday                  | 1              |
    | 00074           | 0.17                 | 40000110552      | 2013-01-12                    | Saturday                | 1              |
    | 00074           | 0.12                 | 40000110552      | 2013-01-06                    | Sunday                  | 1              |
    | 00074           | 0.12                 | 40000110552      | 2013-01-07                    | Monday                  | 1              |
    | 00074           | 0.13                 | 40000110552      | 2013-01-08                    | Tuesday                 | 1              |
    | 00074           | 0.14                 | 40000110552      | 2013-01-09                    | Wednesday               | 1              |
    | 00074           | 0.0                  | 40000110552      | 2018-04-24                    | Tuesday                 | 1              |
    +-----------------+----------------------+------------------+-------------------------------+-------------------------+----------------+--+

So the problem is with the Tuesday records. I get two of them because their allocwgt values are different. What should I do so that I get only the one record with the latest date? Also, is there anything that would improve the performance of this query? Please help.
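
One possible explanation is that the window in the query above is partitioned by item and calendar date, so two Tuesdays on different dates fall into separate partitions and each gets rn = 1. A minimal sketch of a variant follows. It partitions by item and weekday and orders by the derived date, newest first. It reuses the table and column names from the query above, and the subquery alias t is introduced only for illustration. Since the where clause filters on b.run_date and b.item, rows without a match in b are discarded anyway, so the sketch writes the join as a plain inner join. Whether that, or computing the date expression only once, makes a measurable difference to performance has not been tested here.

```sql
-- Sketch only: one row per (item, weekday), keeping the latest date.
create table temp_cso_2 as
select loc, allocwgt, item, date_from_minutes, day_of_date
from
(
select t.*,
       -- newest date first; loc is only a tie-breaker within the same date
       row_number() over(partition by t.item, t.day_of_date
                         order by t.date_from_minutes desc, t.loc) rn
from
(
select b.loc,
       a.allocwgt,
       b.item,
       date_add('1970-01-01', cast((a.Eff/1440) as int)) as date_from_minutes,
       date_format(date_add('1970-01-01', cast((a.Eff/1440) as int)), 'EEEE') as day_of_date
from scm.CALDATA a
join scm.SKUDEMANDPARAM b
  on a.cal = b.alloccal
where a.repeat = 0
  and b.run_date = to_date('2017-03-02')
  and b.item between 40000000000 and 40000999999
) t
) s
where rn = 1;
```

For the rows shown above, the Tuesday partition for item 40000110552 would then keep the 2018-04-24 row and drop the 2013-01-08 one, assuming those are the only Tuesday rows for that item.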