In Hive

Time: 2018-02-08 01:38:20

Tags: hive subquery hiveql

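I have a table in which each row represents one order. I am trying to write a query that returns, for every customer whose second order was placed in January 2017, that second order and all of the customer's later orders in 2017.

The initial code looks like this: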

    SELECT
         order_date
        ,cust_id
        ,nth_booking
        ,total_bookings
    FROM (SELECT cust_id
                ,order_date
                ,order_id
                ,COUNT(*) OVER (PARTITION BY cust_id ORDER BY order_date) AS nth_booking
                ,COUNT(*) OVER (PARTITION BY cust_id) AS total_bookings
          FROM my.orders
          -- the output shown below spans all of 2017, so the range covers the whole year
          WHERE order_date BETWEEN '2017-01-01' AND '2017-12-31') t1

This gives the following output, which is fine so far:

-------------------------------------------------------
| order_date | cust_id | nth_booking | total_bookings |
-------------------------------------------------------
| 2017-01-01 |   123   |       1     |       4        |
| 2017-01-02 |   123   |       2     |       4        |
| 2017-01-05 |   123   |       3     |       4        |
| 2017-09-27 |   123   |       4     |       4        |
| 2017-02-02 |   456   |       1     |       3        |
| 2017-11-16 |   456   |       2     |       3        |
| 2017-12-04 |   456   |       3     |       3        |
| 2017-01-17 |   678   |       1     |       5        |
| 2017-01-30 |   678   |       2     |       5        |
| 2017-02-31 |   678   |       3     |       5        |
| 2017-05-26 |   678   |       4     |       5        |
| 2017-09-18 |   678   |       5     |       5        |

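However, I only want the orders from the second order onward, and that second order must have been placed in January 2017. So I added some further conditions, and the query now looks like this: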

    SELECT
         order_date
        ,cust_id
        ,nth_booking
        ,total_bookings
    FROM (SELECT cust_id
                ,order_date
                ,order_id
                ,COUNT(*) OVER (PARTITION BY cust_id ORDER BY order_date) AS nth_booking
                ,COUNT(*) OVER (PARTITION BY cust_id) AS total_bookings
          FROM my.orders
          WHERE order_date BETWEEN '2017-01-01' AND '2017-12-31') t1
    WHERE nth_booking >= 2
      AND order_date BETWEEN '2017-01-01' AND '2017-01-31'

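This is clearly not right, and I can see why when I look at the result below: the outer order_date condition is applied to every row, so only the January rows survive: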

-------------------------------------------------------
| order_date | cust_id | nth_booking | total_bookings |
-------------------------------------------------------
| 2017-01-02 |   123   |       2     |       4        |
| 2017-01-05 |   123   |       3     |       4        |
| 2017-01-30 |   678   |       2     |       5        |

What I want, however, is more like this: the second order was placed in January 2017, and all subsequent orders are shown as well.

-------------------------------------------------------
| order_date | cust_id | nth_booking | total_bookings |
-------------------------------------------------------
| 2017-01-02 |   123   |       2     |       4        |
| 2017-01-05 |   123   |       3     |       4        |
| 2017-09-27 |   123   |       4     |       4        |
| 2017-01-30 |   678   |       2     |       5        |
| 2017-02-31 |   678   |       3     |       5        |
| 2017-05-26 |   678   |       4     |       5        |
| 2017-09-18 |   678   |       5     |       5        |

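How can I get to this result?

I would appreciate any guidance, and I hope I have provided enough detail on my approach and working to make this reproducible.

Thanks in advance.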

1 Answer:

Answer 0: (score: 1)

Calculate a second_order_jan flag per cust_id and use it for filtering:

    select
          order_date
         ,cust_id
         ,nth_booking
         ,total_bookings
    from
    ( --calculate second_order_jan flag for the cust_id
      select cust_id,
             order_date,
             order_id,
             nth_booking,
             total_bookings,
             max(case when month(order_date) = 1 and nth_booking = 2 then 1 end) over (partition by cust_id) as second_order_jan_flag
      from
      (
        SELECT cust_id
              ,order_date
              ,order_id
              ,COUNT(*) OVER (PARTITION BY cust_id ORDER BY order_date) AS nth_booking
              ,COUNT(*) OVER (PARTITION BY cust_id) AS total_bookings
        FROM my.orders
        --keep the whole year here so that orders placed after January are still available
        WHERE order_date BETWEEN '2017-01-01' AND '2017-12-31'
      ) t1
    ) t2
    where second_order_jan_flag = 1
      and nth_booking >= 2 --keep only the second order and everything after it
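
To sanity-check the flag-then-filter idea without a Hive cluster, here is a minimal pure-Python sketch that mimics the two window counts and the per-customer flag on the sample rows from the question. It is not Hive code: the function name `orders_from_second_january_order` and the in-memory `ORDERS` list are made up for illustration, and it assumes the inner filter covers the whole of 2017.

```python
from collections import defaultdict

# (cust_id, order_date) pairs copied from the question's first output.
# "2017-02-31" is not a real date; it is kept as a plain string because
# only string comparison and slicing are applied to it here.
ORDERS = [
    (123, "2017-01-01"), (123, "2017-01-02"), (123, "2017-01-05"), (123, "2017-09-27"),
    (456, "2017-02-02"), (456, "2017-11-16"), (456, "2017-12-04"),
    (678, "2017-01-17"), (678, "2017-01-30"), (678, "2017-02-31"),
    (678, "2017-05-26"), (678, "2017-09-18"),
]


def orders_from_second_january_order(orders):
    """Return (order_date, cust_id, nth_booking, total_bookings) rows.

    Hypothetical helper that imitates the nested query: number the orders
    per customer, flag customers whose second order falls in January, then
    keep the second and later orders of the flagged customers.
    """
    by_customer = defaultdict(list)
    for cust_id, order_date in orders:
        # Innermost WHERE: keep the whole year (ISO dates compare as strings).
        if "2017-01-01" <= order_date <= "2017-12-31":
            by_customer[cust_id].append(order_date)

    result = []
    for cust_id in sorted(by_customer):
        dates = sorted(by_customer[cust_id])
        # COUNT(*) OVER (PARTITION BY cust_id)
        total_bookings = len(dates)
        # COUNT(*) OVER (PARTITION BY cust_id ORDER BY order_date):
        # a running count in which orders on the same date share a value.
        numbered = [(d, sum(1 for other in dates if other <= d)) for d in dates]
        # MAX(CASE WHEN month = 1 AND nth_booking = 2 THEN 1 END) OVER (PARTITION BY cust_id)
        second_order_in_january = any(n == 2 and d[5:7] == "01" for d, n in numbered)
        if second_order_in_january:
            # Outer WHERE: flag set and nth_booking >= 2.
            result.extend((d, cust_id, n, total_bookings) for d, n in numbered if n >= 2)
    return result


if __name__ == "__main__":
    for row in orders_from_second_january_order(ORDERS):
        print(row)
```

Customer 456 drops out because its second order is in November, while customers 123 and 678 keep their second and all later orders. Note that a running `COUNT(*)` gives orders placed on the same date the same number; if that matters for your data, `ROW_NUMBER()` may be a better fit, though that depends on how ties should be treated.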