I have a table in Hive that looks like this:
cust_id prod_id timestamp
1 11 2011-01-01 03:30:23
2 22 2011-01-01 03:34:53
1 22 2011-01-01 04:21:03
2 33 2011-01-01 04:44:09
3 33 2011-01-01 04:54:49
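and so on.
For each record, I want to check how many distinct products that customer bought in the preceding 24 hours, excluding the current transaction. So the output should look like this: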
1 0
2 0
1 1
2 1
3 0
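To make the expected output concrete, here is a minimal Python sketch (not Hive) that computes it from the sample rows. The function name `distinct_products_last_24h` is made up for illustration. It assumes "excluding the current transaction" means the row itself is skipped, while earlier purchases of the same product by the same customer still count.

```python
from datetime import datetime, timedelta


def distinct_products_last_24h(rows):
    """For each (cust_id, prod_id, timestamp) row, count the distinct products
    the same customer bought in the preceding 24 hours, skipping the row itself."""
    parsed = [
        (cust, prod, datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"))
        for cust, prod, ts in rows
    ]
    window = timedelta(hours=24)
    result = []
    for i, (cust, _prod, ts) in enumerate(parsed):
        # Assumption: rows at the same or an earlier time, less than 24h old,
        # are in the window; the current row (index i) is excluded.
        earlier_products = {
            p
            for j, (c, p, t) in enumerate(parsed)
            if j != i and c == cust and t <= ts and ts - t < window
        }
        result.append((cust, len(earlier_products)))
    return result


SAMPLE_ROWS = [
    (1, 11, "2011-01-01 03:30:23"),
    (2, 22, "2011-01-01 03:34:53"),
    (1, 22, "2011-01-01 04:21:03"),
    (2, 33, "2011-01-01 04:44:09"),
    (3, 33, "2011-01-01 04:54:49"),
]

if __name__ == "__main__":
    for cust_id, freq in distinct_products_last_24h(SAMPLE_ROWS):
        print(cust_id, freq)
```

On the five sample rows this gives the pairs (1, 0), (2, 0), (1, 1), (2, 1), (3, 0), matching the listing above.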
My Hive query looks like this:
select * from (
  select t1.cust_id, count(distinct t1.prod_id) as freq
  from temp_table t1
  left outer join temp_table t2 on (t1.cust_id = t2.cust_id)
  where t1.timestamp >= t2.timestamp
    and unix_timestamp(t1.timestamp) - unix_timestamp(t2.timestamp) < 24*60*60
  group by t1.cust_id
  union all
  select t2.cust_id, 0 as freq from temp_table t2
) unioned;
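Answer 0 (score: 0)
Just take all the rows from the past 24 hours, group by cust_id, and output count(distinct prod_id) - 1. The overall query looks like this (t1 and t2 are presumably the self-joined aliases of the table, as in the question's query):
SELECT cust_id, COUNT(distinct prod_id) - 1 FROM table_name WHERE unix_timestamp(t1.timestamp) - unix_timestamp(t2.timestamp) < 24*60*60 GROUP BY cust_id
*I subtract 1 here to exclude the user's latest transaction. (Hope that's what you meant.)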
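Answer 1 (score: 0)
You can join against a derived table that holds, for each customer/timestamp pair, the number of distinct products purchased in the preceding 24 hours.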
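select t1.cust_id, t1.prod_id, t1.timestamp, t2.count_distinct_prod_id - 1 as freq
from mytable t1
join (
  -- per customer/timestamp pair: distinct products bought in the 24 hours up to that timestamp
  select t2.cust_id, t2.timestamp, count(distinct t3.prod_id) count_distinct_prod_id
  from mytable t2
  join mytable t3 on t3.cust_id = t2.cust_id
  -- the lower bound keeps later purchases (negative time difference) out of the window
  where t3.timestamp <= t2.timestamp
    and unix_timestamp(t2.timestamp) - unix_timestamp(t3.timestamp) < 24*60*60
  group by t2.cust_id, t2.timestamp
) t2 on t1.cust_id = t2.cust_id and t1.timestamp = t2.timestamp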
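A derived-table join of this shape can be sanity-checked on the sample rows without a Hive cluster. The sketch below runs an equivalent query in SQLite from Python. It is not from the answer and rests on several assumptions: SQLite's `strftime('%s', ...)` stands in for Hive's `unix_timestamp`, the `timestamp` column is renamed `ts`, the derived table is aliased `d`, the helper name `run_check` is hypothetical, and the window has an explicit lower bound (`t3.ts <= t2.ts`) so that later purchases are not counted.

```python
import sqlite3

SAMPLE_ROWS = [
    (1, 11, "2011-01-01 03:30:23"),
    (2, 22, "2011-01-01 03:34:53"),
    (1, 22, "2011-01-01 04:21:03"),
    (2, 33, "2011-01-01 04:44:09"),
    (3, 33, "2011-01-01 04:54:49"),
]

# SQLite stand-in for the Hive query: strftime('%s', ts) yields Unix seconds
# as text, so it is cast to integer before subtracting.
QUERY = """
select t1.cust_id, t1.prod_id, t1.ts, d.cnt - 1 as freq
from mytable t1
join (
  select t2.cust_id, t2.ts, count(distinct t3.prod_id) as cnt
  from mytable t2
  join mytable t3 on t3.cust_id = t2.cust_id
  where t3.ts <= t2.ts
    and cast(strftime('%s', t2.ts) as integer)
      - cast(strftime('%s', t3.ts) as integer) < 24*60*60
  group by t2.cust_id, t2.ts
) d on d.cust_id = t1.cust_id and d.ts = t1.ts
order by t1.ts
"""


def run_check(rows=SAMPLE_ROWS):
    """Load rows into an in-memory SQLite table and return the query result."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "create table mytable (cust_id integer, prod_id integer, ts text)"
        )
        conn.executemany("insert into mytable values (?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in run_check():
        print(row)
```

On the sample data the freq column comes out as 0, 0, 1, 1, 0. Note that subtracting 1 only removes the current product from the count; if the customer had already bought that same product within the window, the result would be one lower than a count that merely skips the current row.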