I have a BigQuery table that looks like this:
date hits_eventInfo_Category hits_eventInfo_Action session_id user_id hits_time hits_eventInfo_Label
20151021 Air Search 1445001 A232 1952 City1
20151021 Air Select 1445001 A232 2300 Vendor1
20151021 Air Search 1445001 A111 1000 City2
20151021 Air Search 1445001 A111 1900 City3
20151021 Air Select 1445001 A111 7380 Vendor2
20151021 Air Search 1445001 A580 1000 City4
20151021 Air Search 1445001 A580 1900 City5
20151021 Air Search 1445001 A580 1900 City6
20151021 Air Select 1445001 A580 7380 Vendor3
The table shows the activity of 3 users, A232, A111 and A580, as follows:
i) A232 - Made 1 Search at 'City1' and chose 'Vendor1' from 'City1'
ii) A111 - Made the 1st search at 'City2' and did not choose any vendor from there. Made a 2nd search at 'City3' and then ultimately chose 'Vendor2' from there.
iii) A580 - 1st search at 'City4', no vendor chosen. 2nd search at 'City5', no vendor chosen. 3rd search at 'City6', 'Vendor3' chosen from City6.
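I am only interested in retrieving the cities where the user actually chose a vendor. In other words, I am not interested in the earlier searches from which the user did not choose a vendor.
Required output table: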
date hits_eventInfo_Category hits_eventInfo_Action session_id user_id hits_time city vendor
20151021 Air Search 1445001 A232 1952 City1 Vendor1
20151021 Air Search 1445001 A111 1900 City3 Vendor2
20151021 Air Search 1445001 A580 1900 City6 Vendor3
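After partitioning on user_id and ordering by hits_time, I have been trying to do this with the LAG function on the hits_eventInfo_eventLabel field, i.e. LAG(hits_eventInfo_eventLabel,1) OVER( PARTITION BY user_id ORDER BY hits_time)
However, since I am using a lag offset of 1, the expression above gives me the desired output only for user A232 (he made just 1 search, so the record immediately before the vendor selection is necessarily a search record).
Is there a way to make this lag expression more dynamic, so that it retrieves only the location searched immediately before the selection, regardless of how many searches were made before it?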
OR
Is there some other function or approach I could use to achieve this?
Answer 0 (score: 1)
select
  date,
  hits_eventInfo_Category,
  hits_eventInfo_Action,
  session_id,
  user_id,
  hits_time,
  prev as city,
  hits_eventInfo_Label as vendor
from (
  select *,
    lag(hits_eventInfo_Label, 1) over(partition by user_id order by hits_time) as prev
  from dataset.table
)
where hits_eventInfo_Action = 'Select'
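
To see why an offset of 1 is already enough, here is a minimal sketch that runs the same lag-then-filter query locally against the sample rows from the question. It uses Python's built-in sqlite3 module rather than BigQuery, and assumes an SQLite build with window-function support (3.25 or newer). The table name hits, the helper city_vendor_pairs and the seq column are hypothetical: seq is only a stand-in tie-breaker, because the 'City5' and 'City6' rows share hits_time 1900 and ordering by hits_time alone does not guarantee which of the two comes last.

```python
import sqlite3

# Sample rows copied from the question:
# (date, category, action, session_id, user_id, hits_time, label)
ROWS = [
    ("20151021", "Air", "Search", 1445001, "A232", 1952, "City1"),
    ("20151021", "Air", "Select", 1445001, "A232", 2300, "Vendor1"),
    ("20151021", "Air", "Search", 1445001, "A111", 1000, "City2"),
    ("20151021", "Air", "Search", 1445001, "A111", 1900, "City3"),
    ("20151021", "Air", "Select", 1445001, "A111", 7380, "Vendor2"),
    ("20151021", "Air", "Search", 1445001, "A580", 1000, "City4"),
    ("20151021", "Air", "Search", 1445001, "A580", 1900, "City5"),
    ("20151021", "Air", "Search", 1445001, "A580", 1900, "City6"),
    ("20151021", "Air", "Select", 1445001, "A580", 7380, "Vendor3"),
]

# Same shape as the query above: compute the previous label per user,
# then keep only the 'Select' rows. `seq` is a hypothetical tie-breaker
# (insertion order) that is NOT part of the original table.
QUERY = """
select user_id, prev as city, hits_eventInfo_Label as vendor
from (
  select *,
    lag(hits_eventInfo_Label, 1)
      over (partition by user_id order by hits_time, seq) as prev
  from hits
)
where hits_eventInfo_Action = 'Select'
order by seq
"""


def city_vendor_pairs(rows=ROWS):
    """Load the rows into an in-memory SQLite table and run QUERY."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table hits ("
        "seq integer, date text, hits_eventInfo_Category text, "
        "hits_eventInfo_Action text, session_id integer, user_id text, "
        "hits_time integer, hits_eventInfo_Label text)"
    )
    conn.executemany(
        "insert into hits values (?, ?, ?, ?, ?, ?, ?, ?)",
        [(i,) + row for i, row in enumerate(rows)],
    )
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result
```

Calling city_vendor_pairs() on these rows yields one (user_id, city, vendor) tuple per 'Select' row: City1/Vendor1 for A232, City3/Vendor2 for A111 and City6/Vendor3 for A580. The earlier searches drop out because the lag is evaluated on every row but only the 'Select' rows are kept, so the previous row is always the last search before the selection. Note that the surviving rows are the 'Select' rows, so hits_eventInfo_Action and hits_time come from the selection, not from the search as in the required output table.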