Question

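I'm working with applicant pipeline data and need to count the number of applicants who entered each stage of the pipeline. If an applicant skipped a stage, I still need to count them as having entered it. Here is an example of what the data looks like for one applicant: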

Stage name | Entered on
Application Review | 9/7/2018
Recruiter Screen | 9/10/2018
Phone Interview | blank
Interview | 9/17/2018
Interview 2 | 9/20/2018
Offer | blank

This is what the table looks like:

CREATE TABLE application_stages (
    application_id bigint,
    stage_id       bigint,
    entered_on     timestamp without time zone,
    exited_on      timestamp without time zone,
    stage_name     character varying
);

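In this example, I want to count the applicant for Application Review through Interview 2 (including the skipped/blank Phone Interview stage), but not for Offer. How would I write this in SQL? (The data is stored in Amazon Redshift and queried with SQL Workbench.)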

Also, please let me know if there is anything I could add to my question to make the problem/solution clearer.

Answer 1

You can hard-code the stages of the pipeline in an event_list table like this:

id | stage_name
1 | first stage 
2 | second stage 
3 | third stage 
4 | fourth stage

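UPD: the deeper the stage is in the funnel, the higher its ID. That lets you compare stages: third stage is deeper than second stage because 3 > 2. So if you need to find the people who reached the second stage, that includes everyone who has an event with id = 2 or an event with id > 2, i.e. an event deeper in the funnel.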
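As a rough sketch, the lookup table could be created like this for the stages in the question. The table and column names follow this answer; the ids and the choice of integer type are my assumptions, and the only thing that matters is that a deeper stage gets a higher id.

```sql
-- Hypothetical lookup table: one row per pipeline stage.
-- The ids are illustrative; they only need to increase with funnel depth.
CREATE TABLE event_list (
    id         integer,
    stage_name character varying
);

INSERT INTO event_list (id, stage_name) VALUES
    (1, 'Application Review'),
    (2, 'Recruiter Screen'),
    (3, 'Phone Interview'),
    (4, 'Interview'),
    (5, 'Interview 2'),
    (6, 'Offer');
```

The stage_name values have to match the ones stored in the event data exactly, since the join is on that column.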

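If second stage was skipped and third stage was recorded for someone, you can still treat that person as having reached the second stage by joining the event data to this table on stage_name and counting the records with id >= 2, e.g.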

select count(distinct user_id)
from event_data t1
join event_list t2
using (stage_name)
where t2.id>=2
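The same idea can be extended to return a count for every stage at once. This is only a sketch against the application_stages table from the question: it assumes an event_list table shaped like the one in this answer (id, stage_name) and assumes that a blank entered_on is stored as NULL.

```sql
-- For each applicant, find the deepest stage they actually entered,
-- then count them for that stage and every shallower one.
WITH deepest AS (
    SELECT st.application_id,
           MAX(el.id) AS max_id
    FROM application_stages AS st
    JOIN event_list AS el
      ON el.stage_name = st.stage_name
    WHERE st.entered_on IS NOT NULL      -- ignore blank (skipped) rows
    GROUP BY st.application_id
)
SELECT s.id,
       s.stage_name,
       COUNT(d.application_id) AS applicants_reached
FROM event_list AS s
LEFT JOIN deepest AS d
  ON d.max_id >= s.id                    -- reached this stage or a deeper one
GROUP BY s.id, s.stage_name
ORDER BY s.id;
```

For the sample applicant in the question, the deepest entered stage is Interview 2, so they would be counted for Application Review through Interview 2 (including Phone Interview) and not for Offer.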

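Alternatively, you can left join your events table to event_list and use the lag function to fill in the gaps. lag returns the value from the previous row (i.e. in the case above, it would assign the timestamp of first stage to second stage).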
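A minimal sketch of the lag variant, written against the application_stages table from the question. It assumes that skipped stages already exist as rows with a NULL entered_on (as in the sample data) and that event_list has the id and stage_name columns described in this answer; filled_entered_on and funnel_position are names made up for the example.

```sql
-- Fill a blank entered_on with the value from the previous stage's row.
SELECT st.application_id,
       el.id AS funnel_position,
       st.stage_name,
       st.entered_on,
       COALESCE(
           st.entered_on,
           LAG(st.entered_on) OVER (
               PARTITION BY st.application_id
               ORDER BY el.id
           )
       ) AS filled_entered_on
FROM application_stages AS st
JOIN event_list AS el
  ON el.stage_name = st.stage_name;
```

Two things to keep in mind: plain lag only looks back one row, so two blank stages in a row would leave the second one empty unless IGNORE NULLS is added, and looking backwards also fills a blank stage at the end of the pipeline (Offer in the question's example) that the applicant never reached.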

Answer 2

Here is the SQL I ended up with. Thanks for the ideas, @AlexYes!

select stage_name,
       application_stages.application_id,
       entered_on,
       -- if the stage is blank, borrow the entry date of the next stage that was entered
       case when entered_on is NULL
            then lead(entered_on, 1) ignore nulls over (
                     partition by application_stages.application_id
                     -- order the stages by their position in the pipeline
                     order by case stage_name
                                  when 'Application Review' then 1
                                  when 'Recruiter Screen' then 2
                                  when 'Phone Interview' then 3
                                  when 'Interview' then 4
                                  when 'Interview 2' then 5
                                  when 'Offer' then 6
                                  when 'Hired' then 7
                              end)
            else entered_on
       end as for_count,
       exited_on
from application_stages

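I realize the SQL above doesn't give me the counts, but I'm doing the counting in Tableau. The format above is nice to have in case I need to do other calculations on the new "for_count" field.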
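If the counts were needed in SQL rather than in Tableau, one possible way is to wrap the query above in a CTE and aggregate on for_count. This is a sketch only; the names filled and applicants_entered are made up for the example.

```sql
WITH filled AS (
    SELECT application_id,
           stage_name,
           CASE WHEN entered_on IS NULL
                THEN LEAD(entered_on, 1) IGNORE NULLS OVER (
                         PARTITION BY application_id
                         ORDER BY CASE stage_name
                                      WHEN 'Application Review' THEN 1
                                      WHEN 'Recruiter Screen' THEN 2
                                      WHEN 'Phone Interview' THEN 3
                                      WHEN 'Interview' THEN 4
                                      WHEN 'Interview 2' THEN 5
                                      WHEN 'Offer' THEN 6
                                      WHEN 'Hired' THEN 7
                                  END)
                ELSE entered_on
           END AS for_count
    FROM application_stages
)
-- Count an applicant for a stage only when for_count was filled in.
SELECT stage_name,
       COUNT(DISTINCT CASE WHEN for_count IS NOT NULL
                           THEN application_id END) AS applicants_entered
FROM filled
GROUP BY stage_name;
```

Putting the condition inside COUNT instead of a WHERE clause keeps stages that nobody reached in the output with a count of 0.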

Redshift SQL: skipped sequence
