Question

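I want to identify users who visited section a and then subsequently visited section b, given the following data structure. The table contains 300,000 rows and is updated daily with 8,000 rows: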

**USERID**  **VISITID**     **SECTION**   Desired Solution--> **Conversion**
   1             1               a                                      0
   1             2               a                                      0
   2             1               b                                      0
   2             1               b                                      0
   2             1               b                                      0
   1             3               b                                      1

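Ideally, I would like a new column that flags the visit to section b. For example, on the third visit, user 1 visited section b for the first time. I tried to do this with a CASE WHEN statement, but after many failed attempts I am not sure it is even possible with CASE WHEN. I feel I should take a different approach, I am just not sure what that approach should be. I also have a date column available.

Any suggestions on new ways to approach the problem would be appreciated. Thanks!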

Answer 1

select t.*, case when v.ts is null then 0 else 1 end as conversion
  from tbl t
  left join (select *
               from tbl x
              where section = 'b'
                and exists (select 1
                       from tbl y
                      where y.userid = x.userid
                        and y.section = 'a'
                        and y.ts < x.ts)) v
    on t.userid = v.userid
   and t.visitid = v.visitid
   and t.section = v.section

Fiddle: http://sqlfiddle.com/#!15/5b954/5/0

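I added sample timestamp data (the `ts` column), because that field is needed to determine whether a visit to a came before or after a visit to b.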
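For anyone who wants to try the query above without a database server, here is a minimal sketch that runs it against an in-memory SQLite database from Python. The `ts` values and the helper name `run_exists_query` are made up for illustration, and SQLite stands in for the real engine, so treat it as a sanity check rather than a Redshift benchmark.

```python
import sqlite3

# Hypothetical sample rows: (userid, visitid, section, ts).
# The ts values are invented; the question only says a date column exists.
SAMPLE_ROWS = [
    (1, 1, "a", 1),
    (1, 2, "a", 2),
    (2, 1, "b", 3),
    (2, 1, "b", 4),
    (2, 1, "b", 5),
    (1, 3, "b", 6),
]

# Same logic as the query above: flag a 'b' row when the same user
# has an earlier 'a' row.
EXISTS_QUERY = """
select t.userid, t.visitid, t.section,
       case when v.ts is null then 0 else 1 end as conversion
  from tbl t
  left join (select *
               from tbl x
              where section = 'b'
                and exists (select 1
                              from tbl y
                             where y.userid = x.userid
                               and y.section = 'a'
                               and y.ts < x.ts)) v
    on t.userid = v.userid
   and t.visitid = v.visitid
   and t.section = v.section
 order by t.ts
"""


def run_exists_query(rows):
    """Load rows into a throwaway table and return the flagged result."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table tbl (userid int, visitid int, section text, ts int)")
    conn.executemany("insert into tbl values (?, ?, ?, ?)", rows)
    result = conn.execute(EXISTS_QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in run_exists_query(SAMPLE_ROWS):
        print(row)
```

With these rows, only user 1's third visit ends up with `conversion = 1`, which matches the desired column in the question.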

To use analytic functions instead, you could write:

(I also made it so that only the first occurrence of b after an a is flagged with 1.)

select t.*,
       case
         when v.first_b_after_a is not null
         then 1
         else 0
        end as conversion
  from tbl t
  left join (select userid, min(ts) as first_b_after_a
               from (select t.*,
                            sum( case when t.section = 'a' then 1 end)
                                  over( partition by userid
                                        order by ts ) as a_sum
                       from tbl t) x
              where section = 'b'
                and a_sum is not null
              group by userid) v
    on t.userid = v.userid
   and t.ts = v.first_b_after_a

Fiddle: http://sqlfiddle.com/#!1/fa88f/2/0
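To see the "first b after a" behaviour in isolation, here is a small sketch that runs the same running-sum idea in an in-memory SQLite database (this assumes SQLite 3.25 or newer, which added window functions). The sample rows, the `ts` values and the helper name `run_window_sum_query` are assumptions for illustration. User 1 visits b twice after a, so only the earlier of the two should be flagged.

```python
import sqlite3

# Hypothetical rows: (userid, visitid, section, ts); ts values are invented.
WINDOW_ROWS = [
    (1, 1, "a", 1),
    (1, 2, "a", 2),
    (1, 3, "b", 3),
    (1, 4, "b", 4),  # second b after a: should NOT be flagged
    (2, 1, "b", 5),
    (2, 2, "b", 6),  # user 2 never visits a
]

# a_sum is a running count of 'a' visits per user, ordered by ts.
# It stays NULL until the user's first 'a', so "a_sum is not null"
# means "some a happened at or before this row".
WINDOW_SUM_QUERY = """
select t.userid, t.visitid, t.section,
       case when v.first_b_after_a is not null then 1 else 0 end as conversion
  from tbl t
  left join (select userid, min(ts) as first_b_after_a
               from (select i.*,
                            sum(case when i.section = 'a' then 1 end)
                              over (partition by i.userid order by i.ts) as a_sum
                       from tbl i) x
              where section = 'b'
                and a_sum is not null
              group by userid) v
    on t.userid = v.userid
   and t.ts = v.first_b_after_a
 order by t.userid, t.ts
"""


def run_window_sum_query(rows):
    """Load rows into a throwaway table and return the flagged result."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table tbl (userid int, visitid int, section text, ts int)")
    conn.executemany("insert into tbl values (?, ?, ?, ?)", rows)
    result = conn.execute(WINDOW_SUM_QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in run_window_sum_query(WINDOW_ROWS):
        print(row)
```

Note that the final join matches on `ts`, so this sketch relies on timestamps being unique per user; with ties, more than one row could be flagged.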

Answer 2

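When using Redshift, correlated subqueries should be avoided at all costs. Keep in mind that Redshift has no indexes, so it has to rescan and re-match the column data for every value in the parent query, which results in an O(n^2) operation (in this particular case, going from scanning 300 thousand values to 90 billion).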
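The 90 billion figure is just the row count squared. A quick back-of-the-envelope check, assuming the worst case where every row is compared against every other row:

```python
# Worst-case comparison count for a correlated subquery that rescans
# the whole table once per outer row (an assumption for illustration).
row_count = 300_000
worst_case_comparisons = row_count * row_count

print(f"{worst_case_comparisons:,}")  # 90,000,000,000
```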

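When you want to look across a range of rows, the best approach is to use an analytic function. There are several options depending on how your data is structured, but in the simplest case you could use something like: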

select case 
       when section != lag(section) over (partition by userid order by visitid)
       then 1
       else 0
       end
 from ...

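This assumes that the data for userid 2 has an incrementing visitid, as shown below. If it does not, you could order by a timestamp column instead.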

**USERID**  **VISITID**     **SECTION**   Desired Solution--> **Conversion**
   1             1               a                                      0
   1             2               a                                      0
   2             1               b                                      0
   2            *2*              b                                      0
   2            *3*              b                                      0
   1             3               b                                      1
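Here is a rough sketch of the `lag` approach run against the table above in an in-memory SQLite database (assuming SQLite 3.25 or newer for window functions). One thing to be aware of: the comparison `section != lag(section)` flags any change of section, including b followed by a. The extra user 3 and the stricter `A_THEN_B_QUERY` variant are my own additions for illustration and are not part of the original answer; the stricter variant only flags a b that directly follows an a.

```python
import sqlite3

# Rows from the table above plus a hypothetical user 3 who goes b -> a.
LAG_ROWS = [
    (1, 1, "a"), (1, 2, "a"), (1, 3, "b"),
    (2, 1, "b"), (2, 2, "b"), (2, 3, "b"),
    (3, 1, "b"), (3, 2, "a"),
]

WINDOW = "over (partition by userid order by visitid)"

# As in the answer: flag any row whose section differs from the previous one.
ANY_CHANGE_QUERY = f"""
select userid, visitid, section,
       case when section != lag(section) {WINDOW} then 1 else 0 end as conversion
  from tbl
 order by userid, visitid
"""

# Hypothetical stricter variant: flag only a 'b' directly preceded by an 'a'.
A_THEN_B_QUERY = f"""
select userid, visitid, section,
       case when section = 'b' and lag(section) {WINDOW} = 'a'
            then 1 else 0 end as conversion
  from tbl
 order by userid, visitid
"""


def run_lag_query(query, rows=LAG_ROWS):
    """Load rows into a throwaway table and return only the conversion flags."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table tbl (userid int, visitid int, section text)")
    conn.executemany("insert into tbl values (?, ?, ?)", rows)
    flags = [row[3] for row in conn.execute(query)]
    conn.close()
    return flags


if __name__ == "__main__":
    print(run_lag_query(ANY_CHANGE_QUERY))  # user 3's a visit is flagged too
    print(run_lag_query(A_THEN_B_QUERY))    # only user 1's b visit is flagged
```

The first row of each user has no previous row, so `lag` returns NULL there, the comparison evaluates to NULL, and the `else 0` branch applies.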

How to identify subsequent user actions based on previous visits
