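JOIN on MIN(), where the MIN() must be greater than the value from the left side of the join

Time: 2014-02-13 21:01:31

Tags: mysql sql postgresql

I'm trying to run some conversion analysis on an existing data load in a few SQL databases.

The data structure itself is very simple. It's just a list of actors (think user_id) and the name of the thing each one did. It looks like this (there is other data, but it isn't used in this query):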

CREATE TABLE views(
    project_id integer not null,
    name varchar(128) not null,
    datetime timestamp not null,
    actor varchar(256) not null
)

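The goal is standard conversion analysis: the number of people who performed action A, then B, C, D, E and so on, plus the average time between steps.

To be clear, the funnel steps dictate order but not exclusivity. For example, a funnel looking for names A, B, C should include an actor whose sequence is B, A, B, D, C (because it contains A, then B, then C, even though there are other steps in between).

Currently I'm querying this table with the following (each join represents the next step in the conversion funnel):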

SELECT count(actor), count(span2), avg(span2), count(span3), avg(span3), count(span4), avg(span4), count(span5), avg (span5)
FROM
(
    SELECT e1.actor, 
        DATEDIFF(SECOND, MIN(e1.datetime), MIN(e2.datetime)) AS span2,
        DATEDIFF(SECOND, MIN(e2.datetime), MIN(e3.datetime)) AS span3,
        DATEDIFF(SECOND, MIN(e3.datetime), MIN(e4.datetime)) AS span4,
        DATEDIFF(SECOND, MIN(e4.datetime), MIN(e5.datetime)) AS span5
    FROM views AS e1
    LEFT JOIN (SELECT actor, MIN(datetime) as datetime FROM views WHERE name = 'Action 2' group by actor) as e2 ON e1.actor = e2.actor AND e2.datetime > e1.datetime
    LEFT JOIN (SELECT actor, MIN(datetime) as datetime FROM views WHERE name = 'Action 3' group by actor) as e3 ON e1.actor = e3.actor AND e3.datetime > e2.datetime
    LEFT JOIN (SELECT actor, MIN(datetime) as datetime FROM views WHERE name = 'Action 4' group by actor) as e4 ON e1.actor = e4.actor AND e4.datetime > e3.datetime
    LEFT JOIN (SELECT actor, MIN(datetime) as datetime FROM views WHERE name = 'Action 5' group by actor) as e5 ON e1.actor = e5.actor AND e5.datetime > e4.datetime
    WHERE e1.project_id = 1 and e1.name = 'Action 1'
    GROUP BY e1.actor
) AS aggregates

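This is very fast on the data set (< 1s on 10M rows). The problem is that it doesn't actually return the correct result. Each joined subselect takes MIN(datetime) over all of an actor's rows for that step. If an actor's sequence is B, A, B, the actor is not counted for B, because MIN(B) is earlier than MIN(A) and the join condition e2.datetime > e1.datetime fails.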
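A minimal sketch that reproduces the miss, cut down to two funnel steps and run against an in-memory SQLite database through Python's sqlite3 module. The actor names and timestamps are made up for illustration, and SQLite is only a stand-in for the real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE views(
        project_id integer not null,
        name varchar(128) not null,
        datetime timestamp not null,
        actor varchar(256) not null
    )
""")
conn.executemany(
    "INSERT INTO views VALUES (1, ?, ?, ?)",
    [
        # alice (hypothetical): Action 2, Action 1, Action 2 -> the "B, A, B" case
        ("Action 2", "2014-02-01 10:00:00", "alice"),
        ("Action 1", "2014-02-01 10:05:00", "alice"),
        ("Action 2", "2014-02-01 10:07:00", "alice"),
        # bob (hypothetical): Action 1, Action 2 -> the simple case
        ("Action 1", "2014-02-01 11:00:00", "bob"),
        ("Action 2", "2014-02-01 11:01:00", "bob"),
    ],
)

# Same join shape as the question's query, reduced to the first two steps.
rows = conn.execute("""
    SELECT e1.actor, MIN(e1.datetime) AS first_a1, MIN(e2.datetime) AS first_a2
    FROM views AS e1
    LEFT JOIN (SELECT actor, MIN(datetime) AS datetime
               FROM views WHERE name = 'Action 2' GROUP BY actor) AS e2
           ON e1.actor = e2.actor AND e2.datetime > e1.datetime
    WHERE e1.project_id = 1 AND e1.name = 'Action 1'
    GROUP BY e1.actor
    ORDER BY e1.actor
""").fetchall()

for row in rows:
    print(row)
```

alice's first_a2 comes back as NULL (None in Python) even though she performed Action 2 two minutes after Action 1, because the subselect only ever offers her 10:00 row.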

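Given a set of actors who have each performed a list of views, I need to check whether each actor has performed view A, then later view B, then later view C, regardless of what they did in between. B,A,B,C qualifies; A,B,B,C qualifies; A,B,Z,C qualifies; A,Z,C does not.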
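The rule can be pinned down with a small reference implementation outside SQL. This is only a sketch: funnel_positions and qualifies are hypothetical helper names, and the input is assumed to be one actor's view names already sorted by datetime, with no two views sharing a timestamp.

```python
from typing import List, Optional, Sequence


def funnel_positions(sequence: Sequence[str], steps: Sequence[str]) -> List[Optional[int]]:
    """For each funnel step, return the index of its first occurrence
    strictly after the previous matched step (None once the chain breaks)."""
    events = list(sequence)
    positions: List[Optional[int]] = []
    start = 0
    for step in steps:
        try:
            idx = events.index(step, start)  # earliest match at or after `start`
        except ValueError:
            # This step never happens after the previous one; later steps cannot match.
            positions.extend([None] * (len(steps) - len(positions)))
            break
        positions.append(idx)
        start = idx + 1  # the next step must come strictly later
    return positions


def qualifies(sequence: Sequence[str], steps: Sequence[str]) -> bool:
    """True when the actor completed every funnel step in order."""
    return all(p is not None for p in funnel_positions(sequence, steps))


funnel = ["A", "B", "C"]
for seq in ("BABC", "ABBC", "ABZC", "AZC"):
    print(seq, qualifies(seq, funnel))
```

Taking the earliest match at every step is enough to decide whether an actor qualifies, and it is the same "smallest datetime greater than the previous step" rule the join is supposed to apply.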

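To query this "correctly" I can remove the MIN(datetime) from the joined subselects and do the MIN() outside the join. However, that takes a very long time, because every row is joined multiple times for each funnel step (steps are commonly repeated and out of order). The cross product is huge in this case - the query planner says 21 quadrillion rows! (21,666,755,307,950,608). This is obviously no longer a sub-1-second query.

What I want to achieve is a join on the MIN value, where that MIN is "the smallest value that is greater than the previous join step". So for step A to B, B.datetime is the single smallest B.datetime that is still greater than A.datetime. Something like (not valid SQL!):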

.... 
LEFT JOIN (SELECT actor, datetime FROM views WHERE name = 'Action 2') as e2 
ON e1.actor = e2.actor AND e2.datetime > e1.datetime HAVING MIN(e.datetime)
....

Any suggestions on how to achieve this?

MySQL- or PostgreSQL-specific functions are fine, if appropriate.
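One portable way to express "the smallest datetime greater than the previous step" is a chain of correlated scalar subqueries, each nested one level so it can refer to the timestamp found by the step before it. The sketch below is not taken from either answer. It is a three-step illustration run on SQLite through Python's sqlite3 module, and the actors, timestamps and the aliases s1, s2, t1, t2, t3 are all made up. Like the question's query, it filters project_id on the first step only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE views(
        project_id integer not null,
        name varchar(128) not null,
        datetime timestamp not null,
        actor varchar(256) not null
    )
""")
conn.executemany(
    "INSERT INTO views VALUES (1, ?, ?, ?)",
    [
        # alice: 2, 1, 2, 4, 3 -> completes 1 -> 2 -> 3 despite the early Action 2
        ("Action 2", "2014-02-01 10:00:00", "alice"),
        ("Action 1", "2014-02-01 10:05:00", "alice"),
        ("Action 2", "2014-02-01 10:07:00", "alice"),
        ("Action 4", "2014-02-01 10:08:00", "alice"),
        ("Action 3", "2014-02-01 10:10:00", "alice"),
        # bob: 1, 3 -> never performs Action 2, so step 3 must not count
        ("Action 1", "2014-02-01 11:00:00", "bob"),
        ("Action 3", "2014-02-01 11:02:00", "bob"),
        # carol: 1, 2 -> stops after step 2
        ("Action 1", "2014-02-01 12:00:00", "carol"),
        ("Action 2", "2014-02-01 12:01:00", "carol"),
        # dave: 3, 1, 2 -> Action 3 happened too early to count
        ("Action 3", "2014-02-01 09:00:00", "dave"),
        ("Action 1", "2014-02-01 09:30:00", "dave"),
        ("Action 2", "2014-02-01 09:40:00", "dave"),
    ],
)

# One row per actor at every level, so no cross product builds up.
rows = conn.execute("""
    SELECT s2.actor, s2.t1, s2.t2,
           (SELECT MIN(v.datetime) FROM views AS v
             WHERE v.actor = s2.actor AND v.name = 'Action 3'
               AND v.datetime > s2.t2) AS t3
    FROM (
        SELECT s1.actor, s1.t1,
               (SELECT MIN(v.datetime) FROM views AS v
                 WHERE v.actor = s1.actor AND v.name = 'Action 2'
                   AND v.datetime > s1.t1) AS t2
        FROM (SELECT actor, MIN(datetime) AS t1
              FROM views
              WHERE project_id = 1 AND name = 'Action 1'
              GROUP BY actor) AS s1
    ) AS s2
    ORDER BY s2.actor
""").fetchall()

for row in rows:
    print(row)
```

When a step is missing its timestamp is NULL, the comparison in the next subquery matches nothing, and every later step is NULL as well. The question's outer count(...) and avg(DATEDIFF(...)) wrapper could then be applied to t1, t2, t3 unchanged. An index on (actor, name, datetime) would probably help each lookup, though that is untested here. PostgreSQL 9.3 and later also support LEFT JOIN LATERAL, which may express the same lookup more directly.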

2 Answers:

Answer 0: (score: 2)

I would suggest just looking at all of the transition times. Here is how to do that in SQL:

SELECT prevName, name, count(*) as NumTransitions,
       avg(DATEDIFF(SECOND, prevDateTime, "datetime"))
FROM (SELECT e1.actor, "datetime", name,
             lag(name) over (partition by actor order by "datetime") as prevName,
             lag("datetime") over (partition by actor order by "datetime") as prevDateTime
      FROM views AS e1
      WHERE e1.project_id = 1 
     ) t
GROUP BY prevName, name;

If you want the number of "actors" for each transition, you can add count(distinct actor).
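A small sketch of what that looks like, again on in-memory SQLite through Python's sqlite3 module (window functions need SQLite 3.25 or later). The sample rows, the NumActors alias and the prevName IS NOT NULL filter are additions for illustration. The filter drops each actor's first row, which has no previous view. The time average is left out to keep the example free of database-specific date functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE views(
        project_id integer not null,
        name varchar(128) not null,
        datetime timestamp not null,
        actor varchar(256) not null
    )
""")
conn.executemany(
    "INSERT INTO views VALUES (1, ?, ?, ?)",
    [
        # alice (hypothetical): 1, 2, 1, 2
        ("Action 1", "2014-02-01 10:00:00", "alice"),
        ("Action 2", "2014-02-01 10:01:00", "alice"),
        ("Action 1", "2014-02-01 10:02:00", "alice"),
        ("Action 2", "2014-02-01 10:03:00", "alice"),
        # bob (hypothetical): 1, 2
        ("Action 1", "2014-02-01 11:00:00", "bob"),
        ("Action 2", "2014-02-01 11:01:00", "bob"),
    ],
)

rows = conn.execute("""
    SELECT prevName, name,
           count(*) AS NumTransitions,
           count(DISTINCT actor) AS NumActors
    FROM (SELECT actor, name,
                 lag(name) OVER (PARTITION BY actor ORDER BY datetime) AS prevName
          FROM views
          WHERE project_id = 1) AS t
    WHERE prevName IS NOT NULL  -- skip each actor's first view
    GROUP BY prevName, name
    ORDER BY prevName, name
""").fetchall()

for row in rows:
    print(row)
```

Note that lag() only pairs directly adjacent views, so a sequence such as A, Z, B is counted as A to Z and Z to B, not as A to B. That differs from the funnel rule in the question, which allows other steps in between.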

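Answer 1: (score: 0)

At first glance, I think the problem is the SQL you are using for the inline views.

Try the following. If it doesn't work, could you post some sample data and the desired results?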

SELECT count(actor), count(span2), avg(span2), count(span3), avg(span3), count(span4), avg(span4), count(span5), avg (span5)
FROM
(
    SELECT e1.actor, 
        DATEDIFF(SECOND, MIN(e1.datetime), MIN(e2.datetime)) AS span2,
        DATEDIFF(SECOND, MIN(e2.datetime), MIN(e3.datetime)) AS span3,
        DATEDIFF(SECOND, MIN(e3.datetime), MIN(e4.datetime)) AS span4,
        DATEDIFF(SECOND, MIN(e4.datetime), MIN(e5.datetime)) AS span5
FROM views as e1, views as e2, views as e3, views as e4, views as e5
where e1.actor = e2.actor and e1.actor = e3.actor and e1.actor = e4.actor and e1.actor = e5.actor
  and e2.datetime = (select min(x.datetime) from views x where x.name = 'Action 2' and x.actor = e2.actor and x.datetime > e1.datetime)
  and e3.datetime = (select min(x.datetime) from views x where x.name = 'Action 3' and x.actor = e3.actor and x.datetime > e2.datetime)
  and e4.datetime = (select min(x.datetime) from views x where x.name = 'Action 4' and x.actor = e4.actor and x.datetime > e3.datetime)
  and e5.datetime = (select min(x.datetime) from views x where x.name = 'Action 5' and x.actor = e5.actor and x.datetime > e4.datetime)
  and e1.project_id = 1 and e1.name = 'Action 1'
    GROUP BY e1.actor
) AS aggregates