Count sessions from an IP within a specific time window

Time: 2017-12-12 23:14:37

Tags: mysql sql amazon-redshift

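I am using Amazon Redshift.

I have a list of IP addresses. There may be many entries from the same IP address with different session_id values within a given time window (say 15 minutes, for the sake of argument). I want to count those sessions for any given IP address within that time window.

In other words, I want to know how many sessions were logged from a given IP address within any 15-minute window.

So I came up with the following query: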

SELECT t1.client_ip,
       COUNT(DISTINCT t2.session_id) AS sessions
FROM t AS t1
  JOIN t AS t2
    ON t1.client_ip = t2.client_ip
   AND t2.created_at BETWEEN t1.created_at
                         AND DATEADD(MINUTE, 15, t1.created_at)
GROUP BY t1.client_ip
HAVING COUNT(DISTINCT t2.session_id) >= 5
ORDER BY t1.client_ip

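Unfortunately, the query takes far too long to run and also returns wrong results. There must be a better way to achieve this. The table contains about 18 million distinct IP addresses and about 400 million rows in total.

Here is some sample data: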

Client_ip    Session_id    created_at
1.0.0.0       abc         <timestamp>
1.0.0.0       def         <timestamp> + 5 minutes
1.0.0.0       ghi         <timestamp> + 25 minutes
2.0.0.0       jkl         <timestamp1>
2.0.0.0       mno         <timestamp1> + 10 minutes
2.0.0.0       pqr         <timestamp1> + 20 minutes

Required result:

Client_ip    #Sessions
1.0.0.0       2          (sessions abc and def)
2.0.0.0       2          (sessions mno and pqr)

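Any help is much appreciated.

Edit

Perhaps the question was not clear. I apologize for that.

I do not want a fixed set of time windows, where I would create intervals that are 15 minutes apart. I want to count the number of sessions from a given IP address in any arbitrary 15-minute window.

For example: in the sample data I posted, sessions mno and pqr should be counted (for their IP address) because they are within 15 minutes of each other. Likewise, sessions abc and def should be counted for their IP address because they are no more than 15 minutes apart. I am not defining an external start time for the window, and there should be no need to: ideally, the query should compare each record with every other record that has the same IP address.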

Here is the output of explain <query>:

XN Subquery Scan derived_table1  (cost=6516525010733.39..6516525010733.41 rows=2 width=524)
  ->  XN Merge  (cost=6516525010733.39..6516525010733.39 rows=2 width=1032)
        Merge Key: t1.client_ip
        ->  XN Network  (cost=6516525010733.39..6516525010733.39 rows=2 width=1032)
              Send to leader
              ->  XN Sort  (cost=6516525010733.39..6516525010733.39 rows=2 width=1032)
                    Sort Key: t1.client_ip
                    ->  XN HashAggregate  (cost=5516525010733.36..5516525010733.38 rows=2 width=1032)
                          Filter: (count(DISTINCT session_id) >= 10)
                          ->  XN Hash Join DS_DIST_BOTH  (cost=6284418.61..5516506756947.79 rows=2433838076 width=1032)
                                Outer Dist Key: t2.client_ip
                                Inner Dist Key: t1.client_ip
                                Hash Cond: (("outer".client_ip)::text = ("inner".client_ip)::text)
                                Join Filter: (("inner".created_at <= "outer".created_at) AND ("outer".created_at <= date_add('minute'::text, 15::bigint, "inner".created_at)))
                                ->  XN Seq Scan on fbs_page_view_staging t2  (cost=0.00..6279185.96 rows=2093062 width=1040)
                                      Filter: ((created_at <= '2017-09-30 00:00:00'::timestamp without time zone) AND (created_at >= '2017-09-01 00:00:00'::timestamp without time zone))
                                ->  XN Hash  (cost=6279185.96..6279185.96 rows=2093062 width=524)
                                      ->  XN Seq Scan on fbs_page_view_staging t1  (cost=0.00..6279185.96 rows=2093062 width=524)
                                            Filter: ((created_at <= '2017-09-30 00:00:00'::timestamp without time zone) AND (created_at >= '2017-09-01 00:00:00'::timestamp without time zone))
----- Tables missing statistics: fbs_page_view_staging -----
----- Update statistics by running the ANALYZE command on these tables -----

2 answers:

Answer 0 (score: 0)

This is what I came up with...

SELECT t1.client_ip, t1.session_id, COUNT(DISTINCT t2.session_id)
FROM  ( SELECT client_ip, session_id, MIN(created_at) created_at
                     FROM   fbs_page
                     GROUP BY client_ip, session_id) AS t1 
       INNER JOIN (SELECT client_ip, session_id, MIN(created_at) created_at
                     FROM   fbs_page
                     GROUP BY client_ip, session_id) AS t2
         ON t1.client_ip = t2.client_ip
            AND t1.session_id != t2.session_id 
            AND t1.created_at 
            BETWEEN DATEADD(MINUTE,-15,t2.created_at) AND t2.created_at
GROUP BY t1.client_ip, t1.session_id
ORDER  BY 1, 2;

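After some discussion, I think this may be close to what you need. You can add a WHERE clause to filter the rows as needed, for example to narrow the date range, so that it runs faster.

Answer 1 (score: 0)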
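If the goal is only to find IPs that reach some threshold of sessions inside any 15-minute window, it may be possible to avoid the self-join entirely. The following is a rough sketch of a different technique (a LAG window function over one row per session), not a drop-in replacement for the query above. It assumes the table name fbs_page used above, the threshold of 5 from the question's HAVING clause, and the date bounds that appear in the question's EXPLAIN output; all three are placeholders to adjust.

```sql
-- Sketch only: table name, date bounds and the threshold (5) are assumptions.
WITH session_starts AS (
    -- One row per session: the first time it was seen for that IP.
    SELECT client_ip, session_id, MIN(created_at) AS started_at
    FROM fbs_page
    WHERE created_at BETWEEN '2017-09-01' AND '2017-09-30'
    GROUP BY client_ip, session_id
),
ranked AS (
    SELECT client_ip,
           started_at,
           -- Start time of the session 4 positions earlier for the same IP
           -- (NULL when fewer than 4 earlier sessions exist).
           LAG(started_at, 4) OVER (PARTITION BY client_ip
                                    ORDER BY started_at) AS fifth_back
    FROM session_starts
)
-- If a session and the one 4 positions before it start within 15 minutes
-- of each other, at least 5 sessions fall inside one 15-minute window.
SELECT DISTINCT client_ip
FROM ranked
WHERE fifth_back IS NOT NULL
  AND started_at <= DATEADD(MINUTE, 15, fifth_back)
ORDER BY client_ip;
```

This only reports which IPs hit the threshold, not how many sessions each window contains. For a threshold of N, the LAG offset would be N - 1.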

SELECT t1.client_ip, t1.WindowStart, COUNT(DISTINCT t2.session_id) AS sessions
FROM (
        SELECT DISTINCT client_ip, 
                        created_at as WindowStart, 
                        DATEADD(MINUTE,15,created_at) as WindowEnd
        FROM t
        -- Add a where clause in here if you want to reduce the number of rows that you're working with
        -- e.g. WHERE created_at BETWEEN 'some_arbitrary_date' AND 'another_arbitrary_date'
     ) t1
  INNER JOIN t as t2 ON t1.client_ip = t2.client_ip 
                    AND t2.created_at BETWEEN t1.WindowStart AND t1.WindowEnd
GROUP BY t1.client_ip, t1.WindowStart
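
The per-window approach returns one row per IP and window start. To get back to one row per IP, as in the question's required result, one option is to wrap it and keep the busiest window for each IP. This is a sketch under the same assumptions (table t with client_ip, session_id, created_at); the alias names and the threshold of 5, taken from the question's HAVING clause, are illustrative.

```sql
-- Sketch only: aliases (w, s, per_window) and the threshold are illustrative.
SELECT client_ip,
       MAX(sessions) AS max_sessions_in_window
FROM (
    -- Sessions seen in the 15 minutes following each distinct row timestamp.
    SELECT w.client_ip,
           w.window_start,
           COUNT(DISTINCT s.session_id) AS sessions
    FROM (SELECT DISTINCT client_ip, created_at AS window_start FROM t) AS w
      INNER JOIN t AS s
         ON s.client_ip = w.client_ip
        AND s.created_at BETWEEN w.window_start
                             AND DATEADD(MINUTE, 15, w.window_start)
    GROUP BY w.client_ip, w.window_start
) AS per_window
GROUP BY client_ip
HAVING MAX(sessions) >= 5
ORDER BY client_ip;
```

Grouping by window start before aggregating per IP matters: grouping by client_ip alone counts every session that lies within 15 minutes of any row, which merges all windows for that IP into one count. On the six sample rows in the question, the busiest window for each IP holds 2 sessions, so the threshold would have to be lowered to 2 to see them.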