Select records in SQL based on the count of events in a time interval

Time: 2018-03-28 10:05:18

Tags: sql select count subquery vertica

First, an example of my table:

+---------+-----------------+------------+-----------------+---------------------+
| user_id |      email      | home_phone | incoming_number |      date_time      |
+---------+-----------------+------------+-----------------+---------------------+
|       1 | dan@dan.com     |    8893432 |         5453455 | 2018-03-27 13:48:10 |
|       1 | dan@dan.com     |    8893432 |        65765489 | 2018-03-27 13:47:10 |
|       1 | dan@dan.com     |    8893432 |        65765489 | 2018-03-27 13:48:05 |
|       2 | sam@sam.com     |   16568675 |        65658403 | 2018-03-27 13:46:05 |
|       2 | sam@sam.com     |   16568675 |        57575748 | 2018-03-27 13:32:05 |
|       2 | sam@sam.com     |   16568675 |        76547946 | 2018-03-27 13:43:05 |
|       3 | allen@allen.com |   12345678 |        85768576 | 2018-03-27 13:46:05 |
|       3 | allen@allen.com |   12345678 |        65658403 | 2018-03-27 13:42:05 |
|       3 | allen@allen.com |   12345678 |        76547946 | 2018-03-27 13:43:05 |
|       3 | allen@allen.com |   12345678 |        76547946 | 2018-03-27 13:20:05 |
+---------+-----------------+------------+-----------------+---------------------+

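What do I want to accomplish?

I want to select every triple (user_id, email, home_phone) that has at least 3 distinct incoming_number values within a 10-minute time range. For example, in the table above the result is only (3, allen@allen.com, 12345678). The first user has only two distinct incoming_number values, and the second user's calls span a time range of more than 10 minutes.

Note: an incoming number can appear multiple times with different date_time values.

Each user_id has exactly 1 email and exactly 1 home_phone.

What have I tried so far? I thought maybe I should treat the first 3 columns as 1 key? And maybe do a DISTINCT on incoming_number and work it out from there somehow? I don't have many ideas.

What SQL query would solve this?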

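1 answer:

Answer 0: (score: 2)

If I understand you correctly, none of your groups satisfies both criteria: 3 distinct incoming_number-s and a duration between the first and last call of less than 10 minutes. So, for illustration, I added a user with the email match@match.com who satisfies both criteria. The query below contains the data in a WITH clause, and the final report shows all the intermediate results that together make up the conditions. Remove the HAVING clause to inspect the results for the rows that do not qualify.

Happy playing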

WITH
input(         user_id,email            ,home_phone,incoming_number,date_time) AS (
          SELECT     1,'dan@dan.com'    , 8893432  , 5453455       ,TIMESTAMP '2018-03-27 13:48:10'
UNION ALL SELECT     1,'dan@dan.com'    , 8893432  ,65765489       ,TIMESTAMP '2018-03-27 13:47:10'
UNION ALL SELECT     1,'dan@dan.com'    , 8893432  ,65765489       ,TIMESTAMP '2018-03-27 13:48:05'
UNION ALL SELECT     2,'sam@sam.com'    ,16568675  ,65658403       ,TIMESTAMP '2018-03-27 13:46:05'
UNION ALL SELECT     2,'sam@sam.com'    ,16568675  ,57575748       ,TIMESTAMP '2018-03-27 13:32:05'
UNION ALL SELECT     2,'sam@sam.com'    ,16568675  ,76547946       ,TIMESTAMP '2018-03-27 13:43:05'
UNION ALL SELECT     3,'allen@allen.com',12345678  ,85768576       ,TIMESTAMP '2018-03-27 13:46:05'
UNION ALL SELECT     3,'allen@allen.com',12345678  ,65658403       ,TIMESTAMP '2018-03-27 13:42:05'
UNION ALL SELECT     3,'allen@allen.com',12345678  ,76547946       ,TIMESTAMP '2018-03-27 13:43:05'
UNION ALL SELECT     3,'allen@allen.com',12345678  ,76547946       ,TIMESTAMP '2018-03-27 13:20:05'
UNION ALL SELECT     4,'match@match.com',62345677  ,85768576       ,TIMESTAMP '2018-03-27 13:11:05'
UNION ALL SELECT     4,'match@match.com',62345677  ,65658403       ,TIMESTAMP '2018-03-27 13:13:05'
UNION ALL SELECT     4,'match@match.com',62345677  ,76547946       ,TIMESTAMP '2018-03-27 13:18:05'
UNION ALL SELECT     4,'match@match.com',62345677  ,76547946       ,TIMESTAMP '2018-03-27 13:20:05'
)
SELECT
  user_id
, email
, home_phone
, MAX(date_time) - MIN(date_time) duration
, MAX(date_time) end_ts
, MIN(date_time) start_ts
, COUNT(DISTINCT incoming_number) incoming_number_count
FROM input
GROUP BY
  user_id
, email
, home_phone
HAVING MAX(date_time) - MIN(date_time) < INTERVAL '10 minutes'
   AND COUNT(DISTINCT incoming_number) >=3
;
user_id|email          |home_phone|duration         |end_ts             |start_ts           |incoming_number_count
      4|match@match.com|62,345,677|0 00:09:00.000000|2018-03-27 13:20:05|2018-03-27 13:11:05|                    3

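Second answer - now that I see what you are after, but keeping the original one:

In the case you describe, we need to go down the OLAP path.

We subtract from each row's date_time the date_time two rows earlier (using LAG()), which gives the time spanned by three consecutive calls. Since Vertica does not support COUNT(DISTINCT col) OVER(), we use the Vertica-specific CONDITIONAL_CHANGE_EVENT() OLAP function to count how often incoming_number changes: it yields 0 if the value never changes, 1 if it changes once and 2 if it changes twice, and two changes mean 3 distinct incoming_number-s: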

WITH
input(         user_id,email            ,home_phone,incoming_number,date_time) AS (
          SELECT     1,'dan@dan.com'    , 8893432  , 5453455       ,TIMESTAMP '2018-03-27 13:48:10'
UNION ALL SELECT     1,'dan@dan.com'    , 8893432  ,65765489       ,TIMESTAMP '2018-03-27 13:47:10'
UNION ALL SELECT     1,'dan@dan.com'    , 8893432  ,65765489       ,TIMESTAMP '2018-03-27 13:48:05'
UNION ALL SELECT     2,'sam@sam.com'    ,16568675  ,65658403       ,TIMESTAMP '2018-03-27 13:46:05'
UNION ALL SELECT     2,'sam@sam.com'    ,16568675  ,57575748       ,TIMESTAMP '2018-03-27 13:32:05'
UNION ALL SELECT     2,'sam@sam.com'    ,16568675  ,76547946       ,TIMESTAMP '2018-03-27 13:43:05'
UNION ALL SELECT     3,'allen@allen.com',12345678  ,85768576       ,TIMESTAMP '2018-03-27 13:46:05'
UNION ALL SELECT     3,'allen@allen.com',12345678  ,65658403       ,TIMESTAMP '2018-03-27 13:42:05'
UNION ALL SELECT     3,'allen@allen.com',12345678  ,76547946       ,TIMESTAMP '2018-03-27 13:43:05'
UNION ALL SELECT     3,'allen@allen.com',12345678  ,76547946       ,TIMESTAMP '2018-03-27 13:20:05'
)
,
w_filter_val AS (
SELECT
  *
, date_time - LAG(date_time,2) OVER(PARTITION BY user_id ORDER BY date_time) AS time4these3
, CONDITIONAL_CHANGE_EVENT(incoming_number) OVER(PARTITION BY user_id ORDER BY incoming_number) AS count_in_nbr_minus1
FROM input
)
SELECT * FROM w_filter_val ORDER BY 1;

 user_id |      email      | home_phone | incoming_number |      date_time      | time4these3 | count_in_nbr_minus1
---------+-----------------+------------+-----------------+---------------------+-------------+---------------------
       1 | dan@dan.com     |    8893432 |         5453455 | 2018-03-27 13:48:10 | 00:01       |                   0
       1 | dan@dan.com     |    8893432 |        65765489 | 2018-03-27 13:47:10 |             |                   1
       1 | dan@dan.com     |    8893432 |        65765489 | 2018-03-27 13:48:05 |             |                   1
       2 | sam@sam.com     |   16568675 |        57575748 | 2018-03-27 13:32:05 |             |                   0
       2 | sam@sam.com     |   16568675 |        65658403 | 2018-03-27 13:46:05 | 00:14       |                   1
       2 | sam@sam.com     |   16568675 |        76547946 | 2018-03-27 13:43:05 |             |                   2
       3 | allen@allen.com |   12345678 |        65658403 | 2018-03-27 13:42:05 |             |                   0
       3 | allen@allen.com |   12345678 |        76547946 | 2018-03-27 13:20:05 |             |                   1
       3 | allen@allen.com |   12345678 |        76547946 | 2018-03-27 13:43:05 | 00:23       |                   1
       3 | allen@allen.com |   12345678 |        85768576 | 2018-03-27 13:46:05 | 00:04       |                   2
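
For readers who are not on Vertica, the count_in_nbr_minus1 column in the output above can probably be reproduced with standard window functions. The following is a minimal sketch, not part of the original answer: it assumes that when the window is ordered by incoming_number itself, a change counter is equivalent to DENSE_RANK() minus 1. Only user 3's rows are included, and the email and home_phone columns are left out to keep it short.

```sql
-- Sketch (assumption: ordering by incoming_number makes DENSE_RANK() - 1
-- equivalent to CONDITIONAL_CHANGE_EVENT(incoming_number)).
WITH
input(user_id, incoming_number, date_time) AS (
          SELECT 3, 85768576, TIMESTAMP '2018-03-27 13:46:05'
UNION ALL SELECT 3, 65658403, TIMESTAMP '2018-03-27 13:42:05'
UNION ALL SELECT 3, 76547946, TIMESTAMP '2018-03-27 13:43:05'
UNION ALL SELECT 3, 76547946, TIMESTAMP '2018-03-27 13:20:05'
)
SELECT
  user_id
, incoming_number
, date_time
  -- 0 for the smallest number, +1 for every further distinct number
, DENSE_RANK() OVER(PARTITION BY user_id ORDER BY incoming_number) - 1 AS count_in_nbr_minus1
FROM input
ORDER BY user_id, incoming_number, date_time;
```

On these four rows the new column should come out as 0, 1, 1, 2, matching the values shown for user 3 above.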

Finally, all we need to do is filter for a duration of at most 10 minutes and 3 or more distinct incoming_number-s:

WITH
input(         user_id,email            ,home_phone,incoming_number,date_time) AS (
          SELECT     1,'dan@dan.com'    , 8893432  , 5453455       ,TIMESTAMP '2018-03-27 13:48:10'
UNION ALL SELECT     1,'dan@dan.com'    , 8893432  ,65765489       ,TIMESTAMP '2018-03-27 13:47:10'
UNION ALL SELECT     1,'dan@dan.com'    , 8893432  ,65765489       ,TIMESTAMP '2018-03-27 13:48:05'
UNION ALL SELECT     2,'sam@sam.com'    ,16568675  ,65658403       ,TIMESTAMP '2018-03-27 13:46:05'
UNION ALL SELECT     2,'sam@sam.com'    ,16568675  ,57575748       ,TIMESTAMP '2018-03-27 13:32:05'
UNION ALL SELECT     2,'sam@sam.com'    ,16568675  ,76547946       ,TIMESTAMP '2018-03-27 13:43:05'
UNION ALL SELECT     3,'allen@allen.com',12345678  ,85768576       ,TIMESTAMP '2018-03-27 13:46:05'
UNION ALL SELECT     3,'allen@allen.com',12345678  ,65658403       ,TIMESTAMP '2018-03-27 13:42:05'
UNION ALL SELECT     3,'allen@allen.com',12345678  ,76547946       ,TIMESTAMP '2018-03-27 13:43:05'
UNION ALL SELECT     3,'allen@allen.com',12345678  ,76547946       ,TIMESTAMP '2018-03-27 13:20:05'
)
,
w_filter_val AS (
SELECT
  *
, date_time - LAG(date_time,2) OVER(PARTITION BY user_id ORDER BY date_time) AS time4these3
, CONDITIONAL_CHANGE_EVENT(incoming_number) OVER(PARTITION BY user_id ORDER BY incoming_number) AS count_in_nbr_minus1
FROM input
)
SELECT * FROM w_filter_val WHERE time4these3 <= '10 MINUTES' AND count_in_nbr_minus1 + 1 >= 3
;

 user_id |      email      | home_phone | incoming_number |      date_time      | time4these3 | count_in_nbr_minus1 
---------+-----------------+------------+-----------------+---------------------+-------------+---------------------
       3 | allen@allen.com |   12345678 |        85768576 | 2018-03-27 13:46:05 | 00:04       |                   2
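
One caveat worth checking on real data: time4these3 is computed over three consecutive rows ordered by date_time, while count_in_nbr_minus1 is computed over the whole partition ordered by incoming_number, so the two filters do not necessarily describe the same three calls. The sketch below is an alternative that is not from the original answer. It uses a self-join so that the distinct count is taken only over the calls inside each 10-minute window; the alias names a, b, w and the column names window_start and distinct_numbers are made up for illustration.

```sql
-- Sketch: treat every call as the start of a 10-minute window and count
-- the distinct incoming numbers of the same user inside that window.
WITH
input(user_id, email, home_phone, incoming_number, date_time) AS (
          SELECT 2,'sam@sam.com'    ,16568675,65658403,TIMESTAMP '2018-03-27 13:46:05'
UNION ALL SELECT 2,'sam@sam.com'    ,16568675,57575748,TIMESTAMP '2018-03-27 13:32:05'
UNION ALL SELECT 2,'sam@sam.com'    ,16568675,76547946,TIMESTAMP '2018-03-27 13:43:05'
UNION ALL SELECT 3,'allen@allen.com',12345678,85768576,TIMESTAMP '2018-03-27 13:46:05'
UNION ALL SELECT 3,'allen@allen.com',12345678,65658403,TIMESTAMP '2018-03-27 13:42:05'
UNION ALL SELECT 3,'allen@allen.com',12345678,76547946,TIMESTAMP '2018-03-27 13:43:05'
UNION ALL SELECT 3,'allen@allen.com',12345678,76547946,TIMESTAMP '2018-03-27 13:20:05'
)
,
w AS (
SELECT
  a.user_id
, a.email
, a.home_phone
, a.date_time AS window_start
, COUNT(DISTINCT b.incoming_number) AS distinct_numbers
FROM input a
JOIN input b
  ON b.user_id    = a.user_id
 AND b.date_time >= a.date_time
 AND b.date_time <= a.date_time + INTERVAL '10 minutes'  -- window is inclusive here
GROUP BY
  a.user_id
, a.email
, a.home_phone
, a.date_time
)
SELECT DISTINCT user_id, email, home_phone
FROM w
WHERE distinct_numbers >= 3;
```

With the sample rows above, only the window starting at 13:42:05 for user 3 contains three distinct numbers, so the query should return the single triple (3, allen@allen.com, 12345678). The self-join is quadratic per user, so on a large call table it may be considerably slower than the OLAP version.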