Question

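I'm trying to associate users with each other and assign website visitors a common ID.

I have rows (call the table a) with a.UUID, a.seen_time, a.ip_address, a.user_id, a.subdomain, and I'm trying to come up with an a.matched_id such that, if a row's IP address was seen within +/- 4 hours of the previous row (i.e. consecutively), those rows are assigned one matched_id.

Note that for my purposes, an IP on 2 different subdomains is not necessarily the same match, unless the rows have the same user ID.

This is the basic process I would follow in a regular programming language (but I need to construct SQL):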

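Get the necessary table a
For each row, if any rows ever have a matching user_id (subdomain doesn't matter), assign them the same matched_id (all else being equal, let's use MIN(uuid))
Partition into subdomain sets.

For each subdomain partition:
- Now partition into IP address buckets where each row is within 4 hours of the seen_time of the row before (/after) it (i.e. row by row)

  For each IP address partition:
  - If any 1 item already has a matched_id, assign it to all. Otherwise, assign a new matched_id to all. Continue.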
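
The row-by-row bucketing described in the steps above could be sketched with window functions instead of recursion. This is only a sketch under assumptions: it is written in Postgres syntax against the table a and the columns listed earlier, it treats a gap of exactly 4 hours as still belonging to the same run, and the names starts_new_run, run_no and candidate_matched_id are made up for illustration. It does not cover the user_id step, and it has not been tested on Redshift.

```sql
-- Sketch only: assumes a(uuid, seen_time, ip_address, subdomain) as described above.
WITH flagged AS (
    SELECT uuid, ip_address, subdomain, seen_time,
           -- 1 when the row starts a new run: either there is no earlier row
           -- for this subdomain + IP, or the previous row is more than 4 hours older
           CASE
               WHEN seen_time - LAG(seen_time) OVER (
                        PARTITION BY subdomain, ip_address
                        ORDER BY seen_time
                    ) <= INTERVAL '4 hours'
               THEN 0
               ELSE 1
           END AS starts_new_run
    FROM a
),
numbered AS (
    SELECT uuid, ip_address, subdomain, seen_time,
           -- running count of run starts = run number within the subdomain + IP
           SUM(starts_new_run) OVER (
               PARTITION BY subdomain, ip_address
               ORDER BY seen_time
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS run_no
    FROM flagged
)
SELECT uuid, seen_time,
       -- every row of a run shares the smallest uuid in that run
       MIN(uuid) OVER (PARTITION BY subdomain, ip_address, run_no) AS candidate_matched_id
FROM numbered
ORDER BY subdomain, ip_address, seen_time;
```

Because each row is compared only with the row immediately before it, a chain of rows spaced less than 4 hours apart stays in one run no matter how long the chain gets.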

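I'm using Amazon Redshift, which queries more or less like Postgres but with some limitations (see unsupported features and unsupported functions if interested): Postgres / ANSI SQL answers are accepted.

How can I construct this query in an efficient way?

What is the basic SQL process I would have to follow?

Thanks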

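- UPDATE -

I have made the following progress:

I have no idea how efficient it is
I've used discovery_time instead of the seen_time referenced above, and the table mydata instead of a, though it is sometimes aliased in the queries
It uses MD5 instead of MIN(uuid), since I believe getting that would require another query; either way, it doesn't matter much
Key problem: it does not measure '+/- 4 hours from the last row' but works in absolute terms instead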

Code:

--UPDATE mydata m SET matched_id = NULL; --for testing

WITH cte1 AS (
    --start with the max discovery time and go down from there
    --select the matched id if one already exists
    SELECT m.ip, m.subdomain, MAX(m.discovery_time) AS max_discovery_time, 
        CASE WHEN MIN(m.user_id) IS NOT NULL THEN MD5(MIN(m.user_id)) 
        ELSE MIN(m.matched_id) END AS known_matched_id
    FROM mydata m
    GROUP BY m.ip, m.subdomain

    ), cte2 AS (

    SELECT m.uuid, CASE WHEN c.known_matched_id IS NOT NULL THEN c.known_matched_id 
        ELSE MD5(CONCAT(c.ip, c.subdomain, c.max_discovery_time)) END AS matched_id
    FROM mydata m 
    --IP on different subdomains are not necessarily the same match
    RIGHT OUTER JOIN cte1 c ON CONCAT(c.ip, c.subdomain) = CONCAT(m.ip, m.subdomain) 
    WHERE m.discovery_time >= (c.max_discovery_time - INTERVAL '4 hours')
    --Does not work 'row by row' instead in terms of absolutes - need to make this recursive somehow,
    --but Redshift does not support recursive CTEs or user-defined functions
)

UPDATE mydata m
SET matched_id = c.matched_id
FROM cte2 c
WHERE c.uuid = m.uuid;

--view result for an example IP
SELECT m.discovery_time, m.ip, m.matched_id, m.uuid 
FROM mydata m
WHERE m.ip = '12.34.56.78'
ORDER BY m.ip, m.discovery_time;

If you want to test, the following create script should do it for you:

CREATE TABLE mydata
(
  ip character varying(255),
  subdomain character varying(255),
  matched_id character varying(255),
  user_id character varying(255),
  uuid character varying(255) NOT NULL,
  discovery_time timestamp without time zone,
  CONSTRAINT pk_mydata PRIMARY KEY (uuid)
);

-- should all get the same matched_id in result, except the 1st
INSERT INTO mydata (ip, subdomain, matched_id, user_id, uuid, discovery_time) VALUES ('12.34.56.78', 'sub1', NULL, NULL, '222b5991-9780-11e3-9304-127b2ab15ea7', '2014-02-14 00:03:26');
INSERT INTO mydata (ip, subdomain, matched_id, user_id, uuid, discovery_time) VALUES ('12.34.56.78', 'sub1', NULL, NULL, '333b5991-9780-11e3-9304-127b2ab15ea7', '2014-02-16 22:22:26');
INSERT INTO mydata (ip, subdomain, matched_id, user_id, uuid, discovery_time) VALUES ('12.34.56.78', 'sub1', NULL, NULL, '379b641b-9782-11e3-9304-127b2ab15ea7', '2014-02-17 03:18:48');
INSERT INTO mydata (ip, subdomain, matched_id, user_id, uuid, discovery_time) VALUES ('12.34.56.78', 'sub1', NULL, NULL, 'ac0f6416-977e-11e3-9304-127b2ab15ea7', '2014-02-17 02:53:25');
INSERT INTO mydata (ip, subdomain, matched_id, user_id, uuid, discovery_time) VALUES ('12.34.56.78', 'sub1', NULL, NULL, '11fb5991-9780-11e3-9304-127b2ab15ea7', '2014-02-17 03:03:26');
INSERT INTO mydata (ip, subdomain, matched_id, user_id, uuid, discovery_time) VALUES ('12.34.56.78', 'sub1', NULL, NULL, '849d8d61-9781-11e3-9304-127b2ab15ea7', '2014-02-17 03:13:48');

The expected output would then assign the same matched_id to all of these rows except the first (in INSERT order), because its time is more than 4 hours before the next most recent seen time (and it has no user_id to match with any of the others).
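
To check which of the sample rows really are within 4 hours of their predecessor, the gap to the previous row can be listed directly. This is a minimal sketch in Postgres syntax against the mydata table from the create script; the column name gap_from_previous is made up for illustration.

```sql
-- Sketch only: lists each sample row with the time elapsed since the
-- previous row for the same IP and subdomain (NULL for the earliest row).
SELECT m.uuid,
       m.discovery_time,
       m.discovery_time - LAG(m.discovery_time) OVER (
           PARTITION BY m.ip, m.subdomain
           ORDER BY m.discovery_time
       ) AS gap_from_previous
FROM mydata m
WHERE m.ip = '12.34.56.78'
ORDER BY m.discovery_time;
```

Note that, on the rows as listed, the second row (2014-02-16 22:22:26) is 4 hours 30 minutes 59 seconds before the next one (2014-02-17 02:53:25), so a strict 4-hour threshold would split that row off as well as the first.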

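- UPDATE 2 -

Still not much luck with consecutive row-by-row results. This version seems to work that way if it is run repeatedly, but:
I'm interested in improving efficiency
New columns min_time and max_time represent the minimum and maximum times in the 4-hour set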


Answer 1

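I'm not sure I understand the question, but from "if the row's IP address is within +/- 4 hours of the last" it appears you need the "last" time for each IP address (or IP + UUID, not sure). You get that from

select ip_address, max(seen_time) from a group by ip_address

You can create a virtual table from that or use a correlated subquery, see next.

I'm not a Postgres user, but there is surely a function to measure the number of hours. As a sketch,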

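select * from a as outer_a
where exists (
    -- match on ip_address only: uuid is unique per row, so also matching
    -- on it would compare each row only with itself
    select 1 from a as inner_a
    where inner_a.ip_address = outer_a.ip_address
    group by inner_a.ip_address
    -- compare full timestamps; an hour-of-day difference breaks across midnight
    having max(inner_a.seen_time) - outer_a.seen_time < interval '4 hours'
)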

HTH。

Answer 2

I suggest:

Add working columns to a: id_1, id_2, min_time, max_time

Update id_1 to the min(uuid) of all records with the same user_id. Like this:

 -- match any records with a user_id
 update a 
 set id_1 = x.uuid 
 from (   
        select min(uuid) as uuid, user_id 
        from a where user_id is not null group by user_id ) as x
 where a.user_id = x.user_id

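Update the columns min_time and max_time to seen_time minus/plus 4 hours. You could do all of this in the next query, but if you reuse these values later, it is more efficient to calculate them only once.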

update a 
set min_time = seen_time - interval '4 hour'
,   max_time = seen_time + interval '4 hour'

Join the table to itself, matching records on ip and subdomain where a.seen_time is within 4 hours of the other record. For example:

update a 
set id_2 = other_uuid
from ( 

    -- join a onto all matching records by ip and subdomain
    -- where a.seen_time within 4 hours of the other record.
    select a.uuid, min(other.uuid) as other_uuid 
    from a 
    inner join a AS other
    on a.ip_address = other.ip_address
    and a.subdomain = other.subdomain
    and a.uuid <> other.uuid
    where a.seen_time > other.min_time
    and a.seen_time < other.max_time
    group by a.uuid
) AS matching 
where a.uuid = matching.uuid
-- no need to match ones already matched on userid
and id_1 is null

Now id_1 and id_2 combined are what you want.
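
One way the two working columns might be combined is with COALESCE, preferring the user_id-based match. This is only a sketch: it assumes the matched_id column from the question exists on a, and falling back to the row's own uuid when neither id_1 nor id_2 is set is an assumption made here, not something stated above.

```sql
-- Sketch only: prefer the user_id match (id_1), then the IP/time match (id_2),
-- and otherwise (an assumption) let the row keep its own uuid as matched_id.
UPDATE a
SET matched_id = COALESCE(id_1, id_2, uuid);
```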

Associating users with each other in a SQL query