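I am trying to associate users with each other and assign a common ID to website visitors.
I have rows (call the table a) with a.UUID, a.seen_time, a.ip_address, a.user_id, a.subdomain, and I am trying to come up with an a.matched_id.
If a row's IP address was seen within +/- 4 hours of the previous one (i.e. the rows are contiguous), assign one matched_id to those rows.
Note that, for my purposes, an IP on 2 different subdomains is not necessarily the same match unless the rows have the same user ID.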
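Here is the basic process I would follow in a regular programming language (however, I need to construct SQL):
1. Assign a matched_id (all else being equal, let's use MIN(uuid)).
2. Partition into subdomain sets.
3. For each subdomain partition:
   a. Partition into IP address buckets in which each row is within 4 hours of the seen_time of the previous (/next) row, i.e. row by row.
   b. For each IP address bucket: if one row has a matched_id, assign it to all rows. Otherwise, assign a new matched_id to all rows. Continue.
I am using Amazon Redshift, which queries more or less like Postgres but with some limitations (see unsupported features and unsupported functions if interested); Postgres / ANSI SQL answers are accepted.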
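The procedure described above can be sketched in Python to pin down what "row by row" means before translating it to SQL. This is only an illustration under assumptions that are not in the question: the function name `assign_matched_ids`, the sample rows, and the tie-breaking rules (rows sharing a user_id take the smallest uuid among them; a bucket with several existing ids takes the smallest one; rows that already have an id keep it) are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

GAP = timedelta(hours=4)  # maximum distance between consecutive rows


def assign_matched_ids(rows):
    """Return {uuid: matched_id} for dict rows with the keys
    uuid, seen_time, ip_address, user_id, subdomain."""
    matched = {}

    # Rows sharing a user_id get the smallest uuid among them (assumption).
    by_user = defaultdict(list)
    for r in rows:
        if r["user_id"] is not None:
            by_user[r["user_id"]].append(r["uuid"])
    for uuids in by_user.values():
        for u in uuids:
            matched[u] = min(uuids)

    # Partition by (subdomain, ip_address).
    partitions = defaultdict(list)
    for r in rows:
        partitions[(r["subdomain"], r["ip_address"])].append(r)

    for part in partitions.values():
        part.sort(key=lambda r: r["seen_time"])
        # Walk in time order; a gap of more than 4 hours to the
        # previous row starts a new bucket.
        buckets, prev = [], None
        for r in part:
            if prev is None or r["seen_time"] - prev > GAP:
                buckets.append([])
            buckets[-1].append(r)
            prev = r["seen_time"]
        for bucket in buckets:
            known = [matched[r["uuid"]] for r in bucket if r["uuid"] in matched]
            # Reuse an existing id if the bucket has one, else MIN(uuid).
            bucket_id = min(known) if known else min(r["uuid"] for r in bucket)
            for r in bucket:
                matched.setdefault(r["uuid"], bucket_id)
    return matched


def row(uuid, hour, minute, ip, subdomain, user_id=None):
    # Hypothetical sample row; all times fall on one day for readability.
    return {"uuid": uuid, "seen_time": datetime(2014, 2, 17, hour, minute),
            "ip_address": ip, "subdomain": subdomain, "user_id": user_id}


rows = [
    row("u1", 0, 0, "12.34.56.78", "sub1"),
    row("u2", 3, 0, "12.34.56.78", "sub1"),    # 3h after u1
    row("u3", 6, 30, "12.34.56.78", "sub1"),   # 3.5h after u2: chained
    row("u4", 12, 0, "12.34.56.78", "sub1"),   # 5.5h after u3: new bucket
    row("u5", 3, 0, "12.34.56.78", "sub2"),    # same IP, other subdomain
    row("u6", 1, 0, "98.76.54.32", "sub2", user_id="alice"),
    row("u7", 12, 30, "12.34.56.78", "sub1", user_id="alice"),
]
result = assign_matched_ids(rows)
```

In this sketch u1 is grouped with u3 even though they are 6.5 hours apart, because each consecutive pair is within 4 hours; that chaining is the part a single comparison against MAX(seen_time) cannot express.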
How can I construct this query efficiently?
What basic SQL flow should I follow?
Thanks.
- Update -
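I have made the following progress. Notes:
I used discovery_time instead of seen_time and the table mydata instead of a (aliased as m in the query).
I did not use MIN(uuid), because I believe getting that information would require another query; either way, it is not important.
Code (if you want to test it, the create script comes after the query):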
--UPDATE mydata m SET matched_id = NULL; --for testing
WITH cte1 AS (
--start with the max discovery time and go down from there
--select the matched id if one already exists
SELECT m.ip, m.subdomain, MAX(m.discovery_time) AS max_discovery_time,
CASE WHEN MIN(m.user_id) IS NOT NULL THEN MD5(MIN(m.user_id))
ELSE MIN(m.matched_id) END AS known_matched_id
FROM mydata m
GROUP BY m.ip, m.subdomain
), cte2 AS (
SELECT m.uuid, CASE WHEN c.known_matched_id IS NOT NULL THEN c.known_matched_id
ELSE MD5(CONCAT(c.ip, c.subdomain, c.max_discovery_time)) END AS matched_id
FROM mydata m
--IP on different subdomains are not necessarily the same match
RIGHT OUTER JOIN cte1 c ON CONCAT(c.ip, c.subdomain) = CONCAT(m.ip, m.subdomain)
WHERE m.discovery_time >= (c.max_discovery_time - INTERVAL '4 hours')
--Does not work 'row by row' instead in terms of absolutes - need to make this recursive somehow,
--but Redshift does not support recursive CTEs or user-defined functions
)
UPDATE mydata m
SET matched_id = c.matched_id
FROM cte2 c
WHERE c.uuid = m.uuid;
--view result for an example IP
SELECT m.discovery_time, m.ip, m.matched_id, m.uuid
FROM mydata m
WHERE m.ip = '12.34.56.78'
ORDER BY m.ip, m.discovery_time;
Create script for testing:
CREATE TABLE mydata
(
ip character varying(255),
subdomain character varying(255),
matched_id character varying(255),
user_id character varying(255),
uuid character varying(255) NOT NULL,
discovery_time timestamp without time zone,
CONSTRAINT pk_mydata PRIMARY KEY (uuid)
);
-- should all get the same matched_id in result, except the 1st
INSERT INTO mydata (ip, subdomain, matched_id, user_id, uuid, discovery_time) VALUES ('12.34.56.78', 'sub1', NULL, NULL, '222b5991-9780-11e3-9304-127b2ab15ea7', '2014-02-14 00:03:26');
INSERT INTO mydata (ip, subdomain, matched_id, user_id, uuid, discovery_time) VALUES ('12.34.56.78', 'sub1', NULL, NULL, '333b5991-9780-11e3-9304-127b2ab15ea7', '2014-02-16 22:22:26');
INSERT INTO mydata (ip, subdomain, matched_id, user_id, uuid, discovery_time) VALUES ('12.34.56.78', 'sub1', NULL, NULL, '379b641b-9782-11e3-9304-127b2ab15ea7', '2014-02-17 03:18:48');
INSERT INTO mydata (ip, subdomain, matched_id, user_id, uuid, discovery_time) VALUES ('12.34.56.78', 'sub1', NULL, NULL, 'ac0f6416-977e-11e3-9304-127b2ab15ea7', '2014-02-17 02:53:25');
INSERT INTO mydata (ip, subdomain, matched_id, user_id, uuid, discovery_time) VALUES ('12.34.56.78', 'sub1', NULL, NULL, '11fb5991-9780-11e3-9304-127b2ab15ea7', '2014-02-17 03:03:26');
INSERT INTO mydata (ip, subdomain, matched_id, user_id, uuid, discovery_time) VALUES ('12.34.56.78', 'sub1', NULL, NULL, '849d8d61-9781-11e3-9304-127b2ab15ea7', '2014-02-17 03:13:48');
The expected output then assigns the same matched_id to all of these rows except the first one (in the INSERT rows), because its time is more than 4 hours away from the next nearest seen time (and it has no user_id matching any other row).
- Update 2 -
min_time and max_time represent the minimum and maximum times in the 4-hour set.
Code:
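Answer 0 (score: 3)
I'm not sure I understand the question, but from "if a row's IP address is within +/- 4 hours of the last one" it seems you need the "last" time for each IP address (or IP + UUID, not sure). You get that from
select ip_address, max(seen_time) from a group by ip_address
You can create a virtual table from that or use a correlated subquery; see the next step.
I'm not a Postgres user, but there is surely a way to measure hours. As a sketch,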
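select * from a as o
where exists (
    -- distinct aliases: unquoted a and A are the same identifier in Postgres
    select 1 from a as i
    where i.ip_address = o.ip_address
    and i.UUID = o.UUID
    group by i.ip_address, i.UUID
    -- compare timestamps directly instead of subtracting hour-of-day values
    having max(i.seen_time) < o.seen_time + interval '4 hours'
)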
HTH。
Answer 1 (score: 2)
I would suggest:
Add working columns to a: id_1, id_2, min_time, max_time.
Update id_1 to min(uuid) for all records that share the same user_id. Like this:
-- match any records that have a user_id
-- (Postgres-style UPDATE ... FROM: the target table is not repeated in FROM)
update a
set id_1 = x.uuid
from (
    select min(uuid) as uuid, user_id
    from a
    where user_id is not null
    group by user_id
) as x
where a.user_id = x.user_id
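Update the columns min_time and max_time to seen_time minus/plus 4 hours. You could do all of this inside the next query, but if you reuse these values later it is more efficient to compute them only once.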
update a
set min_time = seen_time - interval '4 hour'
, max_time = seen_time + interval '4 hour'
Join the table onto itself, matching records on ip_address and subdomain where a.seen_time is within 4 hours of another record. For example:
update a
set id_2 = other_uuid
from (
-- join a onto all matching records by ip and subdomain
-- where a.seen_time within 4 hours of the other record.
select a.uuid, min(other.uuid) as other_uuid
from a
inner join a AS other
on a.ip_address = other.ip_address
and a.subdomain = other.subdomain
and a.uuid <> other.uuid
where a.seen_time > other.min_time
and a.seen_time < other.max_time
group by a.uuid
) AS matching
where a.uuid = matching.uuid
-- no need to match ones already matched on userid
and id_1 is null
Now id_1 and id_2 combined are what you want.
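One way to read "combined" is to take id_1 when it is set and fall back to id_2. The sketch below runs that as plain SQL through Python's sqlite3 module so it can be tried without a cluster. The precedence of id_1 over id_2, the final fallback to the row's own uuid, and the sample rows are assumptions, not something the answer specifies; the UPDATE uses only the standard COALESCE function, so it should carry over to Postgres or Redshift.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE a (uuid TEXT PRIMARY KEY, id_1 TEXT, id_2 TEXT, matched_id TEXT)"
)
# Hypothetical state after the earlier updates have filled id_1 and id_2.
conn.executemany(
    "INSERT INTO a (uuid, id_1, id_2) VALUES (?, ?, ?)",
    [
        ("u1", "u1", None),   # matched through user_id
        ("u2", None, "u1"),   # matched through ip_address + subdomain + time
        ("u3", None, None),   # matched to nothing
    ],
)

# Prefer the user_id match, then the IP match, then the row's own uuid
# (the last fallback is an assumption so that no row is left NULL).
conn.execute("UPDATE a SET matched_id = COALESCE(id_1, id_2, uuid)")

combined = dict(conn.execute("SELECT uuid, matched_id FROM a ORDER BY uuid"))
```

Dropping `uuid` from the COALESCE call would leave unmatched rows with a NULL matched_id instead.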