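I have a data set that was created by matching similar sub-items together and then grouping those similar items by "category".
Now the resulting categories have to be matched in such a way that related categories are grouped together within each "group_id". In the example below, one match is A -> B -> C -> D -> E -> F -> G, which is obtained by recursing through the rows.
I have posted my current answer, which works on this simple data set. However, the real data set contains up to 1M rows, with up to 60 categories per "group_id", and on that data the query fails with an "out of spool space" error.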
Please note:
Correct answer
Sample input:
Desired output:
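Answer 0: (score: 1)
You need a recursive approach, but your WITH RECURSIVE creates a huge intermediate result, which is what runs out of spool space.
For a similar process I used the following approach (originally with a WHILE loop in a stored procedure):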
CREATE MULTISET VOLATILE TABLE vt_tmp, NO Log AS
(
SELECT group_id, category_1, category_2,
-- assign a unique number to each (group_id, category_1)
Dense_Rank() Over (ORDER BY group_id, category_1) AS rnk
FROM match_detail
-- remove when your source data is unique
GROUP BY 1,2,3 -- same result as a DISTINCT, but processed before DENSE_RANK
)
WITH DATA
PRIMARY INDEX (category_2)
ON COMMIT PRESERVE ROWS;
Now repeat the following update until it reports "0 rows processed":
-- find matching categories and assign them a common number
UPDATE vt_tmp FROM
( SELECT e2.group_id, e2.category_1, Min(e1.rnk) AS minrnk
FROM vt_tmp e1 JOIN vt_tmp e2
ON e1.category_2 = e2.category_2
AND e1.rnk < e2.rnk
GROUP BY e2.group_id, e2.category_1
) x
SET rnk = minrnk
WHERE
vt_tmp.group_id = x.group_id
AND vt_tmp.category_1 = x.category_1
;
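To make the loop easier to follow outside the database, here is a minimal Python sketch of the same idea: every (group_id, category_1) starts with its own number, and each pass pulls it down to the smallest number held by any row sharing a category_2, until a pass changes nothing. The names `expand`, `sample_rows` and `propagate_min_rank` are hypothetical helpers for illustration, not part of the answer. Like the UPDATE, the match is on category_2 only.

```python
def expand(group_id, nodes, pairs):
    """Build (group_id, category_1, category_2) rows:
    one self-edge per node plus both directions of each pair."""
    rows = [(group_id, n, n) for n in nodes]
    for a, b in pairs:
        rows += [(group_id, a, b), (group_id, b, a)]
    return rows


# Same shape as the sample match_detail rows: a chain A..G,
# a star around W, and the small group L / M-N.
sample_rows = (
    expand(1, "ABCDEFG", zip("ABCDEF", "BCDEFG"))
    + expand(1, "WXYZ", [("W", "X"), ("W", "Y"), ("W", "Z")])
    + expand(2, "LMN", [("M", "N")])
)


def propagate_min_rank(rows):
    """Return ({(group_id, category_1): label}, number_of_passes)."""
    rows = list(rows)
    # Initial label: dense rank over (group_id, category_1).
    keys = sorted({(g, c1) for g, c1, _ in rows})
    rnk = {key: i + 1 for i, key in enumerate(keys)}

    passes = 0
    while True:
        passes += 1
        # Smallest label currently attached to each category_2 value.
        best = {}
        for g, c1, c2 in rows:
            label = rnk[(g, c1)]
            best[c2] = min(best.get(c2, label), label)
        # Keys that can be lowered through a shared category_2.
        updated = {}
        for g, c1, c2 in rows:
            key = (g, c1)
            if best[c2] < rnk[key]:
                updated[key] = min(updated.get(key, best[c2]), best[c2])
        if not updated:  # the "0 rows processed" stop condition
            return rnk, passes
        rnk.update(updated)


labels, passes = propagate_min_rank(sample_rows)
```

On the sample rows this ends with one label for A to G, one for W to Z, one for L, and one shared by M and N. The number of passes grows with the length of the longest chain, which is why the update has to be repeated rather than run once.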
To get the related categories you finally need:
SELECT group_id, category_1 AS category, rnk AS related_categories
FROM vt_tmp
UNION
SELECT group_id, category_2, rnk
FROM vt_tmp
To match the expected result exactly you need to add a DENSE_RANK:
SELECT group_id, category, Dense_Rank() Over (PARTITION BY group_id ORDER BY related_categories)
FROM
(
SELECT group_id, category_1 AS category, rnk AS related_categories
FROM vt_tmp
UNION
SELECT group_id, category_2, rnk
FROM vt_tmp
) AS dt
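The renumbering step can be sketched in Python as follows. This is an illustration only, assuming the labels arrive as a dictionary keyed by (group_id, category); `dense_rank_per_group` is a hypothetical name. It maps the arbitrary labels left by the update loop to 1, 2, 3, ... within each group.

```python
def dense_rank_per_group(labels):
    """labels: {(group_id, category): component_label}.
    Returns the same keys with labels renumbered 1..k inside each group."""
    # Collect the distinct labels that occur in each group.
    by_group = {}
    for (group_id, _), label in labels.items():
        by_group.setdefault(group_id, set()).add(label)
    # Position of each label in sorted order is its dense rank.
    position = {
        group_id: {label: i + 1 for i, label in enumerate(sorted(found))}
        for group_id, found in by_group.items()
    }
    return {
        (group_id, cat): position[group_id][label]
        for (group_id, cat), label in labels.items()
    }


raw = {(1, "A"): 1, (1, "B"): 1, (1, "W"): 8, (2, "L"): 12, (2, "M"): 13}
ranked = dense_rank_per_group(raw)
```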
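Answer 1: (score: 1)
The idea of this solution is to avoid cycles while traversing the edges.
This is done by computing a bitmap of the path on the fly and never adding an edge whose category_2 is already in the bitmap.
Every path starts with a self-referencing edge (e.g. 'A' - 'A') in order to obtain the bitmap of the first node.
This solution reduces the number of result records of the iterative subquery (edges) to 70.
The limitation of this solution is that the bitmap size has to be defined in advance.
Each digit in the varbinary literal represents 4 bits:
64 bits = '0000000000000000'xb
128 bits = '00000000000000000000000000000000'xb
etc.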
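As a rough illustration of the sizing rule, the Python sketch below builds a zero literal of a given width and mimics what SETBIT and GETBIT do to a bitmap. The helper `zero_bitmap_literal` is hypothetical, the multiple-of-8 check is an assumption (whole bytes), and plain Python integers stand in for the byte value.

```python
def zero_bitmap_literal(bits):
    """Hex byte literal with `bits` zero bits (each hex digit is 4 bits)."""
    if bits <= 0 or bits % 8:
        # Assumption: the literal is built from whole bytes.
        raise ValueError("bits must be a positive multiple of 8")
    return "'" + "0" * (bits // 4) + "'xb"


def setbit(bitmap, position):
    """Return the bitmap with the bit at `position` switched on."""
    return bitmap | (1 << position)


def getbit(bitmap, position):
    """Return 1 if the bit at `position` is on, else 0."""
    return (bitmap >> position) & 1


# A path that has visited categories number 0 and 3.
path = setbit(setbit(0, 0), 3)
already_seen = getbit(path, 3)   # 1: an edge back to category 3 is skipped
not_seen_yet = getbit(path, 2)   # 0: an edge to category 2 may be followed
```

With up to 60 categories per group_id, the 64-bit literal would be wide enough, since each category needs one bit position.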
with category_group_bitmap (group_id,category_1,bitflag_group)
as
(
select group_id
,category_1
,dense_rank () over
(
partition by group_id
order by sum (distinct category_2_n)
) as bitflag_group
from edges
group by group_id
,category_1
)
,recursive edges (n,group_id,category_1,category_2,categories_num,bitmap,category_2_n)
as
(
select 1 as n
,m.group_id
,m.category_1
,m.category_2
,gc.categories_num
,setbit ('0000000000000000'xb,m.category_2_n) as bitmap
,m.category_2_n
from match_detail_category_2_n as m
join group_categories_num as gc
on gc.group_id = m.group_id
where m.category_1 = m.category_2
union all
select e.n + 1 as n
,e.group_id
,e.category_1
,m.category_2
,e.categories_num
,setbit (e.bitmap,m.category_2_n) as bitmap
,m.category_2_n
from edges as e
join match_detail_category_2_n as m
on m.group_id = e.group_id
and m.category_1 = e.category_2
where e.n < e.categories_num - 1
and getbit (e.bitmap,m.category_2_n) = 0
)
,match_detail_category_2_n (group_id,category_1,category_2,category_2_n)
as
(
select m.group_id
,category_1
,category_2
,cast
(
dense_rank () over
(
partition by group_id
order by category_2
) - 1
as byteint
)
from match_detail as m
)
,group_categories_num (group_id,categories_num)
as
(
select group_id
,count (distinct category_1)
from match_detail
group by group_id
)
select *
from category_group_bitmap
;
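The traversal in the recursive edges query can be mimicked in Python to check the row count. This is a sketch under assumptions: `expand`, `sample_rows`, `walk_edges` and `reached` are hypothetical names, integers replace the byte bitmap, and the rows are generated to have the same shape as the sample match_detail data (chain A to G, star around W, L, and M-N).

```python
def expand(group_id, nodes, pairs):
    """One self-edge per node plus both directions of each pair."""
    rows = [(group_id, n, n) for n in nodes]
    for a, b in pairs:
        rows += [(group_id, a, b), (group_id, b, a)]
    return rows


sample_rows = (
    expand(1, "ABCDEFG", zip("ABCDEF", "BCDEFG"))
    + expand(1, "WXYZ", [("W", "X"), ("W", "Y"), ("W", "Z")])
    + expand(2, "LMN", [("M", "N")])
)


def walk_edges(rows):
    """Return tuples (n, group_id, start, last, bitmap), one per path."""
    rows = list(rows)
    bit, limit = {}, {}
    for g in {g for g, _, _ in rows}:
        # Bit position of a category: dense rank of category_2 minus 1.
        cats = sorted({c2 for gg, _, c2 in rows if gg == g})
        for i, c in enumerate(cats):
            bit[(g, c)] = i
        # categories_num: distinct category_1 values in the group.
        limit[g] = len({c1 for gg, c1, _ in rows if gg == g})

    # Seed: self-referencing edges give each start node its first bitmap.
    frontier = [(1, g, c1, c2, 1 << bit[(g, c2)])
                for g, c1, c2 in rows if c1 == c2]
    result = list(frontier)
    while frontier:
        step = []
        for n, g, start, last, bitmap in frontier:
            if n >= limit[g] - 1:
                continue  # same depth guard as the query
            for g2, c1, c2 in rows:
                seen = (bitmap >> bit[(g, c2)]) & 1 if g2 == g else 1
                if g2 == g and c1 == last and not seen:
                    step.append((n + 1, g, start, c2,
                                 bitmap | (1 << bit[(g, c2)])))
        result.extend(step)
        frontier = step
    return result


def reached(result):
    """Map (group_id, start) to the set of categories its paths end at."""
    out = {}
    for _, g, start, last, _ in result:
        out.setdefault((g, start), set()).add(last)
    return out


paths = walk_edges(sample_rows)
members = reached(paths)
```

On these rows the sketch produces 70 path records, which is consistent with the figure quoted in the answer: each start node reaches every category of its component exactly once because no path revisits a category.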
Answer 2: (score: 0)
This works, but it causes an "out of spool space" problem on the real data.
Schema creation:
CREATE VOLATILE TABLE match_detail (
group_id bigint
, category_1 varchar(255)
, category_2 varchar(255)
) PRIMARY INDEX (group_id)
ON COMMIT PRESERVE ROWS;
INSERT INTO match_detail VALUES (1,'A','B');
INSERT INTO match_detail VALUES (1,'A','A');
INSERT INTO match_detail VALUES (1,'B','A');
INSERT INTO match_detail VALUES (1,'B','C');
INSERT INTO match_detail VALUES (1,'B','B');
INSERT INTO match_detail VALUES (1,'C','B');
INSERT INTO match_detail VALUES (1,'C','D');
INSERT INTO match_detail VALUES (1,'C','C');
INSERT INTO match_detail VALUES (1,'D','C');
INSERT INTO match_detail VALUES (1,'D','E');
INSERT INTO match_detail VALUES (1,'D','D');
INSERT INTO match_detail VALUES (1,'E','D');
INSERT INTO match_detail VALUES (1,'E','F');
INSERT INTO match_detail VALUES (1,'E','E');
INSERT INTO match_detail VALUES (1,'F','E');
INSERT INTO match_detail VALUES (1,'F','G');
INSERT INTO match_detail VALUES (1,'F','F');
INSERT INTO match_detail VALUES (1,'G','F');
INSERT INTO match_detail VALUES (1,'G','G');
INSERT INTO match_detail VALUES (1,'W','X');
INSERT INTO match_detail VALUES (1,'W','W');
INSERT INTO match_detail VALUES (1,'W','Y');
INSERT INTO match_detail VALUES (1,'W','Z');
INSERT INTO match_detail VALUES (1,'X','W');
INSERT INTO match_detail VALUES (1,'X','X');
INSERT INTO match_detail VALUES (1,'Y','W');
INSERT INTO match_detail VALUES (1,'Y','Y');
INSERT INTO match_detail VALUES (1,'Z','W');
INSERT INTO match_detail VALUES (1,'Z','Z');
INSERT INTO match_detail VALUES (2,'L','L');
INSERT INTO match_detail VALUES (2,'M','N');
INSERT INTO match_detail VALUES (2,'N','M');
INSERT INTO match_detail VALUES (2,'M','M');
INSERT INTO match_detail VALUES (2,'N','N');
Query:
WITH
related_cats AS (
SELECT
group_id
, category_1
, SUM(DISTINCT bitflag) As bitflag_total
, DENSE_RANK() OVER (
PARTITION BY
group_id
ORDER BY
group_id
, bitflag_total
) As bitflag_group
FROM bitflags
GROUP BY 1, 2
)
, bitflags As (
SELECT
DISTINCT
group_id
, category_1
, category_2
, CAST
(
2 ** (DENSE_RANK() OVER (
PARTITION BY
group_id
ORDER BY group_id
, category_2) - 1)
As bigint
) As bitflag
FROM cat_join
WHERE depth = 1
)
, RECURSIVE cat_join AS (
SELECT DISTINCT
c1.group_id
, c1.category_1
, c1.category_2
, CAST
(
n.num_categories - 1
As integer
) As max_depth
, CASE
WHEN c1.category_1 = c1.category_2 THEN 1
ELSE max_depth
END As depth
, 1 As recursion
FROM matches c1
INNER JOIN num_categories n
ON c1.group_id = n.group_id
UNION ALL
SELECT
r1.group_id
, r1.category_1
, r2.category_2
, r1.max_depth
, CASE
WHEN r1.category_1 = r1.category_2 THEN 1
WHEN r1.category_1 = r2.category_2 THEN 1
ELSE r1.depth - 1
END As cur_depth
, recursion + 1
FROM cat_join r1
INNER JOIN matches r2
ON r1.group_id = r2.group_id
AND r1.category_2 = r2.category_1
WHERE
r1.category_1 <> r2.category_2
AND r1.category_1 <> r2.category_1
AND
(r1.depth - 1) > 0
)
, matches AS (
SELECT
d.group_id
, d.category_1
, d.category_2
FROM match_detail d
GROUP BY d.group_id, d.category_1, d.category_2
)
, num_categories AS (
SELECT
u.group_id
, COUNT(DISTINCT u.category_2) AS num_categories
FROM categories u
GROUP BY u.group_id
)
, categories AS (
SELECT DISTINCT
u.group_id
, u.category_1
, u.category_2
FROM match_detail u
)
SELECT *
FROM related_cats