连接组件

时间:2016-10-23 15:15:54

标签: sql teradata ansi-sql

我有一组数据是通过将类似的子项匹配在一起创建的,然后按"类别"对这些类似的项目进行分组。

现在,必须以这样的方式匹配结果类别,即在每个" group_id"中将相关类别组合在一起。在下面的示例中,一个匹配是A-> B-> C-> D-> E-> F-> G,其通过递归行获得。

我发布了我的current answer,它适用于这个简单的数据集,但因为实际的数据集最多包含1M行,每个" group_id最多可能有60个类别"此查询导致"脱离假脱机空间"真实数据错误。

请注意:

  • 由于公司限制,我无法使用存储过程。
  • 我无法使用用户定义的函数(UDF)
  • 我无法使用用户定义的类型(UDT)

正确答案

  • 为Teradata或兼容
  • 编写
  • 比答案更有效率
  • 尊重我上面提到的限制
  

示例输入:

     

Sample Input

     

期望的输出:

     

Desired Output

3 个答案:

答案 0 :(得分:1)

你需要一个递归的apporach,但你的WITH RECURSIVE会产生一个巨大的中间结果,导致不再是假脱机

对于类似的过程,我使用了以下方法(最初在存储过程中使用WHILE-loop):

CREATE MULTISET VOLATILE TABLE vt_tmp, NO Log  AS
 (
  SELECT group_id, category_1, category_2, 
     -- assign a unique number to 
     Dense_Rank() Over (ORDER BY group_id, category_1) AS rnk

  -- remove when you source data is unique
  GROUP BY 1,2,3 -- same result as a DISTINCT, but processed before DENSE_RANK
  FROM match_detail 
 )
WITH DATA
PRIMARY INDEX (category_2)
ON COMMIT PRESERVE ROWS;

现在重复以下更新,直到0 rows processed

-- find matching categories and assign them a common number    
UPDATE vt_tmp FROM
 ( SELECT e2.group_id, e2.category_1, Min(e1.rnk) AS minrnk
   FROM vt_tmp e1 JOIN vt_tmp e2
   ON e1.category_2 = e2.category_2
   AND e1.rnk < e2.rnk
   GROUP BY e2.group_id, e2.category_1
 ) x
SET rnk = minrnk
WHERE 
  vt_tmp.group_id = x.group_id
AND vt_tmp.category_1 = x.category_1
;

要获得您最终需要的相关类别:

SELECT group_id, category_1 AS category, rnk AS related_categories
FROM vt_tmp
UNION
SELECT group_id, category_2, rnk 
FROM vt_tmp

要与预期结果完全匹配,您需要添加DENSE_RANK

SELECT group_id, category, Dense_Rank() Over (PARTITION BY group_id ORDER BY related_categories)
FROM
 (
   SELECT group_id, category_1 AS category, rnk AS related_categories
   FROM vt_tmp
   UNION
   SELECT group_id, category_2, rnk 
   FROM vt_tmp
 ) AS dt

答案 1 :(得分:1)

这种解决方案的概念是在穿过边缘时避免循环 它是通过在运行中计算路径的位图并避免添加其category_2已经位于位图中的边来完成的。
所有路径都以自引用边(例如'A' - 'A')开始,以便获取第一个节点位图。
此解决方案将迭代子查询( edge )的结果记录数减少到70.

此解决方案的局限性在于应预先定义位图大小 varbinary文字中的每个数字代表4位  64位='0000000000000000'xb
128位='00000000000000000000000000000000'xb

with        category_group_bitmap (group_id,category_1,bitflag_group)
            as 
            (
                select      group_id
                           ,category_1

                           ,dense_rank () over 
                            (
                                partition by    group_id 
                                order by        sum (distinct category_2_n)
                            ) as bitflag_group

                from        edges

                group by    group_id
                           ,category_1
            )  

           ,recursive edges (n,group_id,category_1,category_2,categories_num,bitmap,category_2_n)
            as
            (
                select      1                                           as n
                           ,m.group_id
                           ,m.category_1
                           ,m.category_2
                           ,gc.categories_num
                           ,setbit ('0000000000000000'xb,m.category_2_n)  as bitmap
                           ,m.category_2_n

                from                    match_detail_category_2_n   as m

                            join        group_categories_num        as gc

                            on          gc.group_id =
                                        m.group_id

                where       m.category_1    =
                            m.category_2

                union all

                select      e.n + 1                             as n
                           ,e.group_id
                           ,e.category_1
                           ,m.category_2
                           ,e.categories_num
                           ,setbit (e.bitmap,m.category_2_n)    as bitmap
                           ,m.category_2_n

                from                    edges                       as e

                            join        match_detail_category_2_n   as m

                            on          m.group_id  =
                                        e.group_id

                                    and m.category_1    =
                                        e.category_2

                where       e.n <   e.categories_num - 1
                        and getbit (e.bitmap,m.category_2_n) = 0
            )

           ,match_detail_category_2_n (group_id,category_1,category_2,category_2_n)
            as 
            (   
                select      m.group_id
                           ,category_1
                           ,category_2

                           ,cast 
                            (
                                dense_rank () over 
                                (
                                    partition by    group_id 
                                    order by        category_2
                                ) - 1 

                                as byteint
                            )

                from        match_detail as m
            )

           ,group_categories_num (group_id,categories_num)
            as 
            (   
                select      group_id
                           ,count (distinct category_1)

                from        match_detail

                group by    group_id
            )

select      *

from        category_group_bitmap
;

答案 2 :(得分:0)

这样可行,但会造成“假脱机空间”#34;真实数据问题。

架构创建:

CREATE VOLATILE TABLE match_detail (
    group_id bigint
    , category_1 varchar(255)
    , category_2 varchar(255)
) PRIMARY INDEX (group_id)
    ON COMMIT PRESERVE ROWS;

INSERT INTO match_detail VALUES (1,'A','B');
INSERT INTO match_detail VALUES (1,'A','A');
INSERT INTO match_detail VALUES (1,'B','A');
INSERT INTO match_detail VALUES (1,'B','C');
INSERT INTO match_detail VALUES (1,'B','B');
INSERT INTO match_detail VALUES (1,'C','B');
INSERT INTO match_detail VALUES (1,'C','D');
INSERT INTO match_detail VALUES (1,'C','C');
INSERT INTO match_detail VALUES (1,'D','C');
INSERT INTO match_detail VALUES (1,'D','E');
INSERT INTO match_detail VALUES (1,'D','D');
INSERT INTO match_detail VALUES (1,'E','D');
INSERT INTO match_detail VALUES (1,'E','F');
INSERT INTO match_detail VALUES (1,'E','E');
INSERT INTO match_detail VALUES (1,'F','E');
INSERT INTO match_detail VALUES (1,'F','G');
INSERT INTO match_detail VALUES (1,'F','F');
INSERT INTO match_detail VALUES (1,'G','F');
INSERT INTO match_detail VALUES (1,'G','G');
INSERT INTO match_detail VALUES (1,'W','X');
INSERT INTO match_detail VALUES (1,'W','W');
INSERT INTO match_detail VALUES (1,'W','Y');
INSERT INTO match_detail VALUES (1,'W','Z');
INSERT INTO match_detail VALUES (1,'X','W');
INSERT INTO match_detail VALUES (1,'X','X');
INSERT INTO match_detail VALUES (1,'Y','W');
INSERT INTO match_detail VALUES (1,'Y','Y');
INSERT INTO match_detail VALUES (1,'Z','W');
INSERT INTO match_detail VALUES (1,'Z','Z');
INSERT INTO match_detail VALUES (2,'L','L');
INSERT INTO match_detail VALUES (2,'M','N');
INSERT INTO match_detail VALUES (2,'N','M');
INSERT INTO match_detail VALUES (2,'M','M');
INSERT INTO match_detail VALUES (2,'N','N');

查询:

WITH

 related_cats AS (
    SELECT
        group_id
        , category_1
        , SUM(DISTINCT bitflag) As bitflag_total
        , DENSE_RANK() OVER (
            PARTITION BY
                group_id
            ORDER BY
                group_id
                , bitflag_total
        ) As bitflag_group

    FROM bitflags

    GROUP BY 1, 2
)

, bitflags As (
    SELECT
        DISTINCT
            group_id
            , category_1
            , category_2
            , CAST
            (
                2 ** (DENSE_RANK() OVER (
                    PARTITION BY
                        group_id

                    ORDER BY group_id
                        , category_2) - 1)

                As bigint
            ) As bitflag

    FROM cat_join
    WHERE depth = 1
)

, RECURSIVE cat_join AS (
    SELECT DISTINCT
        c1.group_id
        , c1.category_1
        , c1.category_2
        , CAST
        (
            n.num_categories - 1
            As integer
        ) As max_depth
        , CASE
            WHEN c1.category_1 = c1.category_2 THEN 1
            ELSE max_depth
        END As depth
        , 1 As recursion

    FROM matches c1
    INNER JOIN num_categories n
        ON c1.group_id = n.group_id

    UNION ALL

    SELECT
        r1.group_id
        , r1.category_1
        , r2.category_2
        , r1.max_depth
        , CASE
            WHEN r1.category_1 = r1.category_2 THEN 1
            WHEN r1.category_1 = r2.category_2 THEN 1
            ELSE r1.depth - 1
        END As cur_depth
        , recursion + 1

        FROM cat_join r1
        INNER JOIN matches r2
            ON r1.group_id = r2.group_id
            AND r1.category_2 = r2.category_1

        WHERE
        r1.category_1 <> r2.category_2
        AND r1.category_1 <> r2.category_1
        AND
            (r1.depth - 1) > 0
)

, matches AS (
    SELECT
        d.group_id
        , d.category_1
        , d.category_2

    FROM match_detail d
    GROUP BY d.group_id, d.category_1, d.category_2
)

, num_categories AS (
    SELECT
        u.group_id
        , COUNT(DISTINCT u.category_2) AS num_categories

    FROM categories u
    GROUP BY u.group_id
)

, categories AS (
    SELECT DISTINCT
        u.group_id
        , u.category_1
        , u.category_2

    FROM match_detail u
)

SELECT *
FROM related_cats