Extract families of rows from linked rows

Time: 2015-12-03 21:31:15

Tags: sql plsql self-join listagg

I have a table of linked transactions similar to the one below:

+----+----+----+
| #  | A  | B  |
+----+----+----+
| 1  | 1  | 4  |
| 2  | 3  | 5  |
| 3  | 4  | 6  |
| 4  | 5  | 8  |
| 5  | 6  | 1  |
| 6  | 7  | 7  |
| 7  | 8  | 3  |
| 8  | 9  | 3  |
| 9  | 10 | 4  |
| 10 | 11 | 14 |
| 11 | 2  | 2  |
| 12 | 12 | 4  |
| 13 | 13 | 14 |
| 14 | 14 | 9  |
| 15 | 15 | 1  |
+----+----+----+

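The numbers under columns A and B are transaction IDs. So, for example, transaction 1 is linked to transaction 4 by some criterion, transaction 3 is linked to transaction 5, transaction 4 is linked to transaction 6, and so on.

Transactions 2 and 7 are not linked to any other transaction, so they are linked to themselves.

What I want to extract from this table are families of transactions. Since transactions 1 and 4 are linked, 4 and 6 are linked, 10 and 4 are linked, and so on, they all belong to one family: (1, 4, 6, 10, 12, 15).

I want to label each family with its lowest transaction ID, the master transaction. Ideally, the output would look like this: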

+----+------+-------------+
| #  | Tran | Master_tran |
+----+------+-------------+
| 1  | 1    | 1           |
| 2  | 3    | 3           |
| 3  | 4    | 1           |
| 4  | 5    | 3           |
| 5  | 6    | 1           |
| 6  | 7    | 7           |
| 7  | 8    | 3           |
| 8  | 9    | 3           |
| 9  | 10   | 1           |
| 10 | 11   | 3           |
| 11 | 2    | 2           |
| 12 | 12   | 1           |
| 13 | 13   | 3           |
| 14 | 14   | 3           |
| 15 | 15   | 1           |
+----+------+-------------+

I have been experimenting with a self-join:

SELECT     t1.a as x, 
           least (min(t1.b), min(t2.a)) as y  
FROM       test   t1 
LEFT JOIN  test   t2 on t2.b = t1.a  
GROUP BY   t1.a 
ORDER BY   t1.a asc

This code gives the following output:

+------+----+---+
| Col1 | X  | Y |
+------+----+---+
|    1 |  1 | 4 |
|    2 |  2 | 2 |
|    3 |  3 | 5 |
|    4 |  4 | 1 |
|    5 |  5 | 3 |
|    6 |  6 | 1 |
|    7 |  7 | 7 |
|    8 |  8 | 3 |
|    9 |  9 | 3 |
|   10 | 10 |   |
|   11 | 11 |   |
|   12 | 12 |   |
|   13 | 13 |   |
|   14 | 14 | 9 |
|   15 | 15 |   |
+------+----+---+

I'm not sure what is wrong with my code. Can someone point me in the right direction? Thanks!

2 answers:

Answer 0: (score: 0)

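In principle you need a CONNECT BY query to solve a hierarchical problem like this. Because your data contains loops, you also need the NOCYCLE clause, which drops the last link of a loop; that is fine, because that link can never be part of the answer. Your links also run in both directions (e.g. (13,14) and (14,9)), so you have to be careful to include each link in your query in both directions (twice!).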
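As a small illustration of what NOCYCLE does, here is a sketch on just the loop 1-4, 4-6, 6-1 from the sample data. The inline links row set, the START WITH clause and the column aliases are assumptions made for this illustration; they are not part of the full query further down.

```sql
-- Hypothetical three-link loop taken from the sample data: 1-4, 4-6, 6-1.
WITH links AS (
  SELECT 1 AS t_parent, 4 AS t_child FROM dual UNION ALL
  SELECT 4, 6 FROM dual UNION ALL
  SELECT 6, 1 FROM dual
)
SELECT t_parent
     , t_child
     , LEVEL                    AS lvl
     , CONNECT_BY_ROOT t_parent AS root_tran
     , CONNECT_BY_ISCYCLE       AS closes_loop  -- 1 if this row has a child that is also its ancestor
  FROM links
 START WITH t_parent = 1                        -- assumption: start from one link to keep the output small
CONNECT BY NOCYCLE PRIOR t_child = t_parent;
```

Without NOCYCLE, Oracle rejects the same query with ORA-01436 (CONNECT BY loop in user data).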

WITH t_order
     AS (SELECT qt.qt_id, qt.qt_a, qt.qt_b, LEAST( qt.qt_a, qt.qt_b ) AS t_parent, GREATEST( qt.qt_a, qt.qt_b ) AS t_child
       FROM query_test qt
     UNION
     SELECT qb.qt_id, qb.qt_a, qb.qt_b, GREATEST( qb.qt_a, qb.qt_b ) AS t_parent, LEAST( qb.qt_a, qb.qt_b ) AS t_child
       FROM query_test qb)
, hier
  AS (SELECT     ps.qt_id
              , ps.qt_a
              , ps.qt_b
              , t_parent
              , t_child
              , LEVEL
              , CONNECT_BY_ROOT t_parent AS prev_tran
           FROM t_order ps
     CONNECT BY NOCYCLE PRIOR t_child = t_parent)
SELECT   hr.qt_id, hr.qt_a, MIN( hr.prev_tran ) AS master_tran
  FROM hier hr
GROUP BY hr.qt_id, hr.qt_a
ORDER BY hr.qt_id, hr.qt_a;

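This will solve your problem, but it may become very slow if it has to process 100,000 records. The SQL statement also becomes hard to read if you need to combine this approach with many other columns. In that case, factor all the qt.qt_* columns out of the hierarchy and join them back in the final select: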

WITH t_order
     AS (SELECT DISTINCT tran, root_tran
           FROM (SELECT LEAST( qt.qt_a, qt.qt_b ) AS tran, GREATEST( qt.qt_a, qt.qt_b ) AS root_tran
                   FROM query_test qt
                 UNION
                 SELECT GREATEST( qb.qt_a, qb.qt_b ) AS tran, LEAST( qb.qt_a, qb.qt_b ) AS root_tran
                   FROM query_test qb))
   , hier
     AS (SELECT DISTINCT tran, root_tran
           FROM (SELECT     tran, CONNECT_BY_ROOT root_tran AS root_tran
                       FROM t_order
                 CONNECT BY NOCYCLE PRIOR tran = root_tran)
          WHERE tran >= root_tran)
SELECT   qt.qt_id
       , qt.qt_a
       , MIN( LEAST( h1.root_tran, h2.root_tran ) ) AS master_tran
    FROM query_test qt
         INNER JOIN hier h1 ON qt.qt_a = h1.tran
         INNER JOIN hier h2 ON qt.qt_b = h2.tran
GROUP BY qt.qt_id, qt.qt_a
ORDER BY qt.qt_id, qt.qt_a;

I was not able to test the last statement.
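For anyone who wants to try the statements above, a possible test setup is sketched here. The table and column names (query_test, qt_id, qt_a, qt_b) come from the queries; the NUMBER types and the primary key are assumptions. The rows are the sample data from the question.

```sql
-- Assumed structure: names from the queries above, types and key are guesses.
CREATE TABLE query_test (
  qt_id NUMBER PRIMARY KEY,
  qt_a  NUMBER NOT NULL,
  qt_b  NUMBER NOT NULL
);

-- Sample rows from the question (#, A, B).
INSERT INTO query_test (qt_id, qt_a, qt_b) VALUES (1, 1, 4);
INSERT INTO query_test (qt_id, qt_a, qt_b) VALUES (2, 3, 5);
INSERT INTO query_test (qt_id, qt_a, qt_b) VALUES (3, 4, 6);
INSERT INTO query_test (qt_id, qt_a, qt_b) VALUES (4, 5, 8);
INSERT INTO query_test (qt_id, qt_a, qt_b) VALUES (5, 6, 1);
INSERT INTO query_test (qt_id, qt_a, qt_b) VALUES (6, 7, 7);
INSERT INTO query_test (qt_id, qt_a, qt_b) VALUES (7, 8, 3);
INSERT INTO query_test (qt_id, qt_a, qt_b) VALUES (8, 9, 3);
INSERT INTO query_test (qt_id, qt_a, qt_b) VALUES (9, 10, 4);
INSERT INTO query_test (qt_id, qt_a, qt_b) VALUES (10, 11, 14);
INSERT INTO query_test (qt_id, qt_a, qt_b) VALUES (11, 2, 2);
INSERT INTO query_test (qt_id, qt_a, qt_b) VALUES (12, 12, 4);
INSERT INTO query_test (qt_id, qt_a, qt_b) VALUES (13, 13, 14);
INSERT INTO query_test (qt_id, qt_a, qt_b) VALUES (14, 14, 9);
INSERT INTO query_test (qt_id, qt_a, qt_b) VALUES (15, 15, 1);
COMMIT;
```

If either statement is correct, its master_tran column should match the Master_tran column of the desired output in the question.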

Answer 1: (score: 0)

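I may have come up with another solution. Instead of using CONNECT BY, you can double the links, and keep doubling as often as you need. The query that retrieves all links stays the same, but it is followed by a simple query that replaces the original links with all distinct combinations of two links.
Including the link formed by tran_a and tran_b, you have 2 + 1 + 2 links, so you can find paths of up to 5 links. If that is too short, you insert an identical subquery below the previous one, and now it is 4 + 1 + 4, which makes paths of up to 9 links. As you can see, each added subquery doubles the maximum path length, while the performance cost is only slightly higher.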
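To make a single doubling step concrete, here is a minimal sketch on a hypothetical chain of three transactions, 1-2-3. The inline rows are made up for the illustration; the CTE names mirror the ones used in the queries below.

```sql
-- Hypothetical chain 1-2-3, with each link stored in both directions.
WITH double_0 AS (
  SELECT 1 AS root_tran, 2 AS tran FROM dual UNION ALL
  SELECT 2, 1 FROM dual UNION ALL
  SELECT 2, 3 FROM dual UNION ALL
  SELECT 3, 2 FROM dual
)
, double_1 AS (
  -- One doubling step: every pair connected by exactly two links.
  SELECT DISTINCT oa.root_tran, ob.tran
    FROM double_0 oa INNER JOIN double_0 ob ON oa.tran = ob.root_tran
)
SELECT root_tran, tran
  FROM double_1
 ORDER BY root_tran, tran;
-- Returns (1,1), (1,3), (2,2), (3,1), (3,3).
```

The adjacent pair (1,2) is not in double_1, because every row there is exactly two links long. That is why the final query joins on both tran_a and tran_b: the link between them supplies the +1 in 2 + 1 + 2.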

First, the query for your demo data:

WITH double_0
     AS (SELECT DISTINCT root_tran, tran
           FROM ( SELECT LEAST( td_0.tran_a, td_0.tran_b ) AS root_tran
                       , GREATEST( td_0.tran_a, td_0.tran_b ) AS tran
                    FROM tran_demo td_0
                  UNION
                  SELECT GREATEST( qb.tran_a, qb.tran_b ) AS root_tran
                       , LEAST( qb.tran_a, qb.tran_b ) AS tran
                    FROM tran_demo qb ))
   , double_1
     AS (SELECT DISTINCT oa.root_tran, ob.tran
           FROM double_0 oa INNER JOIN double_0 ob ON oa.tran = ob.root_tran)
SELECT   td_1.td_id
       , td_1.tran_a
       , MIN( LEAST( d1.root_tran, d2.root_tran ) ) AS master_tran
    FROM tran_demo td_1
         INNER JOIN double_1 d1 ON td_1.tran_a = d1.tran
         INNER JOIN double_1 d2 ON td_1.tran_b = d2.tran
GROUP BY td_1.td_id, td_1.tran_a
ORDER BY td_1.td_id, td_1.tran_a;

Then, this is how you modify it.
Note that the final query now reads from double_2.

WITH double_0
     AS (SELECT DISTINCT root_tran, tran
           FROM ( SELECT LEAST( td_0.tran_a, td_0.tran_b ) AS root_tran
                       , GREATEST( td_0.tran_a, td_0.tran_b ) AS tran
                    FROM tran_demo td_0
                  UNION
                  SELECT GREATEST( qb.tran_a, qb.tran_b ) AS root_tran
                       , LEAST( qb.tran_a, qb.tran_b ) AS tran
                    FROM tran_demo qb ))
   , double_1
     AS (SELECT DISTINCT oa.root_tran, ob.tran
           FROM double_0 oa INNER JOIN double_0 ob ON oa.tran = ob.root_tran)
   , double_2
     AS (SELECT DISTINCT oa.root_tran, ob.tran
           FROM double_1 oa INNER JOIN double_1 ob ON oa.tran = ob.root_tran)
SELECT   td_1.td_id
       , td_1.tran_a
       , MIN( LEAST( d1.root_tran, d2.root_tran ) ) AS master_tran
    FROM tran_demo td_1
         INNER JOIN double_2 d1 ON td_1.tran_a = d1.tran
         INNER JOIN double_2 d2 ON td_1.tran_b = d2.tran
GROUP BY td_1.td_id, td_1.tran_a
ORDER BY td_1.td_id, td_1.tran_a;

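Finally, a query to check whether the path length you are using is still sufficient. You add the next level and subtract the current level. As long as this query returns no rows, the current query is correct.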

WITH double_0
     AS (SELECT DISTINCT root_tran, tran
           FROM ( SELECT LEAST( td_0.tran_a, td_0.tran_b ) AS root_tran
                       , GREATEST( td_0.tran_a, td_0.tran_b ) AS tran
                    FROM tran_demo td_0
                  UNION
                  SELECT GREATEST( qb.tran_a, qb.tran_b ) AS root_tran
                       , LEAST( qb.tran_a, qb.tran_b ) AS tran
                    FROM tran_demo qb ))
   , double_1
     AS (SELECT DISTINCT oa.root_tran, ob.tran
           FROM double_0 oa INNER JOIN double_0 ob ON oa.tran = ob.root_tran)
   , double_2
     AS (SELECT DISTINCT oa.root_tran, ob.tran
           FROM double_1 oa INNER JOIN double_1 ob ON oa.tran = ob.root_tran)
SELECT   td_1.tran_a
       , MIN( LEAST( d1.root_tran, d2.root_tran ) ) AS master_tran
    FROM tran_demo td_1
         INNER JOIN double_2 d1 ON td_1.tran_a = d1.tran
         INNER JOIN double_2 d2 ON td_1.tran_b = d2.tran
GROUP BY td_1.tran_a
MINUS
SELECT   td_2.tran_a
       , MIN( LEAST( d1.root_tran, d2.root_tran ) ) AS master_tran
    FROM tran_demo td_2
         INNER JOIN double_1 d1 ON td_2.tran_a = d1.tran
         INNER JOIN double_1 d2 ON td_2.tran_b = d2.tran
GROUP BY td_2.tran_a
ORDER BY tran_a;

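You will have to do the performance testing yourself. I am optimistic, since the subqueries are cheap and each one doubles the effective path length. Sooner or later this should be faster than the previous solution.
By the way, the remark about ordering the original links applies here too! If it works, please mark my answer.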