Question

I have sets of consecutive integers, organized by type, in table1. All values are between 1 and 10, inclusive.

table1:
row_id  set_id  type    min_value   max_value
1       1       a       1           3
2       2       a       4           10
3       3       a       6           10
4       4       a       2           5
5       5       b       1           9
6       6       c       1           7
7       7       c       3           10
8       8       d       1           2
9       9       d       3           3
10      10      d       4           5
11      11      d       7           10

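In table2, within each type, I want to assemble all possible maximal, non-overlapping sets (gaps that cannot be filled by any set of the correct type are okay). Desired output: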

table2:
row_id  type    group_id    set_id
1       a       1           1
2       a       1           2
3       a       2           1
4       a       2           3
5       a       3           3
6       a       3           4
7       b       4           5
8       c       5           6
9       c       6           7
10      d       7           8
11      d       7           9
12      d       7           10
13      d       7           11
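To pin down what "maximal, non-overlapping" means for the sample data, here is a minimal brute-force sketch in Python. It is not part of the original question and is not meant to scale (it enumerates every subset per type); the names `TABLE1`, `overlaps` and `maximal_groups` are made up for illustration. A group counts as maximal when every set left out of it overlaps at least one member.

```python
from itertools import combinations

# Sample rows from table1: (set_id, type, min_value, max_value).
TABLE1 = [
    (1, "a", 1, 3), (2, "a", 4, 10), (3, "a", 6, 10), (4, "a", 2, 5),
    (5, "b", 1, 9), (6, "c", 1, 7), (7, "c", 3, 10),
    (8, "d", 1, 2), (9, "d", 3, 3), (10, "d", 4, 5), (11, "d", 7, 10),
]


def overlaps(x, y):
    """True when two inclusive integer ranges share at least one value."""
    return x[2] <= y[3] and y[2] <= x[3]


def maximal_groups(rows):
    """Every non-overlapping subset, per type, that no other set can join."""
    groups = []
    for type_ in sorted({r[1] for r in rows}):
        sets = [r for r in rows if r[1] == type_]
        for size in range(1, len(sets) + 1):
            for combo in combinations(sets, size):
                disjoint = not any(
                    overlaps(x, y) for x, y in combinations(combo, 2))
                # Maximal: each set left out collides with some member.
                full = all(any(overlaps(s, m) for m in combo)
                           for s in sets if s not in combo)
                if disjoint and full:
                    groups.append((type_, sorted(s[0] for s in combo)))
    return sorted(groups)


for type_, set_ids in maximal_groups(TABLE1):
    print(type_, set_ids)
```

Under that reading of the requirement, the sketch yields the same seven groups as the desired table2, which is a useful cross-check for any faster query.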

My current idea is to use the fact that the number of possible values is limited. Steps:

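1. Find all sets in table1 that contain the value 1. Copy them into table2.
2. Find all sets in table1 that contain the value 2 and are not already in table2.
3. Join the sets from (2) to table2 on type, where the set's min_value is greater than the group's greatest max_value.
4. For the sets from (2) that did not join in (3), insert them into table2. These start new groups, which may be extended later.
5. Repeat steps (2) through (4) for the values 3 through 10.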

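I think this will work, but it has a lot of painful steps, especially (2), finding the sets that are not already in table2, and (4), finding the sets that did not join.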

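Do you know of a faster, more efficient way to do this? My real data has millions of sets, thousands of types, and hundreds of values (fortunately, as in the example, the values are bounded), so scalability is essential.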

I am using PLSQL Developer with Oracle 10g (not 11g, as I stated before - thanks, IT department). Thanks!

Answer 1

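If you can identify all the groups and their starting set_id, then you can use a recursive approach and do it all in a single statement, rather than populating a table iteratively. However, you would need to benchmark both approaches for speed/efficiency and for resource consumption; whether either will scale to your data volumes within your system's available resources needs to be verified.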

If I understand when you decide to start a new group, you can identify all the groups at once with a query like this:

with t as (
  select t1.type, t1.set_id, t1.min_value, t1.max_value,
    t2.set_id as next_set_id, t2.min_value as next_min_value,
    t2.max_value as next_max_value
  from table1 t1
  left join table1 t2 on t2.type = t1.type and t2.min_value > t1.max_value
  where not exists (
    select 1
    from table1 t3
    where t3.type = t1.type
    and t3.max_value < t1.min_value
  )
)
select t.type, t.set_id, t.min_value, t.max_value,
  t.next_set_id, t.next_min_value, t.next_max_value,
  row_number() over (order by t.type, t.min_value, t.next_min_value) as grp_id
from t
where not exists (
  select 1 from t t2
  where t2.type = t.type
  and t2.next_max_value < t.next_min_value
)
order by grp_id;

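The tricky bit here is getting all three groups for a, specifically the two groups that start with set_id = 1, but only one group for d. The inner select (in the CTE) uses the not exists clause to look for sets that do not have a lower non-overlapping range, and outer-joins to the same table to get the next set(s) that do not overlap. That gives you the two groups that start with set_id = 1, but also three rows that start with set_id = 8. The outer select then ignores everything but the lowest non-overlapping next set, using a second not exists clause, without having to hit the real table again.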

That gives:

TYPE SET_ID  MIN_VALUE  MAX_VALUE NEXT_SET_ID NEXT_MIN_VALUE NEXT_MAX_VALUE GRP_ID
---- ------ ---------- ---------- ----------- -------------- -------------- ------
a         1          1          3           2              4             10      1 
a         1          1          3           3              6             10      2 
a         4          2          5           3              6             10      3 
b         5          1          9                                                4 
c         6          1          7                                                5 
c         7          3         10                                                6 
d         8          1          2           9              3              3      7
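The seven anchor rows can be reproduced outside Oracle. The sketch below is an assumption-laden convenience rather than part of the answer: it loads the sample data into an in-memory SQLite database through Python's `sqlite3` module and runs the same CTE, leaving out `row_number()` and relying on the `order by` instead. The names `ROWS`, `ANCHOR_SQL` and `anchor_rows` are made up here.

```python
import sqlite3

# Sample rows from the question: (set_id, type, min_value, max_value).
# row_id is left out because the query never reads it.
ROWS = [
    (1, "a", 1, 3), (2, "a", 4, 10), (3, "a", 6, 10), (4, "a", 2, 5),
    (5, "b", 1, 9), (6, "c", 1, 7), (7, "c", 3, 10),
    (8, "d", 1, 2), (9, "d", 3, 3), (10, "d", 4, 5), (11, "d", 7, 10),
]

# Same shape as the query above; the position in the ordered result
# plays the role of grp_id.
ANCHOR_SQL = """
with t as (
  select t1.type, t1.set_id, t1.min_value, t1.max_value,
    t2.set_id as next_set_id, t2.min_value as next_min_value,
    t2.max_value as next_max_value
  from table1 t1
  left join table1 t2 on t2.type = t1.type and t2.min_value > t1.max_value
  where not exists (
    select 1 from table1 t3
    where t3.type = t1.type and t3.max_value < t1.min_value
  )
)
select t.type, t.set_id, t.next_set_id
from t
where not exists (
  select 1 from t t2
  where t2.type = t.type and t2.next_max_value < t.next_min_value
)
order by t.type, t.min_value, t.next_min_value
"""


def anchor_rows():
    """Return (type, set_id, next_set_id) for each row that starts a group."""
    con = sqlite3.connect(":memory:")
    con.execute("create table table1 (set_id integer, type text, "
                "min_value integer, max_value integer)")
    con.executemany("insert into table1 values (?, ?, ?, ?)", ROWS)
    rows = con.execute(ANCHOR_SQL).fetchall()
    con.close()
    return rows


for grp_id, row in enumerate(anchor_rows(), start=1):
    print(grp_id, row)
```

SQLite and Oracle differ in many details, so this only checks the logic of the query on the sample data; it says nothing about performance on the real tables.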

You can then use that as the anchor member in a recursive subquery factoring clause:

with t as (
  ...
),
r (type, set_id, min_value, max_value,
    next_set_id, next_min_value, next_max_value, grp_id) as (
  select t.type, t.set_id, t.min_value, t.max_value,
    t.next_set_id, t.next_min_value, t.next_max_value,
    row_number() over (order by t.type, t.min_value, t.next_min_value)
  from t
  where not exists (
    select 1 from t t2
    where t2.type = t.type
    and t2.next_max_value < t.next_min_value
  )
  ...

If you ended the r CTE there and just did select * from r, you would get the same seven groups.

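The recursive member then uses the next set_id and its range from that query as the next member of each group, and repeats the outer-join/not-exists lookup to find the next set(s) again, stopping when there is no next non-overlapping set: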

  ...
  union all
  select r.type, r.next_set_id, r.next_min_value, r.next_max_value,
    t.set_id, t.min_value, t.max_value, r.grp_id
  from r
  left join table1 t
  on t.type = r.type
  and t.min_value > r.next_max_value
  and not exists (
    select 1 from table1 t2
    where t2.type = r.type
    and t2.min_value > r.next_max_value
    and t2.max_value < t.min_value
  )
  where r.next_set_id is not null -- to stop looking when you reach a leaf node
)
...

Finally, you have a query based on the recursive CTE to get the columns you want and to specify the order:

...
select r.type, r.grp_id, r.set_id
from r
order by r.type, r.grp_id, r.min_value;

Which gets:

TYPE     GRP_ID     SET_ID
---- ---------- ----------
a             1          1 
a             1          2 
a             2          1 
a             2          3 
a             3          4 
a             3          3 
b             4          5 
c             5          6 
c             6          7 
d             7          8 
d             7          9 
d             7         10 
d             7         11
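For readers who want to trace how the anchor and recursive members interact, here is a rough Python rendering of the same steps. It is a sketch, not the answer's code: `Row`, `later_sets`, `anchor_member` and `assemble` are hypothetical names, and each pass of the `while` loop stands in for one iteration of the recursive member.

```python
from collections import namedtuple

Row = namedtuple("Row", "set_id type lo hi")
TABLE1 = [Row(*r) for r in [
    (1, "a", 1, 3), (2, "a", 4, 10), (3, "a", 6, 10), (4, "a", 2, 5),
    (5, "b", 1, 9), (6, "c", 1, 7), (7, "c", 3, 10),
    (8, "d", 1, 2), (9, "d", 3, 3), (10, "d", 4, 5), (11, "d", 7, 10),
]]


def later_sets(type_, hi):
    """All sets of the given type whose range starts strictly after hi."""
    return [s for s in TABLE1 if s.type == type_ and s.lo > hi]


def anchor_member():
    """(set, next_set) pairs that start a group, as in the anchor query."""
    starts = [s for s in TABLE1
              if not any(o.type == s.type and o.hi < s.lo for o in TABLE1)]
    t = [(s, n) for s in starts
         for n in (later_sets(s.type, s.hi) or [None])]
    # Second "not exists": drop a pair if another next set ends before it.
    kept = [(s, n) for s, n in t
            if n is None or not any(
                m is not None and p.type == s.type and m.hi < n.lo
                for p, m in t)]
    kept.sort(key=lambda p: (p[0].type, p[0].lo, p[1].lo if p[1] else 0))
    return kept


def assemble():
    """Extend every group until no later non-overlapping set remains."""
    frontier = [(s, n, g) for g, (s, n) in enumerate(anchor_member(), 1)]
    out = []
    while frontier:
        following = []
        for cur, nxt, grp in frontier:
            out.append((cur.type, grp, cur.set_id, cur.lo))
            if nxt is None:  # leaf: nothing left to add to this group
                continue
            later = later_sets(nxt.type, nxt.hi)
            nearest = [c for c in later
                       if not any(o.hi < c.lo for o in later)]
            following += [(nxt, c, grp) for c in (nearest or [None])]
        frontier = following
    out.sort(key=lambda r: (r[0], r[1], r[3]))
    return [(type_, grp, set_id) for type_, grp, set_id, _ in out]


for line in assemble():
    print(*line)
```

As in the SQL, a group that branches after its second member keeps a single grp_id for both branches; the sample data never does that, so the rows match the result above.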

SQL Fiddle demo

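If you wanted to, you could show the min/max values for each set, and you could track and show the min/max value for each group. I've just shown the columns from the question.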

Answer 2

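For Oracle 10g you can't use recursive CTEs, but with a bit of work you can do something similar with the connect by syntax. First you need to generate a CTE or inline view that has all the non-overlapping links, which you can do with: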

select t1.type, t1.set_id, t1.min_value, t1.max_value,
  t2.set_id as next_set_id, t2.min_value as next_min_value,
  t2.max_value as next_max_value,
  row_number() over (order by t1.type, t1.set_id, t2.set_id) as group_id
from table1 t1
left join table1 t2 on t2.type = t1.type
and t2.min_value > t1.max_value
where not exists (
  select 1
  from table1 t4
  where t4.type = t1.type
  and t4.min_value > t1.max_value
  and t4.max_value < t2.min_value
)
order by t1.type, group_id, t1.set_id, t2.set_id;

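This took a bit of experimentation, and it's certainly possible I've missed or lost something about the rules in the process; but this gives you 12 pseudo-rows, and as in my previous answer, it allows the two separate chains starting with a/1 to be followed while constraining the d values to a single chain: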

TYPE SET_ID  MIN_VALUE  MAX_VALUE NEXT_SET_ID NEXT_MIN_VALUE NEXT_MAX_VALUE GROUP_ID
---- ------ ---------- ---------- ----------- -------------- -------------- --------
a         1          1          3           2              4             10        1 
a         1          1          3           3              6             10        2 
a         2          4         10                                                  3 
a         3          6         10                                                  4 
a         4          2          5           3              6             10        5 
b         5          1          9                                                  6 
c         6          1          7                                                  7 
c         7          3         10                                                  8 
d         8          1          2           9              3              3        9 
d         9          3          3          10              4              5       10 
d        10          4          5          11              7             10       11 
d        11          7         10                                                 12
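The link query uses only a self outer join and a not exists, so it can be checked against the sample data in SQLite as well. This is a sketch under that assumption, not Oracle output: `ROWS`, `LINK_SQL` and `link_rows` are illustrative names, and `row_number()` is dropped in favor of the `order by`.

```python
import sqlite3

# Sample rows from the question: (set_id, type, min_value, max_value).
ROWS = [
    (1, "a", 1, 3), (2, "a", 4, 10), (3, "a", 6, 10), (4, "a", 2, 5),
    (5, "b", 1, 9), (6, "c", 1, 7), (7, "c", 3, 10),
    (8, "d", 1, 2), (9, "d", 3, 3), (10, "d", 4, 5), (11, "d", 7, 10),
]

# A link survives only if no other set fits between t1 and t2.
LINK_SQL = """
select t1.type, t1.set_id, t2.set_id as next_set_id
from table1 t1
left join table1 t2 on t2.type = t1.type
and t2.min_value > t1.max_value
where not exists (
  select 1
  from table1 t4
  where t4.type = t1.type
  and t4.min_value > t1.max_value
  and t4.max_value < t2.min_value
)
order by t1.type, t1.set_id, t2.set_id
"""


def link_rows():
    """Return (type, set_id, next_set_id) for every non-overlapping link."""
    con = sqlite3.connect(":memory:")
    con.execute("create table table1 (set_id integer, type text, "
                "min_value integer, max_value integer)")
    con.executemany("insert into table1 values (?, ?, ?, ?)", ROWS)
    rows = con.execute(LINK_SQL).fetchall()
    con.close()
    return rows


for row in link_rows():
    print(row)
```

Sets with no later non-overlapping neighbor still appear once, with an empty next_set_id, because a null t2.min_value makes the not exists condition pass.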

This can be used as a CTE and queried with a connect by loop:

with t as (
   ... -- same as above query
)
select t1.type,
  dense_rank() over (partition by null
    order by connect_by_root group_id) as group_id,
  t1.set_id
from t t1
connect by type = prior type
and set_id = prior next_set_id
start with not exists (
  select 1 from table1 t2
  where t2.type = t1.type
  and t2.max_value < t1.min_value
)
and not exists (
  select 1 from t t3
  where t3.type = t1.type
  and t3.next_max_value < t1.next_min_value
)
order by t1.type, group_id, t1.min_value;

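The dense_rank() makes the group IDs contiguous; I'm not sure whether you actually need those at all, or whether their sequence matters, so it's really optional. connect_by_root gives the group ID for the start of the chain, so although there were 12 rows and 12 group_id values in the initial query, they don't all appear in the final result.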

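The connection is via two prior values: the type and the next set ID found in the initial query. That creates all the chains, but on its own it would also include shorter chains: for d you would see 8,9,10,11 but also 9,10,11 and 10,11, which you don't want as separate groups. Those are eliminated by the start with conditions, which could maybe be simplified.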

That gives:

TYPE GROUP_ID SET_ID
---- -------- ------
a           1      1 
a           1      2 
a           2      1 
a           2      3 
a           3      4 
a           3      3 
b           4      5 
c           5      6 
c           6      7 
d           7      8 
d           7      9 
d           7     10 
d           7     11
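The effect of the start with filter is easy to see in a small Python sketch. It only approximates connect by: it enumerates root-to-leaf paths over the link table instead of Oracle's row-by-row hierarchy, and it applies just the first start with condition (no lower non-overlapping set). The names `TABLE1`, `links`, `chains` and `groups` are made up for this illustration.

```python
# Sample data keyed by set_id: (type, min_value, max_value).
TABLE1 = {
    1: ("a", 1, 3), 2: ("a", 4, 10), 3: ("a", 6, 10), 4: ("a", 2, 5),
    5: ("b", 1, 9), 6: ("c", 1, 7), 7: ("c", 3, 10),
    8: ("d", 1, 2), 9: ("d", 3, 3), 10: ("d", 4, 5), 11: ("d", 7, 10),
}


def links():
    """Map each set_id to its nearest later non-overlapping set_ids."""
    out = {}
    for sid, (typ, lo, hi) in TABLE1.items():
        later = [(s, v) for s, v in TABLE1.items()
                 if v[0] == typ and v[1] > hi]
        # Keep a candidate only if no other later set ends before it starts.
        out[sid] = [s for s, v in later
                    if not any(o[2] < v[1] for _, o in later)]
    return out


def chains(start, nxt):
    """Every root-to-leaf path from start, following next_set_id links."""
    if not nxt[start]:
        return [[start]]
    return [[start] + tail for n in nxt[start] for tail in chains(n, nxt)]


def groups(only_true_starts=True):
    """Chains from every set, or only from sets with nothing below them."""
    nxt = links()
    roots = sorted(TABLE1)
    if only_true_starts:
        # Counterpart of the first START WITH condition.
        roots = [s for s in roots
                 if not any(v[0] == TABLE1[s][0] and v[2] < TABLE1[s][1]
                            for v in TABLE1.values())]
    return [c for r in roots for c in chains(r, nxt)]


print(groups())                  # chains kept as groups
print(groups(only_true_starts=False))  # includes the shorter d chains
```

Without the filter every set becomes a root, so the d chains 9,10,11 and 10,11 show up as extra groups; with it, only chains that begin at the lowest possible set remain.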

SQL Fiddle demo

SQL: Assemble Non-Overlapping Sets
