Question

I'm trying to build a classic hierarchical tree in SQL, using SAS (which, as far as I know, does not support WITH RECURSIVE).

Here is the simplified data structure of the existing table:

|USER_ID|SUPERVISOR_ID|

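So, to build the hierarchy, you just join the table to itself x number of times, where SUPERVISOR_ID = USER_ID, to get the data you are looking for. In my company, that is 16 levels.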

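The problem comes up when trying to get a terminating branch for each user. For example, consider User A at level 1, who has Users B, C, D, and E under them at level 2. Using the repeated LEFT JOIN, you would get: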

| -- Level 1 -- | -- Level 2 -- |
     User A          User B
     User A          User C
     User A          User D
     User A          User E

The problem is that User A does not get a terminating branch of their own. The end result I need is:

| -- Level 1 -- | -- Level 2 -- |
     User A           NULL         
     User A          User B
     User A          User C
     User A          User D
     User A          User E

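My first thought was that I could work around this by creating a temp table at each level and then performing a UNION ALL on the results. However, that seems terribly inefficient given the size (16 levels), and I am hoping I'm missing a cleaner solution here.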
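As a minimal sketch of that UNION ALL idea for just two levels: the `users` table, its sample rows, and the `LEVEL1`/`LEVEL2` column names below are made up for illustration and are not the real table.

```sas
/* Hypothetical sample: A is at level 1 and supervises B, C, D, E. */
data users;
  length USER_ID SUPERVISOR_ID $1;
  input USER_ID $ SUPERVISOR_ID $;
datalines;
A .
B A
C A
D A
E A
;
run;

proc sql;
  create table two_levels as
  /* Terminating row: each level-1 user on their own, LEVEL2 left blank */
  select t1.USER_ID as LEVEL1, ' ' as LEVEL2 length=1
  from users t1
  where t1.SUPERVISOR_ID is missing
  union all
  /* One row per direct report of a level-1 user */
  select t1.USER_ID as LEVEL1, t2.USER_ID as LEVEL2
  from users t1
  inner join users t2
    on t2.SUPERVISOR_ID = t1.USER_ID
  where t1.SUPERVISOR_ID is missing
  order by LEVEL1, LEVEL2;
quit;
```

Extending this to 16 levels means one extra SELECT (and one extra self-join) per level, which is where the concern about efficiency comes from.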

Answer 1

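I'm not quite sure I understand the question, but if you want to generate a full listing of all the employees under each supervisor, then this is one way of doing it, assuming that each employee has a unique ID that can appear in either the user or the supervisor column: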

data employees;
input SUPERVISOR_ID USER_ID;
cards;
1 2
1 3
1 4
2 5
2 6
2 7
7 8
;
run;

proc sql;
  create view distinct_employees as 
  select distinct SUPERVISOR_ID as USER_ID from employees
  union
  select distinct USER_ID from employees;
quit;

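data hierarchy;
  /* Compile-time only: put SUPERVISOR_ID in the PDV for the hash to fill */
  if 0 then set employees;
  set distinct_employees;
  if _n_ = 1 then do;
    /* Lookup table: USER_ID -> direct SUPERVISOR_ID */
    declare hash h(dataset:'employees');
    rc = h.definekey('USER_ID');
    rc = h.definedata('SUPERVISOR_ID');
    rc = h.definedone();
  end;
  T_USER_ID = USER_ID;            /* remember the starting employee */
  /* Walk up the tree; each successful find() loads the next supervisor */
  do while(h.find() = 0);
    USER_ID = T_USER_ID;          /* one row per (supervisor, employee) pair */
    output;
    USER_ID = SUPERVISOR_ID;      /* next lookup key: the supervisor just found */
  end;
  drop rc T_USER_ID;
run;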

proc sort data = hierarchy;
  by SUPERVISOR_ID USER_ID;
run;
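One caveat with the traversal above: if the data ever contain a cycle (A reports to B, B reports to A), the DO WHILE loop never ends. Below is a hedged variation that also records how many levels separate each employee from each supervisor and stops after an assumed maximum depth. The `DEPTH` column, the `max_depth` macro variable, and the `hierarchy_depth` data set name are my own additions, not part of the code above.

```sas
data employees;
  input SUPERVISOR_ID USER_ID;
  cards;
1 2
1 3
1 4
2 5
2 6
2 7
7 8
;
run;

proc sql;
  create view distinct_employees as
  select SUPERVISOR_ID as USER_ID from employees
  union
  select USER_ID from employees;
quit;

%let max_depth = 16;  /* assumed upper bound on the number of levels */

data hierarchy_depth;
  if 0 then set employees;
  set distinct_employees;
  if _n_ = 1 then do;
    declare hash h(dataset:'employees');
    rc = h.definekey('USER_ID');
    rc = h.definedata('SUPERVISOR_ID');
    rc = h.definedone();
  end;
  T_USER_ID = USER_ID;
  DEPTH = 0;
  do while(h.find() = 0);
    DEPTH = DEPTH + 1;                  /* levels between employee and supervisor */
    if DEPTH > &max_depth then leave;   /* guard against cyclic data */
    USER_ID = T_USER_ID;
    output;
    USER_ID = SUPERVISOR_ID;
  end;
  drop rc T_USER_ID;
run;
```

With the sample data, employee 8 produces three rows: supervisor 7 at DEPTH 1, supervisor 2 at DEPTH 2, and supervisor 1 at DEPTH 3.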

Answer 2

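Consider some simple process P that creates a rectangle of possible paths from a set of (super_id, user_id) relations.

A path of length N is N levels deep and links (N-1) relations.

Are the values at each level distinct to that level?

No? Compared to the actual paths, P will find cycles, crossover paths, and wraparound paths. A wraparound is when a node at actual path level > 1 is 'found' as a node at level = 1.
Yes? P will find paths, crossover paths, and wraparound paths. Other data constraints or rules may help eliminate them.

Consider 4 simple paths with ambiguous level values: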

data path(keep=L1-L4) rels(keep=row_id super_id user_id);
  array L(4);
  input L(*);
  output path;
  super_id = L(1);
  do i = 2 to dim(L);
    user_id = L(i);
    row_id + 1;   /* running relation number; the row_id-based query further down needs it */
    output rels;
    super_id = user_id;
  end;
datalines;
1 3 1 4
1 5 1 4
2 3 2 3
1 2 3 4
;
run;

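There are only 12 rows of relation data. Neither the paths these pairs live in nor the levels at which they exist are known.

Here is an explicit 2-stage query that assembles 4-level paths from the relations. If the code works, it can be abstracted into a macro.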

proc sql;

  * RELS cross RELS, extensive i/o;
  * get on the induction ladder;

  create table ITER_1 as
  select distinct
    S.super_id as L3 /* parent^2 */
  , S.user_id as L2 /* parent */ 
  , U.user_id as L1 /* leaf */
  from RELS U
  cross join RELS S 
  where S.user_id = U.super_id
  order by L3, L2, L1
  ;

  * ITER_1 cross RELS, little less extensive i/o;
  * if you see the inductive variation you can macroize it;

  create table ITER_2 as
  select distinct
    S.super_id as L4 /* parent^3 */
  , U.L3 /* parent^2 */
  , U.L2 /* parent */
  , U.L1 /* leaf */
  from ITER_1 U
  cross join RELS S
  where S.user_id = U.L3
  order by L4, L3, L2, L1
  ;
quit;
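The inductive step in the two queries above can be sketched as a macro. This is only one possible shape: the macro name `extend_paths`, its `levels=` and `rels=` parameters, and the extra `ITER_0` base table are assumptions of mine, and the ORDER BY clauses are left out for brevity.

```sas
/* Same relation data as above, rebuilt here so the sketch runs on its own */
data RELS(keep=super_id user_id);
  array L(4);
  input L(*);
  super_id = L(1);
  do i = 2 to dim(L);
    user_id = L(i);
    output;
    super_id = user_id;
  end;
datalines;
1 3 1 4
1 5 1 4
2 3 2 3
1 2 3 4
;
run;

%macro extend_paths(levels=4, rels=RELS);
  %local n k top;
  proc sql;
    /* Base case: every relation is a 2-level path (L2 = parent, L1 = leaf) */
    create table ITER_0 as
    select distinct super_id as L2, user_id as L1
    from &rels;

    /* Inductive step: put one more parent on top of each path so far */
    %do n = 1 %to %eval(&levels - 2);
      %let top = %eval(&n + 1);   /* highest level present in ITER_(n-1) */
      create table ITER_&n as
      select distinct
        S.super_id as L%eval(&top + 1)
        %do k = &top %to 1 %by -1;
          , U.L&k
        %end;
      from ITER_%eval(&n - 1) U
      cross join &rels S
      where S.user_id = U.L&top
      ;
    %end;
  quit;
%mend extend_paths;

%extend_paths(levels=4, rels=RELS)
```

For `levels=4` this generates ITER_1 (columns L3, L2, L1) and ITER_2 (columns L4 through L1) with the same join conditions as the hand-written version.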

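The assembler above has no knowledge of path identity and cannot restrict paths to discrete pairs. So there will be cycles, crossovers, and wraparounds.

Found paths (with some explanation)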

 1 : 1 2 3 1   path 4 L3 xover to path 1 L2
 2 : 1 2 3 2   path 4 L3 xover to path 3 L2
 3 : 1 2 3 4   actual
 4 : 1 3 1 2   path 1 L3 xover to path 4 L1
 5 : 1 3 1 3
 6 : 1 3 1 4   actual
 7 : 1 3 1 5
 8 : 1 3 2 3
 9 : 1 5 1 2
10 : 1 5 1 3
11 : 1 5 1 4   actual
12 : 1 5 1 5
13 : 2 3 1 2
14 : 2 3 1 3
15 : 2 3 1 4
16 : 2 3 1 5
17 : 2 3 2 3   actual is actually a cycler too
18 : 3 1 2 3
19 : 3 1 3 1
20 : 3 1 3 2
21 : 3 1 3 4
22 : 3 1 5 1
23 : 3 2 3 1
24 : 3 2 3 2
25 : 3 2 3 4
26 : 5 1 2 3
27 : 5 1 3 1
28 : 5 1 3 2
29 : 5 1 3 4
30 : 5 1 5 1   path 2 L3 cycled to path 2 L1

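If the IDs at each relation level are not found at any other level, cycles are implicitly eliminated. Because there is no path identity information, crossovers cannot be eliminated. The same goes for wraparounds.

More complex SQL can ensure that each relation in a found 'path' occurs only once, and that the contents of each path are distinct. Depending on the actual data, you may still get a large number of false paths.

Highly regular code lends itself to macroization, but the actual SQL run time depends heavily on the actual data and on how the RELS data set is indexed.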

proc sql;

create table ITER_1 as
select 
  L3 /* parent^2 */
, L2 /* parent */ 
, L1 /* leaf */
, R1
, R2
from 
(
  select distinct
    S.super_id as L3 /* parent^2 */
  , S.user_id as L2 /* parent */ 
  , U.user_id as L1 /* leaf */
  , U.row_id as R1
  , S.row_id as R2
  , monotonic() as seq
  from RELS U
  cross join RELS S 
  where S.user_id = U.super_id
    and S.row_id < U.row_id  /* triangular constraint allowed due to symmetry */
)
group by L3, L2, L1
having seq = min(seq)
order by L3, L2, L1
;

create table ITER_2 as
select
  L4 /* parent^3 */ format=6.
, L3 /* parent^2 */ format=6.
, L2 /* parent */ format=6.
, L1 /* leaf */ format=6.
, R1 format=6.
, R2 format=6.
, R3 format=6.
from
(
  select distinct
    S.super_id as L4 /* parent^3 */ format=6.
  , U.L3 /* parent^2 */ format=6.
  , U.L2 /* parent */ format=6.
  , U.L1 /* leaf */ format=6.
  , U.R1 format=6.
  , U.R2 format=6.
  , S.row_id as R3 format=6.
  , monotonic() as seq
  from ITER_1 U
  cross join RELS S
  where S.user_id = U.L3
    and S.row_id ne R1
    and S.row_id ne R2
)
group by L4, L3, L2, L1
having seq = min(seq)
order by L4, L3, L2, L1
;

quit;

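The final tweak for the NULL items requires more SQL.

Can the discovered hierarchy be processed without the NULLs? A DATA step SET with BY processing can detect the end of a level using the LAST. variables.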
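The BY-processing idea can be sketched like this. The `paths` table, its rows, and the `end_of_L2`/`end_of_L3` flag names are hypothetical, and three levels stand in for the full hierarchy.

```sas
/* Hypothetical 3-level paths: L3 = top, L2 = parent, L1 = leaf */
data paths;
  input L3 L2 L1;
datalines;
1 2 5
1 2 6
1 3 7
1 4 8
1 4 9
;
run;

proc sort data=paths;
  by L3 L2 L1;
run;

data level_ends;
  set paths;
  by L3 L2 L1;
  end_of_L2 = last.L2;   /* 1 on the last leaf under the current parent */
  end_of_L3 = last.L3;   /* 1 on the last row under the current top node */
run;
```

Here `end_of_L2` is 1 on the rows for leaves 6, 7, and 9, and `end_of_L3` is 1 only on the final row, so downstream logic can react to the end of a branch without a separate NULL row.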

SQL - Recursive Tree Hierarchy with a Record at Each Level
