Question

我有一个带有分层代码列表变量的数据集。层次结构的逻辑由LEVEL变量和CODE字符变量的前缀结构决定。有6个（代码长度从1到6）“聚合”级别和终端级别（代码长度为10个字符）。

我需要更新节点变量（终端节点的数量 - 聚合级别不计入“更高”的聚合，只计算终端节点） - 所以一个级别的计数总和，例如每个级别5的总数count与每个6级相同。我需要计算（总结）权重到“更高”级别的节点。

注意：我偏移了输出表的NODES和WEIGHT变量，这样你就可以更好地看到我所说的内容（只需将每个偏移量中的数字加起来，你得到相同的值）。

EDIT1：可以使用相同的代码进行多次观察。独特的观察结果是3个变量代码+ var1 + var2的组合。

输入表：

ID   level code         var1  var2  nodes  weight  myIndex
1    1     1            .     .     999    999     999
2    2     11           .     .     999    999     999
3    3     111          .     .     999    999     999
4    4     1111         .     .     999    999     999
5    5     11111        .     .     999    999     999
6    6     111111       .     .     999    999     999
7   10     1111119999   01    1     1      0.1     105,5
8   10     1111119999   01    2     1      0.1     109,1
9    6     111112       .     .     999    999     999
10  10     1111120000   01    1     1      0.5      95,0
11   5     11119        .     .     999    999     999
12   6     111190       .     .     999    999     999
13  10     1111901000   01    1     1      0.1      80,7
14  10     1111901000   02    1     1      0.2     105,5

所需的输出表：

ID   level code         var1  var2  nodes    weight              myIndex
1    1     1            .     .     5        1.0                  98,1
2    2     11           .     .     5        1.0                  98,1
3    3     111          .     .     5        1.0                  98,1
4    4     1111         .     .     5        1.0                  98,1
5    5     11111        .     .       3          0.7              98,5
6    6     111111       .     .         2            0.2         107,3
7   10     1111119999   01    1           1               0.1    105,5  
8   10     1111119999   01    2           1               0.1    109,1
9    6     111112       .     .         1            0.5          95,0
10  10     1111120000   01    1           1               0.5     95,0
11   5     11119        .     .       2          0.3              97,2
12   6     111190       .     .         2            0.3          97,2
13  10     1111901000   01    1           1               0.1     80,7
14  10     1111901000   02    1           1               0.2    105,5

这是我提出的代码。它的工作原理就像我想要的那样，但男人，它真的很慢。我需要更快的东西，因为这是web服务的一部分，必须根据请求“立即”运行。欢迎任何关于加快代码或任何其他解决方案的建议。

%macro doit;

data temporary;
    set have;
run;

%do i=6 %to 2 %by -1;
    %if &i = 6 %then %let x = 10;
    %else %let x = (&i+1);

    proc sql noprint;
        select count(code)
        into :cc trimmed
        from have
        where level = &i;

        select code
        into :id1 - :id&cc
        from have
        where level = &i;
    quit;

    %do j=1 %to &cc.;

        %let idd = &&id&j;

        proc sql;
        update have t1
            set nodes = (
                       select sum(nodes)
                       from temporary t2
                       where t2.level = &x and t2.code like ("&idd" || "%")),
            set weight = (
                       select sum(weight)
                       from temporary t2
                       where t2.level = &x and t2.code like ("&idd" || "%"))   
            where (t1.level = &i and t1.code like "&idd");
        quit;
    %end;
%end;
%mend doit;

基于@ Quentin解决方案的当前代码：

data have;
input ID level code : $10. nodes weight myIndex;
cards;
1    1  1            .   .    .
2    2  11           .   .    .
3    3  111          .   .    .
4    4  1111         .   .    .
5    5  11111        .   .    .
6    6  111111       .   .    .
7   10  1111110000   1   0.1  105.5
8   10  1111119999   1   0.1  109.1
9    6  111112       .   .    .
10  10  1111129999   1   0.5  95.0
11   5  11119        .   .    .
12   6  111190       .   .    .
13  10  1111900000   1   0.1  80.7
14  10  1111901000   1   0.2  105.5
;

data want (drop=_:);

    *hash table of terminal nodes;
    if (_n_ = 1) then do;
        if (0) then set have (rename=(code=_code weight=_weight));
        declare hash h(dataset:'have(where=(level=10) rename=(code=_code weight=_weight myIndex=_myIndex))');
        declare hiter iter('h');
        h.definekey('ID');
        h.definedata('_code','_weight','_myIndex');
        h.definedone();
    end;

    set have;

    *for each non-terminal node, iterate through;
    *hash table of all terminal nodes, looking for children;
    if level ne 10 then do;
        call missing(weight, nodes, myIndex);

        do _n_ = iter.first() by 0 while (_n_ = 0);
            if trim(code) =: _code then do;  
                weight=sum(weight,_weight);
                nodes=sum(nodes,1);
                myIndex=sum(myIndex,_myIndex*_weight);
            end;
            _n_ = iter.next();
        end;
        myIndex=round(myIndex/weight,.1);
    end;
    output;
run;

Answer 1

下面是一个蛮力哈希方法，用于执行与SQL中类似的笛卡尔积。加载终端节点的哈希表。然后读取节点的数据集，并且对于不是终端节点的每个节点，遍历哈希表，识别所有子终端节点。

我认为@joop描述的方法可能更有效，因为这种方法没有利用树结构。所以有很多重新计算。有5000个记录和3000个终端节点，这将进行2000 * 3000比较。但是由于哈希表在内存中，所以可能不会那么慢，所以你不会有过多的I / O ....

data want (drop=_:);

   *hash table of terminal nodes;
   if (_n_ = 1) then do;
      if (0) then set have (rename=(code=_code weight=_weight));
      declare hash h(dataset:'have(where=(level=10) rename=(code=_code weight=_weight))');
      declare hiter iter('h');
      h.definekey('ID');
      h.definedata('_code','_weight');
      h.definedone();
   end;

   set have;

   *for each non-terminal node, iterate through;
   *hash table of all terminal nodes, looking for children;
   if level ne 10 then do;
      call missing(weight, nodes);

      do _n_ = iter.first() by 0 while (_n_ = 0);
         if trim(code) =: _code then do;  
           weight=sum(weight,_weight);
           nodes=sum(nodes,1);
         end;
         _n_ = iter.next();
      end;
   end;
   output;
run;

Answer 2

这是一种替代哈希方法。

不是使用哈希对象进行笛卡尔连接，而是添加节点和从每个级别10节点到6个适用的父节点中的每一个的权重。这可能比Quentin的方法略快，因为没有冗余的哈希查找。

在构造哈希对象时，它需要比Quentin的方法稍长一些，并且使用更多的内存，因为每个终端节点使用不同的密钥添加6次，并且现有的条目通常必须更新，但之后只有每个父节点必须查找自己的个人统计数据，而不是循环遍历所有终端节点，这是一个很大的节省。

加权统计数据也是可能的，但您必须更新两个循环，而不仅仅是第二个循环。

data want;
if 0 then set have;
dcl hash h();
h.definekey('code');
h.definedata('nodes','weight','myIndex');
h.definedone();
length t_code $10;
do until(eof);
  set have(where = (level = 10)) end = eof;
  t_nodes = nodes;
  t_weight = weight;
  t_myindex = weight * myIndex;
  do _n_ = 1 to 6;
    t_code = substr(code,1,_n_);
    if h.find(key:t_code) ne 0 then h.add(key:t_code,data:t_nodes,data:t_weight,data:t_myIndex);
    else do;
      nodes + t_nodes;
      weight + t_weight;
      myIndex + t_myIndex;
      h.replace(key:t_code,data:nodes,data:weight,data:MyIndex);
    end;
  end;
end;
do until(eof2);
  set have end = eof2;
  if level ne 10 then do;
    h.find();
    myIndex = round(MyIndex / Weight,0.1);
  end;
  output;
end;
drop t_:;
run;

Answer 3

看起来很简单。只需加入自己并计算/总和。

proc sql ;
create table want as
 select a.id, a.level, a.code , a.var1, a.var2
      , count(b.id) as nodes
      , sum(b.weight) as weight
 from have a
 left join have b
 on a.code eqt b.code
 and b.level=10
 group by 1,2,3,4,5
 order by 1
;
quit;

如果您不想使用EQT运算符，则可以使用SUBSTR（）函数。

 on a.code = substr(b.code,1,a.level)
 and b.level=10

Answer 4

由于您使用的是SAS，如何使用proc summary来完成繁重的工作？不需要笛卡尔连接！

此选项相对于其他一些选项的一个优点是，如果您想为多个变量计算大量更复杂的统计数据，则可以更容易概括。

data have;
input ID level code : $10. nodes weight myIndex;
format myIndex 5.1;
cards;
1    1  1            .   .    .
2    2  11           .   .    .
3    3  111          .   .    .
4    4  1111         .   .    .
5    5  11111        .   .    .
6    6  111111       .   .    .
7   10  1111110000   1   0.1  105.5
8   10  1111119999   1   0.1  109.1
9    6  111112       .   .    .
10  10  1111129999   1   0.5  95.0
11   5  11119        .   .    .
12   6  111190       .   .    .
13  10  1111900000   1   0.1  80.7
14  10  1111901000   1   0.2  105.5
;
run;


data v_have /view = v_have;
  set have(where = (level = 10));
  array lvl[6] $6;
  do i = 1 to 6;
    lvl[i]=substr(code,1,i);
  end;
  drop i;
run;

proc summary data = v_have;
  class lvl1-lvl6;
  var nodes weight;
  var myIndex /weight = weight;
  ways 1;
  output out = summary(drop = _:) sum(nodes weight)= mean(myIndex)=;
run;

data v_summary /view = v_summary;
  set summary;
  length code $10;
  code = cats(of lvl:);
  drop lvl:;
run;

data have;
  modify have v_summary;
  by code;
  replace;
run;

理论上，散列哈希也可能是一个合适的数据结构，但这对于相对较小的好处来说会非常复杂。无论如何，我可能只是一个学习练习......

Answer 5

一种方法（我认为）是制作笛卡尔积，并找到所有匹配的终端节点＆＃34;到每个节点，然后加权。

类似的东西：

data have;
  input ID level code : $10. nodes weight ;
  cards;
1    1  1            .   .
2    2  11           .   .
3    3  111          .   .
4    4  1111         .   .
5    5  11111        .   .
6    6  111111       .   .
7   10  1111110000   1   0.1
8   10  1111119999   1   0.1
9    6  111112       .   .
10  10  1111129999   1   0.5
11   5  11119        .   .
12   6  111190       .   .
13  10  1111900000   1   0.1
14  10  1111901000   1   0.2
;


proc sql;
  select min(id) as id
       , min(level) as level 
       , a.code
       , count(b.weight) as nodes   /*count of terminal nodes*/
       , sum(b.weight) as weight    /*sum of weights of terminal nodes*/
    from 
      have as a 
     ,(select code , weight
       from have
       where level=10   /*selects terminal nodes*/
       ) as b
    where a.code eqt b.code        /*EQT is equivalent to =: */
    group by a.code
  ;
quit;

我不确定这是否正确，但它会为样本数据提供所需的结果。

Answer 6

这是估计每条记录的父记录所需的SQL。它只使用字符串函数（位置和长度），因此它应该适用于SQL的任何方言，甚至可以适用于SAS。（可能需要将CTE重写为子查询或视图）这个想法是：

将parent_id字段添加到数据集
找到代码最长子串的记录
并使用其id作为parent_id
（之后）更新其总和（节点），直接子项（具有child.parent_id = this.id的那些）的总和（权重）的记录

顺便说一句：我本可以使用LEVEL代替LENGTH(code);这方面的数据有点多余。

WITH sub AS (
        SELECT id, length(code) AS len
        , code
        FROM tree)
UPDATE tree t
SET parent_id = s.id
FROM sub s
WHERE length(t.code) > s.len AND POSITION (s.code IN t.code) = 1
AND NOT EXISTS (
        SELECT *
        FROM sub nx
        WHERE nx.len > s.len AND POSITION (nx.code IN t.code ) = 1
        AND nx.len < length(t.code) AND POSITION (nx.code IN t.code ) = 1
        )
        ;

SELECT * FROM tree
ORDER BY parent_id DESC NULLS LAST
        , id
        ;

找到父母后，整个表格应该（自发地）更新像：

-- PREPARE omg( integer) AS
UPDATE tree  t
SET nodes = s.nodes ,  weight = s.weight
FROM ( SELECT parent_id , SUM(nodes) AS nodes , SUM(weight) AS weight
        FROM tree GROUP BY parent_id) s
WHERE s.parent_id = t.id
        ;

在SAS中，这可能是通过对{0-parent_id，id}进行排序并执行一些保留+求和魔法来完成的。（我的SAS在这方面有点生疏）

更新：如果只有叶节点具有{nodes，weight}的非NULL（非缺失）值，则可以在整个树的一次扫描中完成聚合，而无需先计算parent_id：

UPDATE tree  t
SET nodes = s.nodes ,  weight = s.weight
FROM ( SELECT p.id , SUM(c.nodes) AS nodes , SUM(c.weight) AS weight
        FROM tree p
        JOIN tree c ON c.lev > p.lev AND POSITION (p.code IN c.code ) = 1
        GROUP BY p.id
        ) s
WHERE s.id = t.id
        ;

{lev,code}上的索引可能会加快速度。（假设id为索引）

SAS层次结构和

6 个答案: