Eliminating pairs of observations under a condition, where an observation can have several possible partner observations

Time: 2013-10-16 14:34:46

Tags: algorithm sas matching

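In my current project we have had to implement matching under varying conditions several times. First, a more detailed description of the problem.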

We are given a table test:
key value
1 10
1 -10
1 10
1 20
1 -10
1 10
2 10
2 -10

Now we want to apply a rule that eliminates pairs of observations within a group (defined by the value of key) whose values sum to 0.

The expected result would be:
key value
1 10
1 20

The sort order does not matter.
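To pin down the rule, here is a minimal reference sketch in Python (not SAS). It assumes the rows are plain (key, value) tuples; the function name eliminate_pairs is made up for illustration. Each value is cancelled against a still-unmatched opposite value in the same key group.

```python
from collections import Counter

def eliminate_pairs(rows):
    """Cancel each value against an opposite value within the same key."""
    pending = Counter()  # (key, value) -> number of rows not yet cancelled
    for key, value in rows:
        if pending[(key, -value)] > 0:
            pending[(key, -value)] -= 1  # partner found: both rows vanish
        else:
            pending[(key, value)] += 1   # no partner yet: keep the row for now
    return sorted(pending.elements())    # surviving rows, one entry per row

rows = [(1, 10), (1, -10), (1, 10), (1, 20),
        (1, -10), (1, 10), (2, 10), (2, -10)]
result = eliminate_pairs(rows)
print(result)  # [(1, 10), (1, 20)]
```

On the table above this leaves exactly the two rows listed as the expected result.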

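The following code is an example of our solution. We want to eliminate the observations with my_id 2 and 7, and additionally 2 of the 3 observations with an amount of 10.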

data test;
input my_id alias $ amount;
datalines4;
1 aaa 10
2 aaa -10
3 aaa 8000
4 aaa -16000
5 aaa 700
6 aaa 10
7 aaa -10
8 aaa 10
;;;;
run;

/* get all possible matches represented by pairs of my_id */
proc sql noprint;
  create table zwischen_erg as
  select a.my_id as a_id,
         b.my_id as b_id
  from test as a inner join
       test as b on (a.alias=b.alias) 
  where a.amount=-b.amount;
quit;

/* select ids of matches to eliminate */
proc sort data=zwischen_erg ;
  by a_id b_id;
run;

data zwischen_erg1;
  set zwischen_erg;
  by a_id;

  if first.a_id then tmp_id1 = 0;
  tmp_id1 +1;
run;


proc sort data=zwischen_erg;
  by b_id a_id;
run;

data zwischen_erg2;
  set zwischen_erg;
  by b_id;

  if first.b_id then tmp_id2 = 0;
  tmp_id2 +1;
run;

proc sql;
  create table delete_ids as 
  select erg1.a_id as my_id
  from zwischen_erg1 as erg1 left join 
       zwischen_erg2 as erg2 on 
                   (erg1.a_id = erg2.a_id and 
                    erg1.b_id = erg2.b_id)
  where tmp_id1 = tmp_id2
;
quit;

/* use delete_ids as filter */
proc sql noprint;
  create table erg as
  select a.*
  from test as a left join
       delete_ids as b on (a.my_id = b.my_id) 
  where b.my_id=.;
quit;

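The algorithm seems to work; at least nobody has found input data that leads to an error. But nobody can explain to me why it works, and I do not understand how it works.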

So I have a few questions.

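  1. Does this algorithm eliminate the pairs correctly for all possible combinations of input data?
  2. If it does work correctly, how does the algorithm work in detail? Especially the part
    where tmp_id1 = tmp_id2.
  3. Is there a better algorithm to eliminate the corresponding pairs?

Thanks in advance and happy coding, Michael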

2 answers:

Answer 0: (score: 1)

As an answer to your third question: the following approach seems simpler to me, and it probably performs better too (since I do not use a join).

/*For every (absolute) value, find how many more positive/negative occurrences we have per key*/
proc sql;
    create view V_INTERMEDIATE_VIEW as
    select key, abs(Value) as Value_abs, sum(sign(value)) as balance
    from INPUT_DATA
    group by key, Value_abs
    ;
quit;

*The balance variable here means how many times more often did we see the positive than the negative of this value. I.e., how many of either the positive or the negative were we not able to eliminate;

/*Now output*/
data OUTPUT_DATA (keep=key Value);
    set V_INTERMEDIATE_VIEW;
    Value = sign(balance)*Value_abs; *Put the correct value back;

    do i=1 to abs(balance) by 1;
        output;
    end;
run;
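For readers who do not have SAS at hand, the balance idea can be sketched in a few lines of Python. This is an illustration under assumptions, not a translation of the code above: rows are (key, value) tuples and the name net_rows is hypothetical.

```python
from collections import defaultdict

def net_rows(rows):
    """Keep, per key and absolute value, only the surplus of one sign."""
    balance = defaultdict(int)  # (key, abs(value)) -> #positives - #negatives
    for key, value in rows:
        # (value > 0) - (value < 0) is the sign: 1, -1, or 0 for a zero value
        balance[(key, abs(value))] += (value > 0) - (value < 0)

    out = []
    for (key, value_abs), bal in sorted(balance.items()):
        sign = (bal > 0) - (bal < 0)
        # emit the surplus value once per unmatched occurrence;
        # zero values always have balance 0 and are therefore dropped
        out.extend([(key, sign * value_abs)] * abs(bal))
    return out

rows = [(1, 10), (1, -10), (1, 10), (1, 20),
        (1, -10), (1, 10), (2, 10), (2, -10)]
print(net_rows(rows))  # [(1, 10), (1, 20)]
```

Note that the output rows are rebuilt from the balance, so any other columns of the input (such as an id) would not survive this approach.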




If you want pure SAS only (so no proc sql), you can do it as follows. Note that the idea behind it stays the same.

data V_INTERMEDIATE_VIEW /view=V_INTERMEDIATE_VIEW;
    set INPUT_DATA;
    value_abs = abs(value);
run;
proc sort data=V_INTERMEDIATE_VIEW out=INTERMEDIATE_DATA;
    by key value_abs; *puts each value and its negative into the same BY group, in no particular order;
run;

data OUTPUT_DATA (keep=key value);
    set INTERMEDIATE_DATA;
    by key value_abs;

    retain balance 0;
    balance = sum(balance,sign(value));

    if last.value_abs then do;
        value = sign(balance)*value_abs; *set sign depending on what we have in excess;            
        do i=1 to abs(balance) by 1;
            output;
        end;

        balance=0; *reset balance for next value_abs;
    end;
run;
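The sort-then-scan variant has the same shape in Python: sort by key and absolute value, then walk each group once while keeping a running balance. Again a sketch only, assuming (key, value) tuples; net_rows_sorted is a made-up name.

```python
from itertools import groupby

def group_key(row):
    """Sort and grouping key: (key, absolute value)."""
    return (row[0], abs(row[1]))

def net_rows_sorted(rows):
    out = []
    # groupby needs sorted input, just like BY-group processing in a data step
    for (key, value_abs), group in groupby(sorted(rows, key=group_key), key=group_key):
        balance = 0
        for _, value in group:
            balance += (value > 0) - (value < 0)  # running sign count
        # end of the group: emit whatever is in excess
        sign = (balance > 0) - (balance < 0)
        out.extend([(key, sign * value_abs)] * abs(balance))
    return out

rows = [(1, 10), (1, -10), (1, 10), (1, 20),
        (1, -10), (1, 10), (2, 10), (2, -10)]
print(net_rows_sorted(rows))  # [(1, 10), (1, 20)]
```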

Note: thanks to Joe for some useful performance suggestions.

Answer 1: (score: 0)

After a quick read I do not see any errors. But "zwischen_erg" can contain a lot of unnecessary many-to-many matches, which is inefficient.
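To see both the pair explosion and what the tmp_id1 = tmp_id2 condition does, the question's steps can be traced in Python. This is my reading of the code, sketched under the assumption that my_id is unique; rank_within is a hypothetical helper that mimics the first.x / counter+1 pattern.

```python
from collections import defaultdict

test = [(1, "aaa", 10), (2, "aaa", -10), (3, "aaa", 8000), (4, "aaa", -16000),
        (5, "aaa", 700), (6, "aaa", 10), (7, "aaa", -10), (8, "aaa", 10)]

# Self-join: every (a_id, b_id) with the same alias and opposite amounts.
pairs = [(a[0], b[0]) for a in test for b in test
         if a[1] == b[1] and a[2] == -b[2]]

def rank_within(pairs, group_col, order_col):
    """Number the pairs 1, 2, ... inside each group, in sorted order."""
    ranks, counter = {}, defaultdict(int)
    for pair in sorted(pairs, key=lambda p: (p[group_col], p[order_col])):
        counter[pair[group_col]] += 1
        ranks[pair] = counter[pair[group_col]]
    return ranks

tmp_id1 = rank_within(pairs, 0, 1)  # position of b_id among a_id's partners
tmp_id2 = rank_within(pairs, 1, 0)  # position of a_id among b_id's partners

delete_ids = sorted({a for (a, b) in pairs if tmp_id1[(a, b)] == tmp_id2[(a, b)]})
kept = [row[0] for row in test if row[0] not in delete_ids]
print(len(pairs), delete_ids, kept)  # 12 [1, 2, 6, 7] [3, 4, 5, 8]
```

Three rows with 10 and two with -10 already produce 12 pairs (each pair appears in both directions). In this trace, the two counters are equal exactly when the k-th row of one sign meets the k-th row of the opposite sign, so the third 10 finds no partner and survives.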

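The following seems to work (no guarantees) and is probably more efficient. It is also shorter, so it may be easier to see what is going on.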

data test;
input my_id alias $ amount;
datalines4;
1 aaa 10
2 aaa -10
3 aaa 8000
4 aaa -16000
5 aaa 700
6 aaa 10
7 aaa -10
8 aaa 10
;;;;
run;

proc sort data=test;
    by alias amount;
run;

data zwischen_erg;
    set test;
    by alias amount;
    if first.amount then occurrence = 0;
    occurrence+1;
run;

proc sql;
    create table zwischen as
    select
        a.my_id,
        a.alias,
        a.amount
    from zwischen_erg as a
    left join zwischen_erg as b
    on a.alias = b.alias and a.amount = (-1)*b.amount and a.occurrence = b.occurrence
    where b.my_id is missing;
quit;
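The occurrence idea can be checked outside SAS as well. Below is a small Python sketch, assuming rows are (my_id, alias, amount) tuples with unique my_id; the function name unmatched is made up. Unlike a comparison on amount and occurrence alone, the sketch also compares alias, which is my assumption about the intended behaviour when several aliases are present.

```python
from collections import defaultdict

def unmatched(test):
    """Keep rows whose n-th occurrence has no n-th occurrence of the opposite amount."""
    occurrence, counter = {}, defaultdict(int)
    # number the rows 1, 2, ... within each (alias, amount) group
    for my_id, alias, amount in sorted(test, key=lambda r: (r[1], r[2])):
        counter[(alias, amount)] += 1
        occurrence[my_id] = counter[(alias, amount)]

    taken = {(alias, amount, occurrence[my_id]) for my_id, alias, amount in test}
    # anti-join: drop a row if the opposite amount exists with the same number
    return [row for row in test
            if (row[1], -row[2], occurrence[row[0]]) not in taken]

test = [(1, "aaa", 10), (2, "aaa", -10), (3, "aaa", 8000), (4, "aaa", -16000),
        (5, "aaa", 700), (6, "aaa", 10), (7, "aaa", -10), (8, "aaa", 10)]
print([row[0] for row in unmatched(test)])  # [3, 4, 5, 8]
```

Each row is compared against at most one candidate partner here, so the intermediate result never grows beyond the size of the input.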