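Delete all observations for an ID when a condition is not met

Asked: 2015-03-14 02:25:17

Tags: sas

I have a dataset of about 4 million transaction records, grouped by Customer_No (each Customer_No has one or more transactions, indicated by a sequence counter). Each transaction has a Type code, and I am only interested in customers who have used a specific combination of transaction types. Neither joining the table to itself nor using EXISTS in Proc Sql evaluated the transaction type criteria efficiently. I suspect a data step using retain and do-loops would process the dataset faster.

Dataset: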

Customer_No Tran_Seq    Tran_Type
    0001        1           05
    0001        2           12
    0002        1           07
    0002        2           86
    0002        3           04
    0003        1           07
    0003        2           84
    0003        3           84
    0003        4           84

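The criteria I want to apply:

  1. Every Tran_Type for a Customer_No must be in ('04','05','07','84','86'). If any other Tran_Type is used, drop all transactions for that Customer_No.

  2. The Tran_Types for a Customer_No must include ('84' or '86') and '04'. If this condition is not met, drop all transactions for that Customer_No.

Desired output: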

    Customer_No Tran_Seq    Tran_Type
    0002        1           07
    0002        2           86
    0002        3           04  
    

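4 answers:

Answer 0 (score: 2)

If the data are sorted, a DoW loop solution should be the most efficient. If they are not sorted, it will be either the most efficient or similar but slightly less efficient, depending on the circumstances of the dataset.

I compared it with Dom's solution on a dataset with 3e7 IDs. The DoW loop had a similar (slightly shorter) total run time with less CPU on the unsorted dataset, and was about 50% faster on the sorted one. It is guaranteed to run in roughly the time it takes to write out the dataset (perhaps a little more, but not much), plus the sort time if a sort is needed.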

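data want (drop=has_:);  /* keep the helper flags out of the output */
  /* Pass 1: read one customer's rows and flag which transaction types occur. */
  /* Variables not read from HAVE are reset to missing at the top of each     */
  /* data step iteration, so the flags start fresh for every customer.        */
  do _n_ = 1 by 1 until (last.customer_no);
    set have;
    by customer_no;
    if tran_type in ('84','86')
      then has_8486 = 1;
    else if tran_type in ('04')
      then has_04 = 1;
    else if not (tran_type in ('04','05','07','84','86'))
      then has_other = 1;
  end;
  /* Pass 2: re-read the same rows and output them only if the customer qualifies. */
  do _n_ = 1 by 1 until (last.customer_no);
    set have;
    by customer_no;
    if has_8486 and has_04 and not has_other then output;
  end;
run;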
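
The BY statement in the step above needs HAVE to be sorted by customer_no. A minimal sketch of the preparation, assuming the dataset is named have as in the answer and that sorting it in place is acceptable. The FULLSTIMER option is added only on the assumption that you want real and CPU time written to the log so you can compare approaches yourself.

```sas
/* Write detailed timing (real time, CPU time, memory) for each step to the log. */
options fullstimer;

/* Sort in place so that BY customer_no works in the DoW step. */
/* tran_seq is included only to keep each customer's rows in sequence order. */
proc sort data=have;
  by customer_no tran_seq;
run;
```

If the rows are already grouped by customer but the groups are not in ascending order, `by customer_no notsorted;` in both loops may let you skip the sort; that is a suggestion to verify against your own data, not something the answer tested.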

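Answer 1 (score: 1)

I don't think this is that complicated. Join to a subquery that groups by Customer_No, and put the conditions in the having clause. A condition wrapped in min must be true for every row of the customer, while a condition wrapped in max must be true for at least one row: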

proc sql;
create table want as
select
  h.*
from
  have h
  inner join (
    select
      Customer_No
    from
      have
    group by
      Customer_No
    having
      min(Tran_Type in('04','05','07','84','86')) and
      max(Tran_Type in('84','86')) and
      max(Tran_Type eq '04')) h2
  on h.Customer_No = h2.Customer_No
;
quit;
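
To see why min and max behave as "all rows" and "any row" here, it can help to print the three aggregated conditions per customer. A minimal sketch, assuming the have dataset from the question; the column aliases all_allowed, has_8486 and has_04 are names made up for this illustration. In SAS a comparison evaluates to 1 (true) or 0 (false), so min is 1 only when every row passes and max is 1 when at least one row passes.

```sas
proc sql;
  /* One row per customer with the three flags used in the HAVING clause. */
  select
    Customer_No,
    min(Tran_Type in ('04','05','07','84','86')) as all_allowed,
    max(Tran_Type in ('84','86'))                as has_8486,
    max(Tran_Type eq '04')                       as has_04
  from
    have
  group by
    Customer_No
  ;
quit;
```

On the sample data, customer 0001 gets all_allowed = 0 because of type 12, customer 0003 gets has_04 = 0, and only customer 0002 has all three flags equal to 1, which matches the desired output.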

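Answer 2 (score: 0)

I must have had the join wrong. After rewriting it, the Proc Sql step completed in under 30 seconds (on the original dataset of 4.9 million records). It is not particularly elegant code though, so I would still appreciate any improvements or alternative approaches.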

data Have;
input Customer_No $ Tran_Seq $ Tran_Type:$2.;
cards;
    0001        1           05
    0001        2           12
    0002        1           07
    0002        2           86
    0002        3           04
    0003        1           07
    0003        2           84
    0003        3           84
    0003        4           84
;
run;

Proc sql;
Create table Want as
select t1.* from Have t1
LEFT JOIN (select DISTINCT Customer_No from Have
                    where Tran_Type not in ('04','05','07','84','86')
                                  ) t2
ON(t1.Customer_No=t2.Customer_No)
INNER JOIN (select DISTINCT Customer_No from Have
                    where Tran_Type in ('84','86')
                                ) t3
ON(t1.Customer_No=t3.Customer_No)
INNER JOIN (select DISTINCT Customer_No from Have
                    where Tran_Type in ('04')
                                ) t4
ON(t1.Customer_No=t4.Customer_No)
Where t2.Customer_No is null
;Quit;

Answer 3 (score: 0)

I would offer a SQL alternative to @naed555's solution, using the INTERSECT operator.

proc sql noprint;

create table to_keep as
(
    select distinct customer_no
    from have 
    where tran_type in ('84','86')

    INTERSECT

    select distinct customer_no
    from have 
    where tran_type in ('04')
)

EXCEPT

    select distinct customer_no
    from have
    where tran_type not in ('04','05','07','84','86')
;

create table want as
select a.*
from have as a
inner join 
     to_keep as b
on a.customer_no = b.customer_no;

quit;