Question

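I have a dataset of about 4 million transaction records, grouped by Customer_No (each Customer_No has one or more transactions, identified by a sequential counter). Each transaction has a Type code, and I am only interested in customers who used a specific combination of transaction types. Neither joining the table to itself nor using EXISTS in Proc Sql evaluates the transaction-type criteria efficiently. I suspect a data step using retain and do-loops would process the dataset faster.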

Dataset:

Customer_No Tran_Seq    Tran_Type
0001        1           05
0001        2           12
0002        1           07
0002        2           86
0002        3           04
0003        1           07
0003        2           84
0003        3           84
0003        4           84

The criteria I want to apply:

Every Tran_Type for a Customer_No must be in ('04','05','07','84','86'). If any other Tran_Type is used, drop all transactions for that Customer_No.
The Tran_Types for a Customer_No must include ('84' or '86') and '04'. If this condition is not met, drop all transactions for that Customer_No.

My desired output:

Customer_No Tran_Seq    Tran_Type
0002        1           07
0002        2           86
0002        3           04

Answer 1

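If the data is sorted, a DoW loop solution should be the most efficient. If it is not sorted, the DoW loop will be either the most efficient or comparable but slightly less efficient, depending on the characteristics of the dataset.

I compared it against Dom's solution on a dataset with 3e7 IDs. The DoW loop had a similar (slightly lower) total elapsed time and used less CPU on the unsorted dataset, and it was about 50% faster on the sorted one. It is guaranteed to run in roughly the time it takes to write the dataset out (perhaps a little more, but it should not be much more), plus the sort time if a sort is needed.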

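data want (drop=has_:);
  /* Pass 1: read all rows for one customer and set the flags.          */
  /* The flags are not retained, so they reset to missing per customer. */
  do _n_ = 1 by 1 until (last.customer_no);
    set have;
    by customer_no;
    if tran_type in ('84','86')
      then has_8486 = 1;
    else if tran_type in ('04')
      then has_04 = 1;
    else if not (tran_type in ('04','05','07','84','86'))
      then has_other = 1;
  end;
  /* Pass 2: re-read the same customer's rows and output them if they qualify. */
  do _n_ = 1 by 1 until (last.customer_no);
    set have;
    by customer_no;
    if has_8486 and has_04 and not has_other then output;
  end;
run;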
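
The BY statement in the DoW step above expects have to be ordered by customer_no. If it is not, a sort along the following lines would be needed first. This is only a sketch: including tran_seq as a second key is an assumption made here to keep each customer's transactions in sequence, and the step overwrites have in place.

/* Sort in place so BY customer_no processing works.        */
/* tran_seq as a second key is an assumption, not required. */
proc sort data=have;
  by customer_no tran_seq;
run;

This is the sort time that has to be added to the DoW run time when the input does not already arrive in customer order.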

Answer 2

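I don't think this is very complicated. Join to a subquery that groups by Customer_No, and put the conditions in the having clause. A condition inside the min function must be true for every row of the group, while a condition inside the max function must be true for at least one row: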

proc sql;
create table want as
select
  h.*
from
  have h
  inner join (
    select
      Customer_No
    from
      have
    group by
      Customer_No
    having
      min(Tran_Type in('04','05','07','84','86')) and
      max(Tran_Type in('84','86')) and
      max(Tran_Type eq '04')) h2
  on h.Customer_No = h2.Customer_No
;
quit;
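
To see why min and max work here, it may help to print the three aggregated flags per customer. The query below is a sketch for inspection only; the column aliases all_allowed, any_8486 and any_04 are names made up for this illustration. In SAS a comparison evaluates to 1 (true) or 0 (false), so min is 1 only when every row passes and max is 1 when at least one row passes.

proc sql;
  /* One row per customer, showing the value of each having-clause term */
  select
    Customer_No,
    min(Tran_Type in('04','05','07','84','86')) as all_allowed,
    max(Tran_Type in('84','86')) as any_8486,
    max(Tran_Type eq '04') as any_04
  from
    have
  group by
    Customer_No
  ;
quit;

On the sample data only customer 0002 has all three flags equal to 1. Customer 0001 fails all_allowed because of type 12, and customer 0003 fails any_04.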

Answer 3

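I must have been joining incorrectly. After a rewrite, Proc Sql completed in under 30 seconds (on the original dataset of 4.9 million records). It is not particularly elegant code, though, so I would still appreciate any improvements or alternative approaches.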

data Have;
input Customer_No $ Tran_Seq $ Tran_Type:$2.;
cards;
    0001        1           05
    0001        2           12
    0002        1           07
    0002        2           86
    0002        3           04
    0003        1           07
    0003        2           84
    0003        3           84
    0003        4           84
;
run;

Proc sql;
Create table Want as
select t1.* from Have t1
LEFT JOIN (select DISTINCT Customer_No from Have
                    where Tran_Type not in ('04','05','07','84','86')
                                  ) t2
ON(t1.Customer_No=t2.Customer_No)
INNER JOIN (select DISTINCT Customer_No from Have
                    where Tran_Type in ('84','86')
                                ) t3
ON(t1.Customer_No=t3.Customer_No)
INNER JOIN (select DISTINCT Customer_No from Have
                    where Tran_Type in ('04')
                                ) t4
ON(t1.Customer_No=t4.Customer_No)
Where t2.Customer_No is null
;Quit;

Answer 4

I would offer an SQL solution that is slightly more complex than @naed555's, using the INTERSECT operator.

proc sql noprint;

create table to_keep as
(
    select distinct customer_no
    from have 
    where tran_type in ('84','86')

    INTERSECT

    select distinct customer_no
    from have 
    where tran_type in ('04')
)

EXCEPT

    select distinct customer_no
    from have
    where tran_type not in ('04','05','07','84','86')
;

create table want as
select a.*
from have as a
inner join 
     to_keep as b
on a.customer_no = b.customer_no;

quit;

Drop all observations by ID where the criteria are not met
