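I have a dataset of roughly 4 million transaction records, grouped by Customer_No (each Customer_No has one or more transactions, numbered by a sequential counter). Each transaction has a Type code, and I am only interested in customers who have used a particular combination of transaction types. Neither joining the table to itself nor using EXISTS in Proc Sql evaluates the transaction type criteria efficiently. I suspect a data step using retain and do-loops would process the dataset faster.
Dataset: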
Customer_No Tran_Seq Tran_Type
0001 1 05
0001 2 12
0002 1 07
0002 2 86
0002 3 04
0003 1 07
0003 2 84
0003 3 84
0003 4 84
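The criteria I would like to apply:
Every Tran_Type for a Customer_No must be in ('04','05','07','84','86'); if any other Tran_Type is used, drop all transactions for that Customer_No.
The Tran_Types for a Customer_No must include ('84' or '86') and '04'; if this condition is not met, drop all transactions for that Customer_No.
The output I would like: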
Customer_No Tran_Seq Tran_Type
0002 1 07
0002 2 86
0002 3 04
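Answer 0 (score: 2)
If the data is sorted, a DoW loop solution should be the most efficient. If it is not sorted, it will either still be the most efficient or be similar but slightly less efficient, depending on the circumstances of your dataset.
I compared it with Dom's solution on a 3e7 ID dataset. The DoW loop had a similar (slightly shorter) total run time with less CPU on the unsorted dataset, and was about 50% faster on the sorted one. It is guaranteed to run in roughly the time it takes to write the dataset out (perhaps a little more, but not much), plus sort time if a sort is needed.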
data want;
    /* First pass: scan one customer's rows and set the flags */
    do _n_ = 1 by 1 until (last.customer_no);
        set have;
        by customer_no;
        if tran_type in ('84','86')
            then has_8486 = 1;
        else if tran_type in ('04')
            then has_04 = 1;
        else if not (tran_type in ('04','05','07','84','86'))
            then has_other = 1;
    end;
    /* Second pass: re-read the same rows and output them if the customer qualifies */
    do _n_ = 1 by 1 until (last.customer_no);
        set have;
        by customer_no;
        if has_8486 and has_04 and not has_other then output;
    end;
run;
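The DoW loop depends on `by customer_no`, so `have` has to be sorted (or indexed) by that variable before the step runs. A minimal sketch of the sort, assuming the dataset is named `have` as in the answer. Adding `tran_seq` as a secondary key is my own assumption, made only so the rows come out in sequence order.

```sas
/* Sort in place so BY-group processing works in the DoW step. */
/* tran_seq as a secondary key is an assumption, not part of the answer. */
proc sort data=have;
    by customer_no tran_seq;
run;
```

If the data already arrives in Customer_No order, this step can be skipped, which is where the timing advantage described above comes from.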
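Answer 1 (score: 1)
I don't think this is that complicated. Join to a subquery that groups by Customer_No, and put your conditions in its having clause. A condition inside the min function must be true for every row, whereas a condition inside the max function only has to be true for at least one row: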
proc sql;
    create table want as
    select
        h.*
    from
        have h
        inner join (
            select
                Customer_No
            from
                have
            group by
                Customer_No
            having
                min(Tran_Type in('04','05','07','84','86')) and
                max(Tran_Type in('84','86')) and
                max(Tran_Type eq '04')) h2
        on h.Customer_No = h2.Customer_No
    ;
quit;
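The min/max trick works because a comparison in Proc Sql evaluates to 1 (true) or 0 (false), so min acts as "all rows" and max as "any row" within a group. As a rough illustration, assuming the `have` dataset from the question, the three conditions can be selected as columns to see how each customer scores. The column names `all_allowed`, `has_8486` and `has_04` are made up for this sketch.

```sas
proc sql;
    /* One row per customer; each flag is 1 or 0 */
    select
        Customer_No,
        min(Tran_Type in ('04','05','07','84','86')) as all_allowed,
        max(Tran_Type in ('84','86'))                as has_8486,
        max(Tran_Type eq '04')                       as has_04
    from
        have
    group by
        Customer_No
    ;
quit;
/* On the sample data:                                   */
/*   0001: all_allowed=0, has_8486=0, has_04=0 (has 12)  */
/*   0002: all_allowed=1, has_8486=1, has_04=1 (kept)    */
/*   0003: all_allowed=1, has_8486=1, has_04=0 (no 04)   */
```

Only a customer with all three flags equal to 1 passes the having clause, which is why just 0002 survives.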
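Answer 2 (score: 0)
I must have made a mistake in my join. After rewriting it, the Proc Sql completed in under 30 seconds (on the original 4.9 million record dataset). It is not particularly elegant code, though, so I would still appreciate any improvements or alternative methods.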
data Have;
    input Customer_No $ Tran_Seq $ Tran_Type:$2.;
    cards;
0001 1 05
0001 2 12
0002 1 07
0002 2 86
0002 3 04
0003 1 07
0003 2 84
0003 3 84
0003 4 84
;
run;
Proc sql;
    Create table Want as
    select t1.* from Have t1
    LEFT JOIN (select DISTINCT Customer_No from Have
               where Tran_Type not in ('04','05','07','84','86')
              ) t2
        ON (t1.Customer_No=t2.Customer_No)
    INNER JOIN (select DISTINCT Customer_No from Have
                where Tran_Type in ('84','86')
               ) t3
        ON (t1.Customer_No=t3.Customer_No)
    INNER JOIN (select DISTINCT Customer_No from Have
                where Tran_Type in ('04')
               ) t4
        ON (t1.Customer_No=t4.Customer_No)
    Where t2.Customer_No is null
    ;
Quit;
Answer 3 (score: 0)
I would offer a SQL solution that is slightly more complex than @naed555's, using the INTERSECT operator.
proc sql noprint;
    create table to_keep as
    (
        select distinct customer_no
        from have
        where tran_type in ('84','86')
        INTERSECT
        select distinct customer_no
        from have
        where tran_type in ('04')
    )
    EXCEPT
    select distinct customer_no
    from have
    where tran_type not in ('04','05','07','84','86')
    ;
    create table want as
    select a.*
    from have as a
    inner join
        to_keep as b
    on a.customer_no = b.customer_no;
quit;