I'm using SAS, and I have a dataset like this:
Table 1:
+------+------------+-----------+--------+
| name | date | time | price |
+------+------------+-----------+--------+
| A | 7-May-08 | 11:12:41 | 1 |
| A | 11-Jul-08 | 11:23:41 | 2 |
| A | 3-Jan-09 | 11:31:41 | 1 |
| A | 4-Jan-09 | 11:32:41 | 2 |
| A | 4-Jan-09 | 11:32:41 | 2 |
| A | 8-Jul-09 | 11:32:41 | 1 |
| A | 8-Jul-09 | 11:32:41 | 2 |
| A | 24-Jul-09 | 11:32:41 | 3 |
| A | 24-Jul-09 | 11:32:41 | 4 |
| A | 8-Dec-09 | 12:32:41 | 1 |
| B | 7-May-08 | 11:31:41 | 2 |
| B | 10-May-08 | 11:32:41 | 3 |
| B | 17-May-08 | 11:33:41 | 4 |
| B | 24-May-08 | 11:34:41 | 1 |
| B | 1-Jun-08 | 11:35:41 | 5 |
| B | 18-Jun-08 | 11:36:41 | 1 |
| B | 9-May-09 | 11:37:41 | 3 |
| C | 7-Oct-09 | 11:21:41 | 3 |
| C | 17-Oct-09 | 11:22:41 | 2 |
| C | 25-Oct-09 | 11:32:41 | 1 |
| C | 18-Nov-09 | 11:33:41 | 3 |
| C | 4-Dec-09 | 11:12:41 | 4 |
| C | 19-Dec-09 | 10:22:41 | 1 |
| C | 9-May-10 | 11:42:41 | 3 |
| C | 9-May-10 | 11:12:41 | 1 |
| C | 10-May-10 | 12:52:41 | 2 |
+------+------------+-----------+--------+
I have another dataset like this:
Table 2:
+------+-----------+
| name | date |
+------+-----------+
| A | 11-Jul-08 |
| A | 3-Jan-09 |
| A | 24-Jul-09 |
| B | 7-May-08 |
| B | 17-May-08 |
| B | 18-Jun-08 |
| B | 9-Jul-09 |
| C | 17-Oct-09 |
| C | 4-Dec-09 |
| C | 19-Dec-09 |
+------+-----------+
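Now I want to do two main operations:
1- If a name and date from table2 appear in table1, delete that row from table1;
2- If the previous step happened, also delete the row that follows that name and date; if that next row's name and date are repeated in further following rows, delete all of those rows as well.
For example, table1 should end up like this: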
+------+-----------+----------+-------+
| name | date | time | price |
+------+-----------+----------+-------+
| A | 7-May-08 | 11:12:41 | 1 |
| A | 8-Jul-09 | 11:32:41 | 1 |
| A | 8-Jul-09 | 11:32:41 | 2 |
| B | 1-Jun-08 | 11:35:41 | 5 |
| C | 7-Oct-09 | 11:21:41 | 3 |
| C | 18-Nov-09 | 11:33:41 | 3 |
| C | 10-May-10 | 12:52:41 | 2 |
+------+-----------+----------+-------+
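Here is code that is not suitable for this operation, for two reasons:
1- Using the nodupkey option removes all duplicate observations from table1, which is unnecessary, because they are deleted anyway when the conditions above are met.
2- The "(inb=0 and lag(inb)=1 and not first.name)" condition only deletes the next row; any further rows with the same name and date stay in table1.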
proc sort data=table1 out=tablea1 nodupkey;
  by name date;
run;
proc sort data=table2 out=tableb1 nodupkey;
  by name date;
run;
data want;
  merge tablea1 tableb1(in=inb) ;
  by name date;
  if inb or (inb=0 and lag(inb)=1 and not first.name) then delete;
run;
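Thanks in advance.
Answer 0: (score: 1)
Amin:
In complex merge-and-process operations you will need some additional variables to maintain the state of the business rules. The case of deleting the row that follows a match, along with its duplicates, requires tracking the next name and date.
For example: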
data have;
  input @;
  if _infile_ ne: '+';
  attrib
    name  length=$10
    date  length=4 informat=date9. format=date11.
    time  length=4 informat=time8. format=time8.
    price length=8
  ;
  infile cards dlm='|' firstobs=4;
  input @1 name date time price;
datalines;
+------+------------+-----------+--------+
| name | date | time | price |
+------+------------+-----------+--------+
| A | 7-May-08 | 11:12:41 | 1 |
| A | 11-Jul-08 | 11:23:41 | 2 |
| A | 3-Jan-09 | 11:31:41 | 1 |
| A | 4-Jan-09 | 11:32:41 | 2 |
| A | 4-Jan-09 | 11:32:41 | 2 |
| A | 8-Jul-09 | 11:32:41 | 1 |
| A | 8-Jul-09 | 11:32:41 | 2 |
| A | 24-Jul-09 | 11:32:41 | 3 |
| A | 24-Jul-09 | 11:32:41 | 4 |
| A | 8-Dec-09 | 12:32:41 | 1 |
| B | 7-May-08 | 11:31:41 | 2 |
| B | 10-May-08 | 11:32:41 | 3 |
| B | 17-May-08 | 11:33:41 | 4 |
| B | 24-May-08 | 11:34:41 | 1 |
| B | 1-Jun-08 | 11:35:41 | 5 |
| B | 18-Jun-08 | 11:36:41 | 1 |
| B | 9-May-09 | 11:37:41 | 3 |
| C | 7-Oct-09 | 11:21:41 | 3 |
| C | 17-Oct-09 | 11:22:41 | 2 |
| C | 25-Oct-09 | 11:32:41 | 1 |
| C | 18-Nov-09 | 11:33:41 | 3 |
| C | 4-Dec-09 | 11:12:41 | 4 |
| C | 19-Dec-09 | 10:22:41 | 1 |
| C | 9-May-10 | 11:42:41 | 3 |
| C | 9-May-10 | 11:12:41 | 1 |
| C | 10-May-10 | 12:52:41 | 2 |
+------+------------+-----------+--------+
;
data filter;
  input @;
  if _infile_ ne: '+';
  attrib
    name length=$10
    date length=4 informat=date9. format=date11.
  ;
  infile cards dlm='|' firstobs=4;
  input @1 name date;
datalines;
+------+-----------+
| name | date |
+------+-----------+
| A | 11-Jul-08 |
| A | 3-Jan-09 |
| A | 24-Jul-09 |
| B | 7-May-08 |
| B | 17-May-08 |
| B | 18-Jun-08 |
| B | 9-Jul-09 |
| C | 17-Oct-09 |
| C | 4-Dec-09 |
| C | 19-Dec-09 |
+------+-----------+
;
run;
data want (keep=name date time price);
  merge have(in=_have) filter(in=_filter);
  by name date;
  length match_at_n 4 next_name $10 next_date 4;
  retain match_at_n next_name next_date;
  if first.name then /* prevent delete next from sloshing into next group */
    match_at_n = -1;
  /* rule 1: row is in both tables; remember the iteration of the match */
  if _have and _filter then do;
    match_at_n = _n_;
    delete;
  end;
  /* filter row with no counterpart in have */
  if _filter then
    delete;
  * condition here is _have and not _filter;
  /* rule 2: the row immediately after a match */
  if _n_ = match_at_n + 1 then do;
    next_name = name;
    next_date = date;
    delete;
  end;
  /* rule 2, continued: duplicates of that next row */
  if name = next_name and date = next_date then
    delete;
run;
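Presumably the same result could be achieved with a single complex compound if statement involving various lags, flags and sums. Regardless, I tend to favor clarity over cleverness.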
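One way to avoid depending on _n_ is to carry the state per name/date group instead. The following is only a sketch of that variation, not the code above. It assumes the have and filter datasets built earlier, both in name/date order; the names want_alt, _drop and _prev_match are invented here for illustration.

```sas
/* Sketch: decide once per name/date group whether the group is dropped. */
data want_alt (keep=name date time price);
  merge have(in=_have) filter(in=_filter);
  by name date;
  retain _drop _prev_match;

  /* a match must not carry over into the next name */
  if first.name then _prev_match = 0;

  if first.date then do;
    /* drop the group if it is in filter, or if the previous
       group of have rows for this name was a match */
    _drop = (_filter or _prev_match);
    /* filter-only groups leave the pending state untouched */
    if _have then _prev_match = _filter;
  end;

  /* keep rows that come from have and are not in a dropped group */
  if _have and not _drop;
run;
```

Because the decision is made at first.date, every duplicate row in a group shares the same fate without any comparison against a saved name and date. On the sample data this should leave the same seven rows as the step above.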
Answer 1: (score: 0)
Based on Ksharp's code from the SAS community:
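/* interleave table2 and table1 and number each name/date group */
data temp;
  set table2(in=inb) table1;
  by name date;
  group+first.date;
  _inb=inb;
run;
/* for each table2 row, flag its own group and the one after it */
data key;
  set temp(where=(_inb=1));
  output;
  group=group+1;
  output;
  keep name group;
run;
/* keep only rows whose name/group combination was not flagged */
proc sql;
  create table want as
  select name, date, time, price
  from temp
  where catx(' ',name,group) not in
        (select catx(' ',name,group) from key);
quit;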
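The group+first.date line is a sum statement: the variable starts at 0, is retained across rows, and grows by 1 on the first row of every name/date group. A minimal sketch of that idiom on its own, assuming table1 is already sorted by name and date (the output name groups is made up for illustration):

```sas
/* Sketch: number the name/date groups of table1. */
data groups;
  set table1;
  by name date;
  /* first.date is 1 on the first row of each name/date group, else 0, */
  /* so duplicate rows share one group number                          */
  group + first.date;
run;

proc print data=groups;
run;
```

Since consecutive date groups get consecutive numbers, "the next row and its duplicates" becomes simply "group + 1", which is what the key step relies on.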
Answer 2: (score: -1)
data want;
  merge table1 table2(in=inb);
  by name date;
  retain _date num;
  if first.name then call missing(_date,num);
  if inb then do;                  /* name/date is in table2 */
    num=_n_;
    delete;
  end;
  else if _n_-num=1 then do;       /* row right after a match */
    _date=date;
    delete;
  end;
  else if _date=date then delete;  /* duplicates of that next row */
  drop _date num;
run;
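The merge above, like any BY-group processing, assumes both inputs are already sorted by name and date, with date stored as a numeric SAS date rather than text. A minimal sketch of that preparation, assuming the datasets are named table1 and table2 as in the question:

```sas
/* Sketch: sort both inputs in place before the merge.      */
/* No nodupkey here, so duplicate rows in table1 are kept.  */
proc sort data=table1;
  by name date;
run;

proc sort data=table2;
  by name date;
run;
```

Leaving out nodupkey matters for this problem, because the duplicates that survive (such as the two 8-Jul-09 rows for A) are part of the expected result.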