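I have a panel dataset at hourly frequency. If any given one-hour interval contains fewer than 200 observations, I want to delete all observations in that interval. So I first count the number of observations N in each hour, and then delete the observations where N < 200:

/* Step 1: running count of observations within each date/hour group */
data lib.data;
    set lib.data;
    retain I;
    by date hour;
    if first.date or first.hour then I=1; else I=I+1;
run;

/* Step 2: attach the group size N (the maximum of I) to every row */
proc sql;
    create table lib.data1
    as select a.*, max(I) as N
    from lib.data as a
    group by date, hour
    order by date, hour;
quit;

/* Step 3: drop every row that belongs to an hour with fewer than 200 observations */
data lib.data (drop= i n);
    set lib.data1;
    if n < 200 then delete;
run;

However, the proc sql in step 2 used up all the free space on my C drive. Is there a better way to achieve my goal?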
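Answer 0 (score: 2)
Use a double DOW loop. The first loop counts the records in the group. The second loop can then use that count to execute the OUTPUT statement conditionally.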
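data want ;
    /* First pass over one date/hour group: count its records.
       n is reset to missing at the start of each data step iteration. */
    do until (last.hour);
        set lib.data;
        by date hour;
        n=sum(n,1);
    end;
    /* Second pass over the same group: keep the rows only if the group is large enough */
    do until (last.hour);
        set lib.data;
        by date hour;
        if n >= 200 then output;
    end;
run;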
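Answer 1 (score: 2)
PROC SQL is not the problem per se. Not having all of the non-aggregated columns in the GROUP BY has unintended consequences (such as remerging the data). Here is a SQL solution that hopefully will not blow up your drive.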
proc sql;
create table want as
select
a.*
from
lib.data a
join
(select
date,
hour,
count(*)
from
lib.data
group by date, hour
having count(*) >= 200) b
on
a.date = b.date and
a.hour = b.hour
;
quit;
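Answer 2 (score: 0)
You can try using a hash table to hold the first records of each hour until the 200th one arrives. When you reach the 200th record, output the records buffered in the hash table, and then output the remaining observations of the current hour as they are read. The code below shows how it works: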
data lib.data (drop= counter rc);
    set lib.data;
    by date hour;
    retain counter 0;
    if _N_ = 1 then do;
        /* Note: no definedata() call is made here, so only the key variables
           (date, hour) are stored; to restore the buffered rows in full, the
           remaining variables would have to be listed in a definedata() call. */
        declare hash hs(multidata:'yes');
        hs.definekey('date','hour');
        hs.definedone();
    end;
    /* if first record in hour, zero the counter */
    if first.hour then do;
        counter=0;
    end;
    /* increment counter */
    counter = counter+1;
    /* if counter is less than 200, add record to hash table */
    if counter < 200 then do;
        hs.add();
    end;
    /* if counter = 200, output current record and the records from the hash */
    if counter = 200 then do;
        output;
        rc = hs.find();
        do while(rc=0);
            output;
            rc = hs.find_next();
        end;
    end;
    /* if counter is greater than 200, output current record */
    if counter > 200 then output;
    /* if last record in hour, clear the hash */
    if last.hour then do;
        hs.clear();
    end;
run;