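I have a SAS question. I have a large dataset with unique IDs and a set of variables observed over a time series. Some IDs are present throughout the whole series, some new IDs are added, and some old IDs are removed.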
ID Year Var3 Var4
1 2015 500 200
1 2016 600 300
1 2017 800 100
2 2016 200 100
2 2017 100 204
3 2015 560 969
3 2016 456 768
4 2015 543 679
4 2017 765 534
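As the table above shows, ID 1 appears in all three years (2015-2017), ID 2 appears from 2016 onward, ID 3 is removed in 2017, and ID 4 appears in 2015, is removed in 2016, and then reappears in 2017.
I want to know which IDs are new and which IDs were removed in any given year, while keeping all the data. For example, a new table with indicators showing which IDs are new and which were removed. It would also be nice to get the frequency of how many IDs were added/removed in a given year, along with the sums of their "Var3" and "Var4". Do you have any suggestions?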
*************UPDATE******************
OK, so I tried the following program:
**** Addition to suggested code ****;
options validvarname=any;
proc sql noprint;
create table years as
select distinct year
from have;
create table ids as
select distinct id
from have;
create table all_id_years as
select a.id, b.year
from ids as a,
years as b
order by id, year;
create table indicators as
select coalesce(a.id,b.id) as id,
coalesce(a.year,b.year) as year,
coalesce(a.id/a.id,0) as indicator
from have as a
full join
all_id_years as b
on a.id = b.id
and a.year = b.year
order by id, year
;
quit;
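The WHERE clauses in the steps that follow refer to '2016'n and '2017'n, which exist only once the indicators table has one column per year. The post does not show that step, so the sketch below assumes the PROC TRANSPOSE from the suggested answer was run at this point (with validvarname=any, set above, so that the year values are valid column names).

```sas
/* Assumed step, not shown in the post: reshape INDICATORS from one row */
/* per ID/year to one row per ID with a 0/1 column for each year.       */
proc transpose data=indicators out=indicators(drop=_name_);
  by id;          /* one output row per ID (input is already ordered by id, year) */
  id year;        /* year values become the column names, e.g. '2016'n */
  var indicator;  /* cell value: 1 if the ID has a row in that year, else 0 */
run;
```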
This now gives me a table containing only the IDs that are new in 2017:
data new_in_17;
set indicators;
where ('2016'n=0) and ('2017'n=1);
run;
I can now merge this table to add var3 and var4:
data new17;
merge new_in_17(in=x1) have(in=x2);
by id;
if x1=x2;
run;
Now I can get the frequency of the new IDs in 2017, along with the sums of var3 and var4:
proc means data=new17 noprint;
var var3 var4;
where year in (2017);
output out=sum_var_freq_new sum(var3)=sum_var3 sum(var4)=sum_var4;
run;
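This gives me the output I need. However, I also want the equivalent output for the IDs that "disappeared" between 2016 and 2017, which can be obtained from: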
data gone_in_17;
set indicators;
where ('2016'n=1) and ('2017'n=0);
run;
data gone17;
merge gone_in_17(in=x1) have(in=x2);
by id;
if x1=x2;
run;
proc means data=gone17 noprint;
var var3 var4;
where year in (2016);
output out=sum_var_freq_gone sum(var3)=sum_var3 sum(var4)=sum_var4;
run;
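The end result should combine the two tables "sum_var_freq_new" and "sum_var_freq_gone" into one table. In addition, I need this table for every new year, so my current approach is very inefficient. Do you have any suggestions for how to achieve this efficiently?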
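Answer 0 (score: 0)
Apart from a different example, you have not provided any additional information beyond your previous question that would show what was missing from the previous answer.
To build the indicator table, you can use a macro do loop that dynamically picks up the distinct year values present in the dataset.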
data have;
infile datalines;
input ID year var3 var4;
datalines;
1 2015 500 200
1 2016 600 300
1 2017 800 100
2 2016 200 100
2 2017 100 204
3 2015 560 969
3 2016 456 768
4 2015 543 679
4 2017 765 534
;
run;
proc sql noprint;
select distinct year
into :year1-
from have
;
quit;
%macro doWant;
proc sql;
create table want as
select distinct ID
%let i=1;
%do %while(%symexist(year&i.));
,exists(select * from have b where year=&&year&i.. and a.id=b.id) as "&&year&i.."n
%let i=%eval(&i.+1);
%end;
from have a
;
quit;
%mend;
%doWant;
This produces the following result:
ID 2015 2016 2017
-----------------
1 1 1 1
2 0 1 1
3 1 1 0
4 1 0 1
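One thing to watch with the loop above: %symexist stops only when the next year&i macro variable is missing, so variables left over from an earlier run with more years could add stale columns. A minimal sketch of a variant that bounds the loop with the automatic SQLOBS variable instead is shown below. The names nyears, doWant2 and want2 are illustrative and not part of the original answer, and validvarname=any is assumed so that the year values can be used as column names.

```sas
options validvarname=any;

proc sql noprint;
  /* one macro variable per distinct year: year1, year2, ... */
  select distinct year
    into :year1-
    from have;
  /* SQLOBS holds the number of rows the SELECT just returned */
  %let nyears = &sqlobs;
quit;

%macro doWant2;
  %local i;
  proc sql;
    create table want2 as
    select distinct ID
    %do i = 1 %to &nyears;
      /* 1 if this ID has a row in the i-th year, otherwise 0 */
      , exists(select * from have b
               where b.year = &&year&i and a.id = b.id) as "&&year&i"n
    %end;
    from have a;
  quit;
%mend doWant2;

%doWant2;
```

Because the loop runs exactly nyears times, the result depends only on the years found in the current query.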
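Answer 1 (score: 0)
Here is a more efficient approach that also gives you the summary values.
First, a bit of SQL magic. Create the cross product of years and IDs, then join it to the table you have to create the indicators: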
proc sql noprint;
/*All Years*/
create table years as
select distinct year
from have;
/*All IDS*/
create table ids as
select distinct id
from have;
/*All combinations of ID/year*/
create table all_id_years as
select a.id, b.year
from ids as a,
years as b
order by id, year;
/*Original data with rows added for missing years. Indicator=1 if the*/
/*ID has a row in that year, 0 if not.*/
create table indicators as
select coalesce(a.id,b.id) as id,
coalesce(a.year,b.year) as year,
coalesce(a.id/a.id,0) as indicator
from have as a
full join
all_id_years as b
on a.id = b.id
and a.year = b.year
order by id, year
;
quit;
Now transpose.
proc transpose data=indicators out=indicators(drop=_name_);
by id;
id year;
var indicator;
run;
Create the sums. You can also add other summary statistics here:
proc summary data=have;
by id;
var var3 var4;
output out=summary sum=;
run;
Merge the indicators and the summary values:
data want;
merge indicators summary(keep=id var3 var4);
by id;
run;
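The per-year counts and sums of added and removed IDs that the question asks for can also be computed for all years in one pass, by comparing each ID/year flag with the flag for the year before. The following is only a sketch under stated assumptions: have contains one row per ID and year, the years are consecutive integers, and the names flags, changes, present, status and n_ids are illustrative. For removed IDs it sums the values from the last year in which the ID was present, mirroring the 2016 filter used for gone17 in the question.

```sas
proc sql;
  /* Every ID/year combination, with present=1 if HAVE has a row for it */
  create table flags as
  select g.id, g.year,
         (h.id is not null) as present,
         h.var3, h.var4
  from (select a.id, b.year
        from (select distinct id from have) as a,
             (select distinct year from have) as b) as g
       left join have as h
       on g.id = h.id and g.year = h.year;

  /* Pair each year with the previous one and keep only the changes */
  create table changes as
  select cur.year,
         case when cur.present = 1 then 'new ' else 'gone' end as status,
         count(*) as n_ids,
         /* new IDs: values of the current year; gone IDs: values of the previous year */
         sum(case when cur.present = 1 then cur.var3 else prev.var3 end) as sum_var3,
         sum(case when cur.present = 1 then cur.var4 else prev.var4 end) as sum_var4
  from flags as cur
       inner join flags as prev
       on cur.id = prev.id and prev.year = cur.year - 1
  where cur.present ne prev.present
  group by cur.year, calculated status
  order by year, status;
quit;
```

The result has one row per year and status, so adding another year of data needs no new code. The first year in the data has no predecessor and therefore never appears in changes.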