Question

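I have a SAS question. I have a large dataset with unique IDs and a set of variables observed over a time series. Some IDs are present throughout the whole series, some new IDs are added, and some old IDs are removed.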

ID    Year    Var3    Var4
1     2015    500     200
1     2016    600     300
1     2017    800     100
2     2016    200     100
2     2017    100     204
3     2015    560     969
3     2016    456     768
4     2015    543     679
4     2017    765     534

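As the table above shows, ID 1 is present in all three years (2015-2017), ID 2 appears from 2016 onward, ID 3 is removed in 2017, and ID 4 is present in 2015, removed in 2016, and reappears in 2017.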

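I would like to know which IDs are new and which IDs are removed in any given year, while keeping all the data. For example, a new table with indicators showing which IDs are new and which have been removed. It would also be nice to get the frequency of how many IDs were added/removed in a given year, along with the sums of their "Var3" and "Var4". Do you have any suggestions?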

*************UPDATE******************

OK, so I tried the following program:

**** Addition to suggested code ****;
options validvarname=any;

proc sql noprint;
create table years as
select distinct year
from have;

create table ids as
select distinct id
from have;

create table all_id_years as
select a.id,  b.year
from ids as a,
years as b
order by id, year;

create table indicators as
select coalesce(a.id,b.id) as id,
coalesce(a.year,b.year) as year,
coalesce(a.id/a.id,0) as indicator
from have as a
full join
all_id_years as b
on a.id = b.id
and a.year = b.year
order by id, year
;
quit;

This now gives me a table containing only the IDs that are new in 2017:

data new_in_17;
set indicators;
where ('2016'n=0) and ('2017'n=1);
run;

I can now merge this table with the original data to add var3 and var4:

data new17;
merge new_in_17(in=x1) have(in=x2);
by id;
if x1=x2;
run;

Now I can find the frequency of the new IDs in 2017 and the sums of var3 and var4:

proc means data=new17 noprint;
var var3 var4;
where year in (2017);
output out=sum_var_freq_new sum(var3)=sum_var3 sum(var4)=sum_var4;
run;

This gives me the output I need. The equivalent output for the IDs that "disappeared" between 2016 and 2017 can be obtained from:

data gone_in_17;
set indicators;
where ('2016'n=1) and ('2017'n=0);
run;

data gone17;
merge gone_in_17(in=x1) have(in=x2);
by id;
if x1=x2;
run;

proc means data=gone17 noprint;
var var3 var4;
where year in (2016);
output out=sum_var_freq_gone sum(var3)=sum_var3 sum(var4)=sum_var4;
run;

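The final result should combine the two tables "sum_var_freq_new" and "sum_var_freq_gone" into one table. In addition, I need this table for every new year, so my current approach is very inefficient. Do you have any suggestions on how to achieve this efficiently?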

Answer 1

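Apart from a different example, you did not add any information beyond your previous question, so it is hard to tell what was missing from the previous answer.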

To build on the latter, you can use a macro %do loop that dynamically accounts for the distinct year values present in the dataset.

data have;
infile datalines;
input ID year var3 var4;
datalines;
1 2015 500 200
1 2016 600 300
1 2017 800 100
2 2016 200 100
2 2017 100 204
3 2015 560 969
3 2016 456 768
4 2015 543 679
4 2017 765 534
;
run;

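/* Name literals such as "2015"n require VALIDVARNAME=ANY */
options validvarname=any;

/* Store each distinct year in macro variables year1, year2, ... */
proc sql noprint;
  select distinct year
    into :year1-
    from have
  ;
quit;

%macro doWant;
  %local i;
  proc sql;
    create table want as
    select distinct ID
    %let i=1;
    /* Add one 0/1 indicator column per year found above */
    %do %while(%symexist(year&i.));
      ,exists(select * from have b where year=&&year&i.. and a.id=b.id) as "&&year&i.."n
      %let i=%eval(&i.+1);
    %end;
    from have a
    ;
  quit;
%mend;
%doWant;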

This produces the following result:

ID  2015 2016 2017
-----------------
1   1    1    1
2   0    1    1
3   1    1    0
4   1    0    1
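
To go from this indicator table to the IDs that were added or removed between two years, one possible sketch is a small macro that compares two of the year columns. The macro name `flagChange`, its parameters `prev` and `curr`, and the `status` variable are hypothetical names chosen for illustration; the sketch assumes the `want` table built above and `validvarname=any`.

```sas
/* Needed so that name literals such as "2017"n are accepted */
options validvarname=any;

/* Hypothetical helper: compare two year columns of WANT */
%macro flagChange(prev=, curr=);
  data change_&curr.;
    set want;
    length status $4;
    /* present now but not before: the ID is new */
    if "&prev."n = 0 and "&curr."n = 1 then status = 'new';
    /* present before but not now: the ID is gone */
    else if "&prev."n = 1 and "&curr."n = 0 then status = 'gone';
    /* unchanged IDs are dropped */
    else delete;
  run;
%mend flagChange;

%flagChange(prev=2016, curr=2017);
```

Calling the macro once per pair of consecutive years avoids writing a separate data step for each year.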

Answer 2

Here is a more efficient approach that also gives you the summary values.

First, a bit of SQL magic. Create the cross product of years and IDs, then join it to your `have` table to create the indicator:

proc sql noprint;
/*All Years*/
create table years as
select distinct year
    from have;

/*All IDS*/
create table ids as
select distinct id
    from have;

/*All combinations of ID/year*/
create table all_id_years as
select a.id,  b.year
    from ids as a,
         years as b
    order by id, year;

/*Original data with rows added for missing years.  Indicator=1 if it*/
/*existed prior, 0 if not.*/
create table indicators as
select coalesce(a.id,b.id) as id,
       coalesce(a.year,b.year) as year,
       coalesce(a.id/a.id,0) as indicator
    from have as a
      full join
         all_id_years as b
      on a.id = b.id
       and a.year = b.year
    order by id, year
    ;
quit;

Now transpose.

proc transpose data=indicators out=indicators(drop=_name_);
by id;
id year;
var indicator;
run;
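
If you would rather not depend on `validvarname=any` and name literals such as '2017'n, a possible variant of the transpose step (to be run instead of the one above, not after it) uses the PREFIX= option of PROC TRANSPOSE. The prefix `y` is an arbitrary choice for illustration; with it the columns would be named y2015, y2016 and y2017.

```sas
/* Variant of the transpose step: prefix the year values so the   */
/* resulting column names (y2015, y2016, ...) are valid V7 names. */
proc transpose data=indicators out=indicators(drop=_name_) prefix=y;
  by id;
  id year;
  var indicator;
run;
```

Any later step that refers to the year columns would then use `y2016` and `y2017` in place of '2016'n and '2017'n.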

Create the sums. You can also add other summary statistics here:

proc summary data=have;
by id;
var var3 var4;
output out=summary sum=;
run;

Merge the indicators and the summary values:

data want;
merge indicators summary(keep=id var3 var4);
by id;
run;
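
If you also need the per-year counts and sums of added and removed IDs that the question asks for, one possible sketch works on the long (untransposed) data. It assumes the `have` and `all_id_years` tables from above, and it treats "previous year" as the previous distinct year found in the data. The table names `presence`, `changes` and `change_summary`, and the variables `status`, `chg_var3`, `chg_var4` and `n_ids`, are hypothetical. Removed IDs are counted in the year they go missing, with the values from their last year present, as in the question's own approach.

```sas
/* One row per ID/year, with a 0/1 presence flag and the values if present */
proc sql;
  create table presence as
  select b.id, b.year,
         (a.id is not null) as present,
         a.var3, a.var4
    from all_id_years as b
      left join
         have as a
      on a.id = b.id
       and a.year = b.year
    order by b.id, b.year;
quit;

/* Compare each year with the previous one within an ID */
data changes;
  set presence;
  by id;
  length status $4;
  /* LAG must run on every row, so it is called unconditionally */
  prev_present = lag(present);
  prev_var3    = lag(var3);
  prev_var4    = lag(var4);
  /* the first year of an ID has nothing to compare with */
  if first.id then delete;
  if present = 1 and prev_present = 0 then do;
    status = 'new';  chg_var3 = var3;      chg_var4 = var4;
  end;
  else if present = 0 and prev_present = 1 then do;
    status = 'gone'; chg_var3 = prev_var3; chg_var4 = prev_var4;
  end;
  else delete;
  keep id year status chg_var3 chg_var4;
run;

/* Count and sum per year and status, all years in one table */
proc summary data=changes nway;
  class year status;
  var chg_var3 chg_var4;
  output out=change_summary(drop=_type_ rename=(_freq_=n_ids))
         sum=sum_var3 sum_var4;
run;
```

Because the year is a CLASS variable rather than a hard-coded filter, a new year in the data adds rows to `change_summary` without any change to the code.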

Tracking IDs in SAS
