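PROC SQL newbie here - I want to use PROC SQL to concatenate (stack) the ID and Race data from two different datasets, while removing duplicates on ID only (not on ID and Race together) - is this possible? For example, after combining the data below, I only want the first instance of ID = 1 (where Race = WHITE), not both {(1, WHITE) and (1, BLACK)}.
Sample data: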
DATA SAMPLE1;
INPUT ID RACE$;
DATALINES;
1 WHITE
2 BLACK
3 WHITE
4 BLANK
;
RUN;
DATA SAMPLE2;
INPUT ID RACE$;
DATALINES;
5 HISPANIC
6 ASIAN
7 HISPANIC
8 ASIAN
1 BLACK
;
RUN;
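Answer 0: (score: 3)
This isn't something SQL does as well as regular SAS, but it's certainly possible.
Some options:
Outer join, using COALESCE. This is harder to write than the other options, because you have to write each variable twice in the initial SELECT.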
proc sql;
select coalesce(s1.id,s2.id) as id, coalescec(s1.race,s2.race) as race from (
(select * from sample2) s2
full outer join
(select *,"1" as sample1 from sample1) s1
on s2.id=s1.id);
quit;
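Union with an EXISTS subquery. How slow this is depends on the table sizes: with a 10k-row table and a 10-row table it is a fast solution; with two 10k-row tables it is slow.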
proc sql;
select * from sample1
union
select * from sample2 where not exists (
select 1 from sample1 where sample1.id=sample2.id
);
quit;
Union with a JOIN. Probably faster than the query above, depending on indexes etc.
proc sql;
select * from sample1
union
select sample2.* from sample2
left join sample1
on sample1.id=sample2.id
where missing(sample1.id);
quit;
But the simplest solution is undoubtedly to do it in regular SAS rather than SQL.
data sample12_view/view=sample12_view;
set sample1 sample2;
run;
proc sort nodupkey data=sample12_view out=sample12;
by id;
run;
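or
proc sort data=sample2 out=sample2_sorted;
by id;
run;
data sample12;
merge sample1(in=s1) sample2_sorted(in=s2);
by id;
run;
MERGE with a BY statement needs both inputs sorted by id; sample1 already is, but sample2 as given is not, hence the PROC SORT. In this case s2 replaces s1 (so ID = 1 ends up as BLACK); if you prefer the other option, change the order in the MERGE statement.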
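The order swap mentioned above can be sketched roughly as follows. This is only an illustration: sample2_sorted and sample12_keep_s1 are hypothetical names, and the sort is included because a BY-merge expects its inputs ordered by id. Listing sample1 last means its RACE value is the one that survives for a shared ID.

```sas
/* Sort sample2 by id (sample1 is already in id order in the example data). */
proc sort data=sample2 out=sample2_sorted;
  by id;
run;

/* In a BY-merge, the dataset listed last overwrites same-named variables, */
/* so putting sample1 last keeps (1, WHITE) rather than (1, BLACK).        */
data sample12_keep_s1;
  merge sample2_sorted sample1;
  by id;
run;

/* Sanity check: the row count should equal the number of distinct IDs. */
proc sql;
  select count(*) as n_rows, count(distinct id) as n_ids
  from sample12_keep_s1;
quit;
```

With the sample data, both counts should come out as 8, and ID = 1 should carry RACE = WHITE.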
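Answer 1: (score: 0)
Really, you should specify which of the duplicates you want to keep - SQL tries to be deterministic. Something like this should work. Note that max(race) keeps the alphabetically last RACE value for each ID, which happens to be WHITE for ID = 1: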
proc sql;
create table both_samples as
select * from (
(select *
from sample1 )
union ( select *
from sample2 )
)
group by id
having race = max( race )
;
quit;
proc print data = both_samples noobs;
run;
1 WHITE
2 BLACK
3 WHITE
4 BLANK
5 HISPANIC
6 ASIAN
7 HISPANIC
8 ASIAN
Answer 2: (score: 0)
This gives you the answer you specified (note that monotonic() is an undocumented SAS function):
proc sql;
create table all as
select monotonic() as _n_, * from sample1
union all
select monotonic() as _n_, * from sample2;
create table distinct_ids as
select id, min(_n_) as _n_ from all group by 1;
create table results as
select a.id
,(select race from all where all.id=a.id and all._n_=a._n_) as race
from distinct_ids a;
quit;