I have a dataset work.test1 with four variables: hhid (household id), pid (person id), pidlink (the combination of hhid and pid) and bin (positive or negative).
Sample data is shown below:
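Now I want to create a dataset work.test2 that contains only unique hhid values: the row with bin = 2 if the household has one, otherwise the row with bin = 1. If a household has more than one row with bin = 2, I take the first. If it has no bin = 2 but more than one row with bin = 1, I take the first bin = 1 row. The resulting dataset should have only unique hhid values (a single entry per household).
The resulting output should look like this: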
obs hhid pid pidlink bin
1 10600 1 1060001 1
2 10600 1 1060001 1
3 10800 1 1080001 1
4 10800 1 1080001 1
5 10800 2 1080002 1
6 10800 2 1080002 2
7 12200 1 1220001 1
8 12200 1 1220001 2
Thanks
Answer 0 (score: 0)
Judging by the data and output shown, GROUP BY with the max function should work and give the desired result.
data have(drop=obs);
  input obs hhid pid pidlink bin;
  datalines;
1 10600 1 1060001 1
2 10600 1 1060001 1
3 10800 1 1080001 1
4 10800 1 1080001 1
5 10800 2 1080002 1
6 10800 2 1080002 2
7 12200 1 1220001 1
8 12200 1 1220001 2
;
run;
proc sql;
  /* one row per household, taking the maximum of each column */
  select hhid, max(pid) as pid, max(pidlink) as pidlink, max(bin) as bin
  from have
  group by hhid;
quit;
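One caveat with the query above: each max() is computed per column, so the pid, pidlink and bin it returns could come from different rows if the data is less tidy than the sample. Below is a minimal sketch of an alternative that keeps an actual row per household. It assumes the input is the `have` data set created above, that bin only takes the values 1 and 2, and that "first" means first in the original row order. The names `have_sorted` and `want` are made up for illustration.

```sas
/* Sort so that, within each household, rows with bin = 2 come before bin = 1.
   PROC SORT keeps the original relative order of tied rows by default (EQUALS). */
proc sort data=have out=have_sorted;
  by hhid descending bin;
run;

/* Keep only the first row of each household */
data want;
  set have_sorted;
  by hhid;
  if first.hhid;
run;
```

On the sample data this should keep the rows (10600, 1, 1060001, 1), (10800, 2, 1080002, 2) and (12200, 1, 1220001, 2).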
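If you have more columns it gets a bit trickier. You can still do it, but you need more conditions in the selection, otherwise you will get extra records. See the query below.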
data have(drop=obs);
  input obs hhid pid pidlink bin anotherval1 anotherval2 $;
  datalines;
1 10600 1 1060001 1 7 A
2 10600 1 1060001 1 8 B
3 10800 1 1080001 1 6 C
4 10800 1 1080001 1 8 D
5 10800 2 1080002 1 8 E
6 10800 2 1080002 2 9 F
7 12200 1 1220001 1 10 G
8 12200 1 1220001 2 7 H
;
run;
proc sql;
  /* keep the rows that carry the per-household maximum of pid, pidlink and bin */
  select * from have
  group by hhid
  having pid = max(pid)
     and pidlink = max(pidlink)
     and bin = max(bin);
quit;
If you want only distinct records with respect to the other columns, then:
data have1;
set have;
val =_n_;
run;
proc sql;
  /* val is the original row number; keep the earliest matching row per group */
  create table have2(drop=val) as
  select * from
    (select * from have1
     group by hhid
     having pid = max(pid)
        and pidlink = max(pidlink)
        and bin = max(bin)) a
  group by hhid, pid, pidlink, bin
  having val = min(val);
quit;
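A quick way to sanity-check the result, sketched under the assumption that the `have` and `have2` tables from the steps above exist: the first query should return no rows if every household appears once, and the two counts should match if no household was dropped. A household can be dropped when no single row holds the maximum pid, pidlink and bin at the same time. The column names `n_rows`, `households_in` and `households_out` are illustrative only.

```sas
proc sql;
  /* households that still have more than one row; expect an empty result */
  select hhid, count(*) as n_rows
  from have2
  group by hhid
  having count(*) > 1;

  /* number of households before and after; the two values should be equal */
  select count(distinct hhid) as households_in from have;
  select count(distinct hhid) as households_out from have2;
quit;
```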