I am trying to find the closest value for each treated observation. The data look like this (a subset of the 1.2M observations):
> dta
id treatment score
1: 5 0 0.02381024
2: 10 0 0.05428605
3: 22 0 0.02118124
4: 27 0 0.01495214
5: 45 0 0.01877916
6: 50 0 0.02120360
7: 58 0 0.02207263
8: 60 0 0.02807019
9: 61 0 0.05432927
10: 65 1 0.59612077
11: 68 0 0.02482168
12: 72 1 0.14582400
13: 73 0 0.02371670
14: 77 0 0.02608826
15: 87 0 0.06852409
16: 88 0 0.07473471
17: 94 0 0.07160314
18: 97 0 0.02040747
19: 104 1 0.09878789
20: 108 0 0.02421807
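For each treated observation (i.e., treatment == 1), I want to get the untreated observation (i.e., treatment == 0) with the closest score, and then mark the selected observation as unavailable for matching to the other treated observations.
For example, the first treated observation (row 10) would be matched to id = 88 (row 16), row 12 to row 17, and so on. Currently I am running the following loop: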
smpl_treated = dta[treatment == 1]
smpl_untreated = dta[treatment == 0]
n_tmp = nrow(smpl_treated)
matched_id = matrix(0, n_tmp, 1)
smpl_tmp = smpl_untreated
for (i in 1:nrow(smpl_treated)) {
x = smpl_treated[i]$score
setkey(smpl_tmp, score)
tmp = smpl_tmp[J(x), roll = "nearest"]
matched_id[i] = tmp[[1]]
smpl_tmp = smpl_tmp[id != tmp[[1]]]
}
matched_smpl = smpl_untreated[id %in% matched_id]
> matched_smpl
id treatment score
1: 87 0 0.06852409
2: 94 0 0.07160314
3: 88 0 0.07473471
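Any suggestions for doing this within data.table, or for making the loop faster? With the original 1.2M observations, the loop takes 2 hours. Thanks for your help!
Answer 0 (score: 1)
I may have a solution if you order your data table, take subsets, and use the power of merging. I am not sure it is the best solution, but it seems to work for what I understand you want to do, and it will certainly be faster than your loop: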
library(data.table)
dta <- data.table(id = c(5,10,22,27,45,50,58,60,61,65,68,72,73,77,87,88,94,97,104,108),
treatment = c(0, 0 ,0 ,0, 0, 0, 0 ,0 , 0 , 1, 0 ,1 ,0, 0 ,0 ,0 ,0 ,0 ,1 ,0),
score = c(0.02381024, 0.05428605, 0.02118124, 0.01495214, 0.01877916, 0.02120360,
0.02207263, 0.02807019, 0.05432927, 0.59612077, 0.02482168, 0.14582400,
0.02371670, 0.02608826, 0.06852409, 0.07473471, 0.07160314, 0.02040747,
0.09878789, 0.02421807))
setkey(dta, score) # order by score
treated_nbr <- dta[treatment == 1, .N] # just to simplify the next line
selecteddata <-
dta[treatment == 0,
.SD[(.N - treated_nbr + 1):.N,
.(correspid = id,
correspscore = score,
id = dta[treatment == 1, id])]]
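Here we take the same number of ordered untreated observations, (.N - treated_nbr + 1):.N, so that they have the closest scores to the ordered treated observations, and we merge them with the ids of the treated observations, id = dta[treatment == 1, id].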
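The row range (.N - treated_nbr + 1):.N can be illustrated on a toy table. This is only a sketch: the table `untreated` and its rounded scores are hypothetical values made up for illustration, and the point is simply that, once the table is keyed (sorted ascending) by score, the range picks the last `treated_nbr` rows, i.e. the highest untreated scores.

```r
library(data.table)

# Hypothetical toy table of untreated observations (rounded scores)
untreated <- data.table(id = c(27, 45, 87, 94, 88),
                        score = c(0.015, 0.019, 0.069, 0.072, 0.075))
setkey(untreated, score)   # sort ascending by score
treated_nbr <- 3L          # assumed number of treated observations

# Rows (.N - treated_nbr + 1):.N are the last treated_nbr rows,
# i.e. the untreated observations with the highest scores
top_untreated <- untreated[(.N - treated_nbr + 1):.N]
print(top_untreated)
```

The same rows could also be obtained with tail(untreated, treated_nbr).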
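I am not sure it is exactly what you want, because I realize it only works if your treated scores are all higher than your untreated scores (which is the case in your example). You could add a condition so that the proposed solution is used only for treated observations scoring above the untreated ones, and do something else otherwise (I do not see a straightforward simple solution for that case).
Answer 1 (score: 1)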
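The precondition mentioned here can be tested before relying on the rank-based pairing. A minimal sketch, assuming the column names from the question and using a four-row subset of the example data; the variable names `lowest_treated`, `highest_untreated`, and `rank_matching_ok` are introduced only for illustration:

```r
library(data.table)

# Small subset of the example data (ids and scores taken from the question)
dta <- data.table(id = c(61, 65, 88, 104),
                  treatment = c(0, 1, 0, 1),
                  score = c(0.05432927, 0.59612077, 0.07473471, 0.09878789))

# The rank-based pairing assumes every treated score lies above
# every untreated score
lowest_treated <- dta[treatment == 1, min(score)]
highest_untreated <- dta[treatment == 0, max(score)]
rank_matching_ok <- lowest_treated > highest_untreated
print(rank_matching_ok)
```

If the check fails, the pairing by rank is no longer guaranteed to reproduce the one-at-a-time nearest matching, and a different approach would be needed for those observations.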
This elaborates on the already accepted answer of denis by using more of what data.table syntax offers, e.g., the on parameter in joins instead of setkey().
# determine the minimum number of treated and untreated cases
n <- min(dta[treatment == 0L, .N], dta[treatment == 1L, .N])
# order by descending score
mdt <- dta[order(-score)][
# and pick the ids of the top n treated and untreated cases
# so that the highest untreated score match the highest treated score,
# the 2nd highest untreated the 2nd highest treated and so forth
, .(id0 = head(.SD[treatment == 0L, id], n), id1 = head(.SD[treatment == 1L, id], n))]
mdt
id0 id1
1: 88 65
2: 94 72
3: 87 104
# join the ids two times to show the data of the treated and untreated cases
dta[dta[mdt, on = .(id==id0)], on = .(id = id1)]
id treatment score i.id i.treatment i.score
1: 65 1 0.59612077 88 0 0.07473471
2: 72 1 0.14582400 94 0 0.07160314
3: 104 1 0.09878789 87 0 0.06852409
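On a small sample, the rank-based pairs can be cross-checked against the one-at-a-time nearest matching described in the question. The sketch below is a plain base R greedy loop, not the data.table approach from either answer; the function `greedy_match` is hypothetical and written only for this check. It assumes there are at least as many untreated as treated observations, and it is quadratic in the worst case, so it is not meant for the full 1.2M rows.

```r
# Hypothetical helper: for each treated score, in the order given, pick the
# closest untreated score that has not been used yet.
greedy_match <- function(treated_score, untreated_id, untreated_score) {
  stopifnot(length(untreated_id) == length(untreated_score),
            length(treated_score) <= length(untreated_id))
  available <- rep(TRUE, length(untreated_id))
  matched <- numeric(length(treated_score))
  for (i in seq_along(treated_score)) {
    d <- abs(untreated_score - treated_score[i])
    d[!available] <- Inf          # already matched rows cannot be chosen again
    j <- which.min(d)
    matched[i] <- untreated_id[j]
    available[j] <- FALSE
  }
  matched
}

# Treated ids 65, 72, 104 and four untreated rows from the example data
matched <- greedy_match(
  treated_score   = c(0.59612077, 0.14582400, 0.09878789),
  untreated_id    = c(61, 87, 88, 94),
  untreated_score = c(0.05432927, 0.06852409, 0.07473471, 0.07160314)
)
print(matched)
```

For these rows the greedy loop returns the same untreated ids, in the same order, as the id0 column of mdt.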