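What I want to do is close to propensity score matching (or causal matching, MatchIt), but not exactly the same.
I simply want to find and collect the closest (paired) observations in a dataset that contains mixed variables (categorical and numeric).
The dataset looks like this: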
id child age edu y
1 11011209 0 69 some college 495
2 11011212 0 44 secondary/primary 260
3 11011213 1 40 some college 175
4 11020208 1 47 secondary/primary 0
5 11020212 1 50 secondary/primary 25
6 11020310 0 65 secondary/primary 525
7 11020315 1 43 college 0
8 11020316 1 41 secondary/primary 5
9 11031111 0 49 secondary/primary 275
10 11031116 1 42 secondary/primary 0
11 11031119 0 32 college 425
12 11040801 1 38 secondary/primary 0
13 11040814 0 52 some college 260
14 11050109 0 59 some college 405
15 11050111 1 35 secondary/primary 20
16 11050113 0 51 secondary/primary 40
17 11051001 1 38 college 165
18 11051004 1 36 college 10
19 11051011 0 63 secondary/primary 455
20 11051018 0 44 college 40
I want to match on the variables {child, age, edu}, not on y (nor on id).
Because my dataset has mixed variables, I can use the Gower distance:
library(cluster)   # daisy()
library(dplyr)     # %>%, group_by(), mutate(), ...
library(reshape2)  # melt()
# test on first ten observations
dt = dt[1:10, ]
# gower distance
ddmen = daisy(dt[,-c(1,5)], metric = 'gower')
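As a sanity check on what daisy() computes here, the Gower distance for a single pair can be reproduced by hand. This is a minimal sketch, assuming daisy()'s default behaviour for this kind of data (equal weights, factors scored 0/1 for match/mismatch, numeric columns scaled by their range); dt10 and gower_pair() are names introduced only for this illustration.

```r
# First ten observations from the post (matching variables only)
dt10 <- data.frame(
  child = factor(c(0, 0, 1, 1, 1, 0, 1, 1, 0, 1)),
  age   = c(69, 44, 40, 47, 50, 65, 43, 41, 49, 42),
  edu   = factor(c("some college", "secondary/primary", "some college",
                   "secondary/primary", "secondary/primary",
                   "secondary/primary", "college", "secondary/primary",
                   "secondary/primary", "secondary/primary"))
)

# Gower distance between rows i and j, computed by hand (assumed defaults):
# factors contribute 0 if equal and 1 otherwise,
# numeric columns contribute |x_i - x_j| / range(x),
# and the distance is the mean of these per-variable contributions.
gower_pair <- function(d, i, j) {
  parts <- sapply(d, function(x) {
    if (is.numeric(x)) abs(x[i] - x[j]) / diff(range(x))
    else as.numeric(x[i] != x[j])
  })
  mean(parts)
}

gower_pair(dt10, 4, 5)
```

For rows 4 and 5 only age differs (47 vs 50, with an age range of 29 in these ten rows), so the value is (3/29)/3, roughly 0.0345.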
Now I want to retrieve the closest observation for each row:
mg = as.matrix(ddmen)
mgg = mg %>% melt() %>% group_by(Var2) %>% filter(value != 0) %>% mutate(m =
min(value)) %>% mutate(closest = Var1[m == value]) %>% as.data.frame()
close = mgg %>% dplyr::select(Var2, closest, dis = m) %>% distinct()
close
which gives me
Var2 closest dis
1 1 6 0.37931034
2 2 9 0.05747126
3 3 8 0.34482759
4 4 5 0.03448276
5 5 4 0.03448276
6 6 9 0.18390805
7 7 10 0.34482759
8 8 10 0.01149425
9 9 2 0.05747126
10 10 8 0.01149425
I can merge close into my original data:
dt$id = 1:10
dt2 = merge(dt, close, by.x = 'id', by.y = 'Var2', all = TRUE)
and then bind the pairs together:
vlist = vector('list', 10)
for(i in 1:10){
vlist[[i]] = dt2[ c( which(dt2$id == i), dt2$closest[dt2$id == i] ), ] %>%
mutate(p = i)
}
bind_rows(vlist)
and obtain
id child age edu y closest dis p
1 1 0 69 some college 495 6 0.37931034 1
2 6 0 65 secondary/primary 525 9 0.18390805 1
3 2 0 44 secondary/primary 260 9 0.05747126 2
4 9 0 49 secondary/primary 275 2 0.05747126 2
...
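Here p is the identifier of the matched pair, based on id. You will notice that an individual can appear in more than one pair (the closest match is not necessarily symmetric: 2 may be the closest match of 1, while 2 itself is closer to some other observation than to 1).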
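To make the asymmetry concrete, here is a small sketch that flags which nearest-neighbour pairs are mutual. It re-enters the Var2/closest columns from the table printed above; the mutual column is a name introduced only for this illustration.

```r
# Nearest-neighbour table as printed in the post (first ten observations)
close <- data.frame(
  Var2    = 1:10,
  closest = c(6, 9, 8, 5, 4, 9, 10, 10, 2, 8)
)

# A pair is mutual when the closest match of my closest match is me
close$mutual <- close$closest[close$closest] == close$Var2

close[close$mutual, ]   # observations whose match points back at them
close[!close$mutual, ]  # observations matched "one way" only
```

For example, observation 1 is matched to 6, but 6 is matched to 9, so the pair (1, 6) is not mutual.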
Question
First, there is a small error in this part of the code:
mgg = mg %>% melt() %>% group_by(Var2) %>% filter(value != 0) %>% mutate(m =
min(value)) %>% mutate(closest = Var1[m == value]) %>% as.data.frame()
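I get this error message: Column closest must be length 19 (the group size) or one, not 2
The code works with 10 observations but not with 20 (the full dataset is provided here). Why?
Second, is there a package that can do this automatically?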
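One possible explanation, offered as a sketch rather than a definitive diagnosis: with all 20 rows, an observation may have two equally close neighbours, so Var1[m == value] returns two values and mutate() cannot fit them into a single group. The snippet below checks for such ties by counting, for each column of the distance matrix, how many rows attain the minimum. It then keeps one neighbour per observation with which.min(), which returns the first minimum; that tie-breaking rule is my own assumption and is not specified anywhere in the original code. The data frame is re-entered from the table above so that the block runs on its own.

```r
library(cluster)

# Matching variables for the 20 observations listed in the post
dt <- data.frame(
  child = factor(c(0, 0, 1, 1, 1, 0, 1, 1, 0, 1,
                   0, 1, 0, 0, 1, 0, 1, 1, 0, 0)),
  age   = c(69, 44, 40, 47, 50, 65, 43, 41, 49, 42,
            32, 38, 52, 59, 35, 51, 38, 36, 63, 44),
  edu   = factor(c("some college", "secondary/primary", "some college",
                   "secondary/primary", "secondary/primary",
                   "secondary/primary", "college", "secondary/primary",
                   "secondary/primary", "secondary/primary", "college",
                   "secondary/primary", "some college", "some college",
                   "secondary/primary", "secondary/primary", "college",
                   "college", "secondary/primary", "college"))
)

mg <- as.matrix(daisy(dt, metric = "gower"))
diag(mg) <- Inf  # an observation must not be matched to itself

# Number of rows attaining the minimum distance in each column
n_min <- apply(mg, 2, function(d) sum(d == min(d)))
which(n_min > 1)  # observations with tied nearest neighbours

# Keep one neighbour per observation
# (first minimum wins: an assumed tie-breaking rule)
close <- data.frame(
  obs     = seq_len(ncol(mg)),
  closest = unname(apply(mg, 2, which.min)),
  dis     = unname(apply(mg, 2, min))
)
close
```

On this data the tie shows up for observation 12 (age 38, child 1, secondary/primary): rows 8 and 15 share its child and edu values and are both three years away. Taking the first minimum is only one option; keeping all tied neighbours or breaking ties at random would be equally defensible, depending on the use case.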
dt = structure(list(id = c(11011209L, 11011212L, 11011213L, 11020208L,
11020212L, 11020310L, 11020315L, 11020316L, 11031111L, 11031116L,
11031119L, 11040801L, 11040814L, 11050109L, 11050111L, 11050113L,
11051001L, 11051004L, 11051011L, 11051018L), child = structure(c(1L,
1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L,
2L, 1L, 1L), .Label = c("0", "1"), class = "factor"), age = c(69L,
44L, 40L, 47L, 50L, 65L, 43L, 41L, 49L, 42L, 32L, 38L, 52L, 59L,
35L, 51L, 38L, 36L, 63L, 44L), edu = structure(c(3L, 2L, 3L,
2L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 3L, 3L, 2L, 2L, 1L, 1L, 2L,
1L), .Label = c("college", "secondary/primary", "some college"
), class = "factor"), y = c(495, 260, 175, 0, 25, 525, 0, 5,
275, 0, 425, 0, 260, 405, 20, 40, 165, 10, 455, 40)), class = "data.frame",
.Names = c("id",
"child", "age", "edu", "y"), row.names = c(NA, -20L))