Matching observations based on a distance algorithm

Time: 2018-07-21 16:09:38

Tags: r match distance matching

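What I want to do is close to propensity score matching (or causal matching, MatchIt), but not quite the same.

I simply want to find and collect the closest (paired) observations in a dataset that contains mixed variables (categorical and numeric).

The dataset looks like this: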

         id child age               edu   y
1  11011209     0  69      some college 495
2  11011212     0  44 secondary/primary 260
3  11011213     1  40      some college 175
4  11020208     1  47 secondary/primary   0
5  11020212     1  50 secondary/primary  25
6  11020310     0  65 secondary/primary 525
7  11020315     1  43           college   0
8  11020316     1  41 secondary/primary   5
9  11031111     0  49 secondary/primary 275
10 11031116     1  42 secondary/primary   0
11 11031119     0  32           college 425
12 11040801     1  38 secondary/primary   0
13 11040814     0  52      some college 260
14 11050109     0  59      some college 405
15 11050111     1  35 secondary/primary  20
16 11050113     0  51 secondary/primary  40
17 11051001     1  38           college 165
18 11051004     1  36           college  10
19 11051011     0  63 secondary/primary 455
20 11051018     0  44           college  40

I want to match on the variables {child, age, edu}, not on y (nor on id).

Because the dataset has mixed variables, I can use the Gower distance:

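library(cluster)   # daisy()
library(dplyr)     # %>%, group_by(), mutate(), ...
library(reshape2)  # melt()

# test on the first ten observations
dt = dt[1:10, ]
# Gower distance on child, age and edu (drop id and y)
ddmen = daisy(dt[, -c(1, 5)], metric = 'gower')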
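
To make the distance concrete, here is a minimal base-R sketch of how a Gower distance between two rows can be computed by hand for these three variables. It assumes the usual definition: a 0/1 mismatch for each factor, the absolute age difference divided by the age range, then the average over the variables. The helper name gower_pair and the data frame d are made up for this illustration and are not part of cluster.

```r
# First ten rows of the data, matching variables only.
d <- data.frame(
  child = factor(c(0, 0, 1, 1, 1, 0, 1, 1, 0, 1)),
  age   = c(69, 44, 40, 47, 50, 65, 43, 41, 49, 42),
  edu   = factor(c("some college", "secondary/primary", "some college",
                   "secondary/primary", "secondary/primary",
                   "secondary/primary", "college", "secondary/primary",
                   "secondary/primary", "secondary/primary"))
)

# Hypothetical helper: Gower distance between rows i and j of d.
gower_pair <- function(d, i, j) {
  age_range <- diff(range(d$age))                    # 69 - 40 = 29 here
  parts <- c(
    child = as.numeric(d$child[i] != d$child[j]),    # 0 if equal, 1 otherwise
    age   = abs(d$age[i] - d$age[j]) / age_range,    # range-scaled difference
    edu   = as.numeric(d$edu[i] != d$edu[j])
  )
  mean(parts)                                        # equal weights
}

gower_pair(d, 4, 5)  # (0 + 3/29 + 0) / 3
gower_pair(d, 1, 6)  # (0 + 4/29 + 1) / 3
```

These two values agree with the distances reported for observations 4 and 1 in the output further down (0.03448276 and 0.37931034). Note that the age range depends on the rows included, so distances change when moving from 10 to 20 observations.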

Now I want to retrieve the closest observation for each one:

mg = as.matrix(ddmen)
mgg = mg %>% melt() %>% group_by(Var2) %>% filter(value != 0) %>%
  mutate(m = min(value)) %>%
  mutate(closest = Var1[m == value]) %>%
  as.data.frame()

close = mgg %>% dplyr::select(Var2, closest, dis = m) %>% distinct()

close gives me:

   Var2 closest        dis
1     1       6 0.37931034
2     2       9 0.05747126
3     3       8 0.34482759
4     4       5 0.03448276
5     5       4 0.03448276
6     6       9 0.18390805
7     7      10 0.34482759
8     8      10 0.01149425
9     9       2 0.05747126
10   10       8 0.01149425

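I can merge close into my original data:

# note: this overwrites the original id with the row number 1..10
dt$id = 1:10
dt2 = merge(dt, close, by.x = 'id', by.y = 'Var2', all = TRUE)

and then bind the pairs together: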

vlist = vector('list', 10)
for (i in 1:10) {
  # row of observation i, followed by the row of its closest match
  vlist[[i]] = dt2[c(which(dt2$id == i), dt2$closest[dt2$id == i]), ] %>%
    mutate(p = i)
}

bind_rows(vlist)

which gives:

   id child age               edu   y closest        dis  p
1   1     0  69      some college 495       6 0.37931034  1
2   6     0  65 secondary/primary 525       9 0.18390805  1
3   2     0  44 secondary/primary 260       9 0.05747126  2
4   9     0  49 secondary/primary 275       2 0.05747126  2
...
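Here, p identifies each matched pair, based on id. As you will notice, an individual can appear in several pairs: closest matches are not necessarily symmetric, so if 2 is the closest match of 1, 2 itself may be closer to some other observation than to 1.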
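
The asymmetry can be checked directly from the closest column shown above. The following is only a small sketch: the vector nearest is typed in by hand from that output, and the names nearest, mutual and pairs are made up for the illustration.

```r
# closest match of observations 1..10, copied from the `close` output
nearest <- c(6, 9, 8, 5, 4, 9, 10, 10, 2, 8)

# i and j form a mutual pair when j is closest to i AND i is closest to j
mutual <- nearest[nearest] == seq_along(nearest)
which(mutual)   # observations that are in a symmetric pair

# each mutual pair listed once, smaller index first
pairs <- unique(t(apply(cbind(seq_along(nearest), nearest)[mutual, ], 1, sort)))
pairs
```

With these ten observations only 2-9, 4-5 and 8-10 are mutual; observations 1, 3, 6 and 7 point to a partner whose own closest match is someone else.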

Questions

First, there is a small bug in this part of the code:

mgg = mg %>% melt() %>% group_by(Var2) %>% filter(value != 0) %>%
  mutate(m = min(value)) %>%
  mutate(closest = Var1[m == value]) %>%
  as.data.frame()

I get this error message: Column closest must be length 19 (the group size) or one, not 2

The code works for 10 observations but not for all 20 (the full dataset is provided below). Why?
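
One plausible explanation, offered as a sketch rather than a confirmed diagnosis, is a tie: if an observation has two equally close neighbours, Var1[m == value] returns two values where mutate() expects one per row. With all 20 rows, observation 12 (child 1, age 38, secondary/primary) differs by exactly 3 years from both observation 8 and observation 15, which share its child and edu values. The snippet below rebuilds the matching variables by hand and counts the minima per observation; the names x, n_tied and closest are made up for the illustration, and a small tolerance is used instead of == to stay robust to floating-point noise.

```r
library(cluster)

# Matching variables for all 20 rows of the data.
x <- data.frame(
  child = factor(c(0,0,1,1,1,0,1,1,0,1,0,1,0,0,1,0,1,1,0,0)),
  age   = c(69,44,40,47,50,65,43,41,49,42,32,38,52,59,35,51,38,36,63,44),
  edu   = factor(c("some college", "secondary/primary", "some college",
                   "secondary/primary", "secondary/primary",
                   "secondary/primary", "college", "secondary/primary",
                   "secondary/primary", "secondary/primary", "college",
                   "secondary/primary", "some college", "some college",
                   "secondary/primary", "secondary/primary", "college",
                   "college", "secondary/primary", "college"))
)

m <- as.matrix(daisy(x, metric = "gower"))
diag(m) <- Inf   # never match an observation to itself

# number of neighbours tied for the minimum distance, per observation
n_tied <- apply(m, 2, function(col) sum(abs(col - min(col)) < 1e-9))
which(n_tied > 1)   # observations with more than one closest neighbour

# tie-tolerant alternative: which.min() keeps only the first minimum
closest <- apply(m, 2, which.min)
```

Taking the first minimum is an arbitrary tie-breaking rule; keeping every tied neighbour (for example with filter(value == min(value)) in place of the two mutate() calls) would be another option, depending on what the pairs are used for.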

Second, is there a package that automates this?

dt = structure(list(id = c(11011209L, 11011212L, 11011213L, 11020208L, 
11020212L, 11020310L, 11020315L, 11020316L, 11031111L, 11031116L, 
11031119L, 11040801L, 11040814L, 11050109L, 11050111L, 11050113L, 
11051001L, 11051004L, 11051011L, 11051018L), child = structure(c(1L, 
1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 
2L, 1L, 1L), .Label = c("0", "1"), class = "factor"), age = c(69L, 
44L, 40L, 47L, 50L, 65L, 43L, 41L, 49L, 42L, 32L, 38L, 52L, 59L, 
35L, 51L, 38L, 36L, 63L, 44L), edu = structure(c(3L, 2L, 3L, 
2L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 3L, 3L, 2L, 2L, 1L, 1L, 2L, 
1L), .Label = c("college", "secondary/primary", "some college"
), class = "factor"), y = c(495, 260, 175, 0, 25, 525, 0, 5, 
275, 0, 425, 0, 260, 405, 20, 40, 165, 10, 455, 40)), class = "data.frame", 
.Names = c("id", 
"child", "age", "edu", "y"), row.names = c(NA, -20L))

0 answers