Random sampling by group based on another dataset, with cases that have no match

Time: 2017-07-28 04:44:36

Tags: r

I have two datasets like this:

df <- data.frame(id = 1:20,
             Sex = rep(x = c(0,1), each=10),
             age = c(25,56,29,42,33,33,33,25,25,25,26,57,30,43,34,34,34,26,26,26),
             ov = letters[1:20])

df1 <- data.frame(Sex = c(0,0,0,1,1),
              age = c(25,33,39,41,43))

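For each Sex and age group in df1, I want to take one random row from df. Not every age in df1 has a match in df, so for each df1 group with no match in df, I want to fill in the value of the variable ov from the df row with the same sex and the nearest age, like this: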

df3 <- rbind(df[c(8,7),2:4],c(0,39,"d"),c(1,41,"n"),df[14,2:4])

Note that the donor for the case Sex = 0 and age = 39 is df[4, ], and the donor for the case Sex = 1 and age = 41 is df[14, ].

How can I do this?

1 Answer:

Answer 0: (score: 1)

With data.table you could try something like this:

1) Convert the data to data.table and set the keys:

library(data.table) # needed for as.data.table(), setkey() and rolling joins
dt1 <- as.data.table(df1) # convert to data.table
dt1[, newSex := Sex] # this will serve as grouping column
dt1[, newage := age] # also this
setkey(dt1, Sex, age) # set data.tables keys
dt1
   Sex age newSex newage
1:   0  25      0     25
2:   0  33      0     33
3:   0  39      0     39
4:   1  41      1     41
5:   1  43      1     43

# we do similar with df:
dt <- as.data.table(df)
setkey(dt, Sex, age)
dt
    id Sex age ov
 1:  1   0  25  a
 2:  8   0  25  h
 3:  9   0  25  i
 4: 10   0  25  j
 5:  3   0  29  c
 6:  5   0  33  e
 7:  6   0  33  f
 8:  7   0  33  g
 9:  4   0  42  d
10:  2   0  56  b
11: 11   1  26  k
12: 18   1  26  r
13: 19   1  26  s
14: 20   1  26  t
15: 13   1  30  m
16: 15   1  34  o
17: 16   1  34  p
18: 17   1  34  q
19: 14   1  43  n
20: 12   1  57  l

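2) Using a rolling join, we get dtnew, in which every row of dt is assigned to its nearest group from dt1 (newSex, newage):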

dtnew <- dt1[dt, roll = "nearest"]
dtnew
    Sex age newSex newage id ov
 1:   0  25      0     25  1  a
 2:   0  25      0     25  8  h
 3:   0  25      0     25  9  i
 4:   0  25      0     25 10  j
 5:   0  29      0     25  3  c
 6:   0  33      0     33  5  e
 7:   0  33      0     33  6  f
 8:   0  33      0     33  7  g
 9:   0  42      0     39  4  d
10:   0  56      0     39  2  b
11:   1  26      1     41 11  k
12:   1  26      1     41 18  r
13:   1  26      1     41 19  s
14:   1  26      1     41 20  t
15:   1  30      1     41 13  m
16:   1  34      1     41 15  o
17:   1  34      1     41 16  p
18:   1  34      1     41 17  q
19:   1  43      1     43 14  n
20:   1  57      1     43 12  l
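To make the behaviour of roll = "nearest" easier to see, here is a minimal sketch on toy data. The tables groups and donors are hypothetical and are not part of the question; the join is written with on = instead of keys, which should behave the same way here. The roll applies only to the last join column (age), while Sex must match exactly. Because the joined age column takes its values from the table in i, the copy in newage is what records which group each row was rolled to.

```r
library(data.table)

# Hypothetical target groups; newage is a copy of age that survives the join
groups <- data.table(Sex = c(0, 0, 1),
                     age = c(25, 39, 41),
                     newage = c(25, 39, 41))

# Hypothetical donor rows whose ages do not match any group exactly
donors <- data.table(Sex = c(0, 0, 1),
                     age = c(26, 42, 30),
                     ov = c("a", "d", "m"))

# Sex is matched exactly; age is rolled to the nearest value in groups
res <- groups[donors, on = c("Sex", "age"), roll = "nearest"]

# res$age holds the donors' own ages, res$newage the group they were assigned to
res
```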

3) Now we can draw the sample. In your case we can simply put the rows in random order and then take the first row of each group:

dtnew <- dtnew[sample(.N)] #create random order
sampleDT <- unique(dtnew, by = c("newSex", "newage")) #take first unique by newSex and newage
sampleDT
   Sex age newSex newage id ov
1:   0  56      0     39  2  b
2:   0  29      0     25  3  c
3:   1  43      1     43 14  n
4:   1  34      1     41 16  p
5:   0  33      0     33  7  g
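Note that in the sample above, the group Sex = 0, age = 25 received a row with age 29 and the group Sex = 0, age = 39 received a row with age 56, because every row of df is assigned to its nearest df1 group. If a donor should instead come only from the exact age when one exists, and otherwise only from the single nearest age, the join can be rolled in the other direction. The following is a sketch of that variant, not the answer author's method; the names donor_ages, donor_age, target_age, candidates and picked are hypothetical, and set.seed() is used only to make the draw repeatable.

```r
library(data.table)

df <- data.frame(id = 1:20,
                 Sex = rep(c(0, 1), each = 10),
                 age = c(25,56,29,42,33,33,33,25,25,25,26,57,30,43,34,34,34,26,26,26),
                 ov = letters[1:20],
                 stringsAsFactors = FALSE)
df1 <- data.frame(Sex = c(0, 0, 0, 1, 1),
                  age = c(25, 33, 39, 41, 43))

dt  <- as.data.table(df)
dt1 <- as.data.table(df1)

# Distinct (Sex, age) combinations available in df; donor_age keeps the
# donor's own age, because the joined age column takes its value from dt1
donor_ages <- unique(dt[, .(Sex, age)])
donor_ages[, donor_age := age]

# For each df1 group, roll to the nearest donor age within the same Sex
matched <- donor_ages[dt1, on = c("Sex", "age"), roll = "nearest"]
setnames(matched, "age", "target_age")

# Attach every df row that has the chosen donor age to its target group
candidates <- dt[matched, on = c("Sex", age = "donor_age"), allow.cartesian = TRUE]

# Draw one donor row at random per target group
set.seed(1)
picked <- candidates[, .SD[sample(.N, 1L)], by = .(Sex, target_age)]
picked
```

With the data from the question, the groups with age 39 and age 41 each have a single candidate (ov "d" and "n"), which agrees with the donors named in the question, while the groups with exact matches are sampled only among rows of that exact age.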