Say I have this data frame:
# Set random seed
set.seed(33550336)
# Number of IDs
n <- 5
# Create data frames
df <- data.frame(ID = rep(1:n, each = 10),
                 loc = seq(10, 100, by = 10))
# ID loc
# 1 1 10
# 2 1 20
# 3 1 30
# 4 1 40
# 5 1 50
# 6 1 60
# 7 1 70
# 8 1 80
# 9 1 90
# 10 1 100
# 11 2 10
# 12 2 20
# 13 2 30
# 14 2 40
# 15 2 50
# 16 2 60
# 17 2 70
# 18 2 80
# 19 2 90
# 20 2 100
# 21 3 10
# 22 3 20
# 23 3 30
# 24 3 40
# 25 3 50
# 26 3 60
# 27 3 70
# 28 3 80
# 29 3 90
# 30 3 100
# 31 4 10
# 32 4 20
# 33 4 30
# 34 4 40
# 35 4 50
# 36 4 60
# 37 4 70
# 38 4 80
# 39 4 90
# 40 4 100
# 41 5 10
# 42 5 20
# 43 5 30
# 44 5 40
# 45 5 50
# 46 5 60
# 47 5 70
# 48 5 80
# 49 5 90
# 50 5 100
Now I want to join a second data frame to it:
df_alt <- data.frame(ID = rep(1:n, each = 10),
                     loc = sample(1:100, 5 * n, replace = TRUE),
                     value = runif(n))
# ID loc value
# 1 1 87 0.3202490
# 2 1 36 0.4724253
# 3 1 53 0.4750352
# 4 1 7 0.8744985
# 5 1 38 0.2016645
# 6 1 92 0.3202490
# 7 1 74 0.4724253
# 8 1 72 0.4750352
# 9 1 73 0.8744985
# 10 1 95 0.2016645
# 11 2 61 0.3202490
# 12 2 5 0.4724253
# 13 2 87 0.4750352
# 14 2 11 0.8744985
# 15 2 10 0.2016645
# 16 2 25 0.3202490
# 17 2 60 0.4724253
# 18 2 62 0.4750352
# 19 2 52 0.8744985
# 20 2 31 0.2016645
# 21 3 3 0.3202490
# 22 3 43 0.4724253
# 23 3 45 0.4750352
# 24 3 91 0.8744985
# 25 3 51 0.2016645
# 26 3 87 0.3202490
# 27 3 36 0.4724253
# 28 3 53 0.4750352
# 29 3 7 0.8744985
# 30 3 38 0.2016645
# 31 4 92 0.3202490
# 32 4 74 0.4724253
# 33 4 72 0.4750352
# 34 4 73 0.8744985
# 35 4 95 0.2016645
# 36 4 61 0.3202490
# 37 4 5 0.4724253
# 38 4 87 0.4750352
# 39 4 11 0.8744985
# 40 4 10 0.2016645
# 41 5 25 0.3202490
# 42 5 60 0.4724253
# 43 5 62 0.4750352
# 44 5 52 0.8744985
# 45 5 31 0.2016645
# 46 5 3 0.3202490
# 47 5 43 0.4724253
# 48 5 45 0.4750352
# 49 5 91 0.8744985
# 50 5 51 0.2016645
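I want an exact match on ID and the nearest match on loc. I looked at the fuzzyjoin package, but unfortunately you can't have different degrees of fuzziness for different columns. That is, I can't specify an exact match for ID and a fuzzy match for loc. So, as a workaround, I do a left join on ID, compute the distance between loc.x and loc.y (i.e., between loc in df and loc in df_alt), group by ID and loc.x, sort by that distance, and take the first row of each group (i.e., the shortest distance):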
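library(dplyr)
# Bind and find nearest
df_res <- df %>%
  left_join(df_alt, by = "ID") %>%        # every df row paired with every df_alt row of the same ID
  mutate(delta = abs(loc.x - loc.y)) %>%  # distance between the two loc values
  group_by(ID, loc.x) %>%
  arrange(delta) %>%
  filter(row_number() == 1) %>%           # keep the closest candidate per (ID, loc.x)
  ungroup() %>%
  arrange(ID, loc.x)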
# # A tibble: 50 x 5
# ID loc.x loc.y value delta
# <int> <dbl> <int> <dbl> <dbl>
# 1 1 10 7 0.874 3
# 2 1 20 7 0.874 13
# 3 1 30 36 0.472 6
# 4 1 40 38 0.202 2
# 5 1 50 53 0.475 3
# 6 1 60 53 0.475 7
# 7 1 70 72 0.475 2
# 8 1 80 74 0.472 6
# 9 1 90 92 0.320 2
# 10 1 100 95 0.202 5
# 11 2 10 10 0.202 0
# 12 2 20 25 0.320 5
# 13 2 30 31 0.202 1
# 14 2 40 31 0.202 9
# 15 2 50 52 0.874 2
# 16 2 60 60 0.472 0
# 17 2 70 62 0.475 8
# 18 2 80 87 0.475 7
# 19 2 90 87 0.475 3
# 20 2 100 87 0.475 13
# 21 3 10 7 0.874 3
# 22 3 20 7 0.874 13
# 23 3 30 36 0.472 6
# 24 3 40 38 0.202 2
# 25 3 50 51 0.202 1
# 26 3 60 53 0.475 7
# 27 3 70 87 0.320 17
# 28 3 80 87 0.320 7
# 29 3 90 91 0.874 1
# 30 3 100 91 0.874 9
# 31 4 10 10 0.202 0
# 32 4 20 11 0.874 9
# 33 4 30 11 0.874 19
# 34 4 40 61 0.320 21
# 35 4 50 61 0.320 11
# 36 4 60 61 0.320 1
# 37 4 70 72 0.475 2
# 38 4 80 74 0.472 6
# 39 4 90 92 0.320 2
# 40 4 100 95 0.202 5
# 41 5 10 3 0.320 7
# 42 5 20 25 0.320 5
# 43 5 30 31 0.202 1
# 44 5 40 43 0.472 3
# 45 5 50 51 0.202 1
# 46 5 60 60 0.472 0
# 47 5 70 62 0.475 8
# 48 5 80 91 0.874 11
# 49 5 90 91 0.874 1
# 50 5 100 91 0.874 9
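This isn't particularly efficient, but it gives the desired result. The problem arises when the data frames get large. Rerunning the code above with a sufficiently large n produces the following error:
Error: cannot allocate vector of size ...
I think this is because the left join produces an unnecessarily huge intermediate data frame. Clearly, join-then-filter is not the best strategy. But what is the best way to do a fuzzy and a non-fuzzy join at the same time?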
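To make the blow-up concrete, here is a rough back-of-the-envelope sketch. It is not part of the original code, and the helper name joined_rows is made up for illustration. It counts how many rows a join on ID alone would pair up, assuming every ID of interest appears in both tables:

```r
# Hypothetical helper: number of row pairs produced by joining on ID alone.
# Every row of an ID in `ids_x` is paired with every row of that ID in `ids_y`.
# (IDs present in only one table are ignored here.)
joined_rows <- function(ids_x, ids_y) {
  common <- intersect(unique(ids_x), unique(ids_y))
  cx <- table(factor(ids_x, levels = common))
  cy <- table(factor(ids_y, levels = common))
  # as.numeric() avoids integer overflow for large tables
  sum(as.numeric(cx) * as.numeric(cy))
}

n <- 5
ids <- rep(1:n, each = 10)   # same ID layout as df and df_alt above
joined_rows(ids, ids)        # 500 intermediate rows for a 50-row result
```

With k rows per ID in each table, the intermediate result grows as n * k^2 while the final result only needs n * k rows, which is consistent with the allocation error appearing as the data grows.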
Answer 0 (score: 4)
I think the data.table package is best suited for this job:
library(data.table)
setDT(df)
setDT(df_alt)
# Rolling join: ID is matched exactly; roll applies to the last `on` column (loc)
df_alt[df
       , on = .(ID, loc)
       , roll = "nearest"
       , .(ID, loc.x = i.loc, loc.y = x.loc, value, delta = abs(i.loc - x.loc))]
which gives:
    ID loc.x loc.y     value delta
 1:  1    10     7 0.8744985     3
 2:  1    20     7 0.8744985    13
 3:  1    30    36 0.4724253     6
 4:  1    40    38 0.2016645     2
 5:  1    50    53 0.4750352     3
 6:  1    60    53 0.4750352     7
 7:  1    70    72 0.4750352     2
 8:  1    80    74 0.4724253     6
 9:  1    90    92 0.3202490     2
10:  1   100    95 0.2016645     5
11:  2    10    10 0.2016645     0
12:  2    20    25 0.3202490     5
13:  2    30    31 0.2016645     1
14:  2    40    31 0.2016645     9
15:  2    50    52 0.8744985     2
16:  2    60    60 0.4724253     0
17:  2    70    62 0.4750352     8
18:  2    80    87 0.4750352     7
19:  2    90    87 0.4750352     3
20:  2   100    87 0.4750352    13
21:  3    10     7 0.8744985     3
22:  3    20     7 0.8744985    13
23:  3    30    36 0.4724253     6
24:  3    40    38 0.2016645     2
25:  3    50    51 0.2016645     1
26:  3    60    53 0.4750352     7
27:  3    70    53 0.4750352    17
28:  3    80    87 0.3202490     7
29:  3    90    91 0.8744985     1
30:  3   100    91 0.8744985     9
31:  4    10    10 0.2016645     0
32:  4    20    11 0.8744985     9
33:  4    30    11 0.8744985    19
34:  4    40    61 0.3202490    21
35:  4    50    61 0.3202490    11
36:  4    60    61 0.3202490     1
37:  4    70    72 0.4750352     2
38:  4    80    74 0.4724253     6
39:  4    90    92 0.3202490     2
40:  4   100    95 0.2016645     5
41:  5    10     3 0.3202490     7
42:  5    20    25 0.3202490     5
43:  5    30    31 0.2016645     1
44:  5    40    43 0.4724253     3
45:  5    50    51 0.2016645     1
46:  5    60    60 0.4724253     0
47:  5    70    62 0.4750352     8
48:  5    80    91 0.8744985    11
49:  5    90    91 0.8744985     1
50:  5   100    91 0.8744985     9
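To see the rolling join in isolation, here is a minimal sketch on a toy example. The tables lookup and query and their values are made up for illustration; only the join call mirrors the one above. ID is matched exactly, and within each ID the query loc is rolled to the closest loc in the lookup table:

```r
library(data.table)

# Toy lookup table (made-up values): a few locations per ID
lookup <- data.table(ID    = c(1, 1, 1, 2, 2),
                     loc   = c(7, 36, 92, 10, 60),
                     value = c("a", "b", "c", "d", "e"))

# Points to match: exact on ID, nearest on loc
query <- data.table(ID = c(1, 1, 2), loc = c(10, 80, 40))

res <- lookup[query,
              on = .(ID, loc),
              roll = "nearest",
              .(ID, loc.x = i.loc, loc.y = x.loc, value)]
res
# ID 1, loc 10 -> 7  ("a")
# ID 1, loc 80 -> 92 ("c")
# ID 2, loc 40 -> 60 ("e")
```

Because the match is resolved inside the join, the full ID-by-ID cross product is never materialized. One thing to watch is ties: in the outputs above, ID 3 with loc 70 is matched to 87 by the join-then-filter version and to 53 by the rolling join, both at a distance of 17, so the two approaches may pick different rows when two candidates are equally close.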