我被困在合并两个数据集中,其简单的复杂性远远超出我在R中的流利程度。我试图从here和here学习,但无法解决我的问题。我正在尝试合并以下两个数据框:
DF1
No County Route Number
1 Anderson SR009 6150
2 Anderson SR061 5880
3 Bedford SR016 9500
4 Bedford SR130 320
5 .
6 .
7 .
8 .
DF2
No. County Route Number1 abc def
1 Clay 02264 4500 50 789
2 Dickson 01544 5870 45 33
3 Anderson 01421 981 70 65
4 Anderson SR009 10000 56 56
5 Anderson SR009 6145 32 53
6 Bedford SR016 7500 23 32
7 Anderson SR061 4400 12 24
8 Anderson SR061 5875 87 26
9 Anderson SR061 15000 45 45
10 Bedford SR016 22000 71 75
11 Bedford SR016 9450 145 615
12 Bedford SR130 900 7854 76
13 Bedford SR130 310 124 25
14 Anderson SR061 5865 312 123
.
.
.
首先,应比较df1和df2中的“county”和“Route”列,如果它们完全匹配,则应选择df2 $ Number1的特定行,其值为NEAREST到df1 $ Number,因此应将所有唯一的df2列添加到df1
这是我想要实现的伪代码:
if(df1$County == Anderson & df2$County == Anderson) && if(df1$Route == SR009 & df2$Route == SR009)
then select specific row from df2$Number1 whose value is nearest to the df1$Number value,
and add all subsequent columns of df2 to corresponding row in df1
一个例子:
基于“县”和“路线”列,df1中的第1行与df2中的第4行和第5行匹配。现在,在df1中与第一行匹配的两个df2行中,我想选择df2中的特定行,其“Number1”值最接近df1中的“Number”值,即6150.说这个,我想选择行5在df2中,因为“Number1”值是6145,最接近6150,并将所有后续列从df2添加到df1 ...
最终输出如下:
No County Route Number Number1 abc def . .
1 Anderson SR009 6150 6145 32 53 . .
2 Anderson SR061 5880 5875 87 26 . .
3 Bedford SR016 9500 9450 145 615 . .
4 Bedford SR139 320 310 124 25 . .
. .
. .
我非常感谢您提供的任何帮助。对不起,很长的帖子。
答案 0 :(得分:0)
你的问题有点令人困惑。从中可以看出,我希望以下dplyr
方法对您有用。
library(dplyr)
d1%>%
full_join(d2, by = c("County", "Route")) %>%
group_by(County, Route) %>%
mutate(myDiff = abs(Number - Number1)) %>%
slice(which.min(myDiff))
答案 1 :(得分:0)
使用library(data.table)
setkey(dt1, County, Route)
setkey(dt2, County, Route)
dt3 = dt1[dt2]
dt3[, Number.close := Number1[which.min(abs(Number1-Number))], by = .(County, Route)]
dt3 = dt3[Number.close == Number1, ][, Number.close:=NULL][]
# No County Route Number No. Number1 abc def
# 1: 1 Anderson SR009 6150 5 6145 32 53
# 2: 2 Anderson SR061 5880 8 5875 87 26
# 3: 3 Bedford SR016 9500 11 9450 145 615
# 4: 4 Bedford SR130 320 13 310 124 25
数据:
dt1 = structure(list(No = 1:4, County = c("Anderson", "Anderson", "Bedford",
"Bedford"), Route = c("SR009", "SR061", "SR016", "SR130"), Number = c(6150L,
5880L, 9500L, 320L)), .Names = c("No", "County", "Route", "Number"
), row.names = c(NA, -4L), class = c("data.table", "data.frame"
), .internal.selfref = <pointer: 0x000000000b290788>, sorted = c("County",
"Route"))
dt2 = structure(list(No. = c(3L, 4L, 5L, 7L, 8L, 9L, 14L, 6L, 10L,
11L, 12L, 13L, 1L, 2L), County = c("Anderson", "Anderson", "Anderson",
"Anderson", "Anderson", "Anderson", "Anderson", "Bedford", "Bedford",
"Bedford", "Bedford", "Bedford", "Clay", "Dickson"), Route = c("01421",
"SR009", "SR009", "SR061", "SR061", "SR061", "SR061", "SR016",
"SR016", "SR016", "SR130", "SR130", "02264", "01544"), Number1 = c(981L,
10000L, 6145L, 4400L, 5875L, 15000L, 5865L, 7500L, 22000L, 9450L,
900L, 310L, 4500L, 5870L), abc = c(70L, 56L, 32L, 12L, 87L, 45L,
312L, 23L, 71L, 145L, 7854L, 124L, 50L, 45L), def = c(65L, 56L,
53L, 24L, 26L, 45L, 123L, 32L, 75L, 615L, 76L, 25L, 789L, 33L
)), .Names = c("No.", "County", "Route", "Number1", "abc", "def"
), row.names = c(NA, -14L), class = c("data.table", "data.frame"
), .internal.selfref = <pointer: 0x000000000b290788>, sorted = c("County",
"Route"))