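Merge two data frames on two common columns, keeping the row whose value in a third column is closest

Time: 2016-10-14 23:52:21

Tags: r

I am stuck merging two datasets. The problem is simple, but it is well beyond my fluency in R. I tried to learn from here and here, but could not solve my problem. I am trying to merge the following two data frames: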

DF1

No    County       Route      Number
1     Anderson       SR009       6150
2     Anderson       SR061       5880
3     Bedford        SR016       9500
4     Bedford        SR130       320
5       .
6       .
7       .
8       .

DF2

No.   County        Route     Number1    abc      def
1     Clay          02264     4500       50       789
2     Dickson       01544     5870       45       33
3     Anderson      01421     981        70       65
4     Anderson      SR009     10000      56       56
5     Anderson      SR009     6145       32       53
6     Bedford       SR016     7500       23       32
7     Anderson      SR061     4400       12       24
8     Anderson      SR061     5875       87       26
9     Anderson      SR061     15000      45       45
10    Bedford       SR016     22000      71       75
11    Bedford       SR016     9450       145      615
12    Bedford       SR130     900        7854     76
13    Bedford       SR130     310        124      25
14    Anderson      SR061     5865       312      123
       .
       .
       .

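First, the "County" and "Route" columns of df1 and df2 should be compared. Where they match exactly, the row of df2 whose Number1 value is nearest to df1$Number should be selected, and all columns unique to df2 should then be added to df1.

Here is pseudocode for what I want to achieve: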

if(df1$County == Anderson & df2$County == Anderson) && if(df1$Route == SR009 & df2$Route == SR009) 
then select specific row from df2$Number1 whose value is nearest to the df1$Number value, 
and add all subsequent columns of df2 to corresponding row in df1

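An example:

Based on the "County" and "Route" columns, row 1 of df1 matches rows 4 and 5 of df2. Of these two df2 rows, I want to select the one whose "Number1" value is closest to the "Number" value in df1, which is 6150. So I want to select row 5 of df2, because its "Number1" value of 6145 is closest to 6150, and add all subsequent columns from df2 to df1 ...

The final output would look like this: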
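The three steps in this example can be sketched in base R for a single df1 row. This is only an illustration under stated assumptions: the data frames below are cut-down copies of the question's data, and `candidates`, `nearest` and `result` are made-up helper names.

```r
# Row 1 of df1, and the Anderson rows of df2 (reduced copies of the question's data)
df1 <- data.frame(No = 1L, County = "Anderson", Route = "SR009", Number = 6150L)
df2 <- data.frame(
  County  = c("Anderson", "Anderson", "Anderson"),
  Route   = c("01421", "SR009", "SR009"),
  Number1 = c(981L, 10000L, 6145L),
  abc     = c(70L, 56L, 32L),
  def     = c(65L, 56L, 53L)
)

# Step 1: keep the df2 rows whose County and Route match the df1 row
candidates <- df2[df2$County == df1$County & df2$Route == df1$Route, ]

# Step 2: of those, take the row whose Number1 is closest to df1$Number
nearest <- candidates[which.min(abs(candidates$Number1 - df1$Number)), ]

# Step 3: append the df2-only columns to the df1 row
result <- cbind(df1, nearest[, c("Number1", "abc", "def")])
print(result)
```

Note that `which.min()` returns the first minimum, so if two candidates were equally close, the one appearing first in df2 would be kept.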

No      County           Route       Number     Number1     abc    def  .  .    
1       Anderson         SR009       6150       6145        32     53   .  .
2       Anderson         SR061       5880       5875        87     26   .  .
3       Bedford          SR016       9500       9450        145    615  .  .
4       Bedford          SR130       320        310         124    25   .  .
.          .
.          .

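I would greatly appreciate any help. Sorry for the long post.

2 Answers:

Answer 0: (score: 0)

Your question is a bit confusing. From what I can gather, I hope the following dplyr approach works for you.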

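library(dplyr)

# d1 and d2 stand for the question's two data frames (DF1 and DF2)
d1 %>%
  # join on County and Route (a full join also keeps rows without a match)
  full_join(d2, by = c("County", "Route")) %>%
  group_by(County, Route) %>%
  # absolute distance between the two numbers
  mutate(myDiff = abs(Number - Number1)) %>%
  # keep the closest row within each County/Route group
  slice(which.min(myDiff))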
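For anyone who wants to try this on the question's data, here is a self-contained sketch of the same idea. It differs from the answer in two ways that are my own assumptions, not part of the original: it uses `inner_join()` so that df2 rows without a partner in df1 never enter the result, and it groups by the df1 row id `No` so that each df1 row keeps its own nearest match. The data frames are retyped from the tables in the question, without the df2 row number column.

```r
library(dplyr)

d1 <- data.frame(
  No     = 1:4,
  County = c("Anderson", "Anderson", "Bedford", "Bedford"),
  Route  = c("SR009", "SR061", "SR016", "SR130"),
  Number = c(6150L, 5880L, 9500L, 320L)
)
d2 <- data.frame(
  County  = c("Clay", "Dickson", "Anderson", "Anderson", "Anderson", "Bedford", "Anderson",
              "Anderson", "Anderson", "Bedford", "Bedford", "Bedford", "Bedford", "Anderson"),
  Route   = c("02264", "01544", "01421", "SR009", "SR009", "SR016", "SR061",
              "SR061", "SR061", "SR016", "SR016", "SR130", "SR130", "SR061"),
  Number1 = c(4500L, 5870L, 981L, 10000L, 6145L, 7500L, 4400L,
              5875L, 15000L, 22000L, 9450L, 900L, 310L, 5865L),
  abc     = c(50L, 45L, 70L, 56L, 32L, 23L, 12L, 87L, 45L, 71L, 145L, 7854L, 124L, 312L),
  def     = c(789L, 33L, 65L, 56L, 53L, 32L, 24L, 26L, 45L, 75L, 615L, 76L, 25L, 123L)
)

nearest <- d1 %>%
  # keep only County/Route combinations present in both data frames
  inner_join(d2, by = c("County", "Route")) %>%
  # one group per df1 row
  group_by(No) %>%
  # within each group, keep the row with the smallest absolute difference
  slice(which.min(abs(Number - Number1))) %>%
  ungroup()

print(nearest)
```

Grouping by `No` only matters if df1 can contain several rows with the same County and Route; with the data shown in the question, both groupings select the same four rows.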

Answer 1: (score: 0)

Using library(data.table):

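# key both tables on the columns to join on
setkey(dt1, County, Route)
setkey(dt2, County, Route)
# join on the keys; every row of dt2 is kept
dt3 = dt1[dt2]
# per County/Route group, find the Number1 value nearest to Number
dt3[, Number.close := Number1[which.min(abs(Number1-Number))], by = .(County, Route)]
# keep only the rows holding that nearest value, then drop the helper column
dt3 = dt3[Number.close == Number1, ][, Number.close:=NULL][]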

#    No   County Route Number No. Number1 abc def
# 1:  1 Anderson SR009   6150   5    6145  32  53
# 2:  2 Anderson SR061   5880   8    5875  87  26
# 3:  3  Bedford SR016   9500  11    9450 145 615
# 4:  4  Bedford SR130    320  13     310 124  25
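As a possible alternative (not part of this answer), data.table's rolling join with `roll = "nearest"` can do the nearest-value lookup inside the join itself. The sketch below assumes a reduced copy of the question's data, and `join_val` is a made-up helper column: the join reports the rolled column with the values from `dt1`, so a copy of `Number1` is used for matching while the original `Number1` stays intact.

```r
library(data.table)

dt1 <- data.table(
  No     = 1:4,
  County = c("Anderson", "Anderson", "Bedford", "Bedford"),
  Route  = c("SR009", "SR061", "SR016", "SR130"),
  Number = c(6150L, 5880L, 9500L, 320L)
)
dt2 <- data.table(
  County  = c("Anderson", "Anderson", "Anderson", "Anderson",
              "Bedford", "Bedford", "Bedford", "Bedford"),
  Route   = c("SR009", "SR009", "SR061", "SR061",
              "SR016", "SR016", "SR130", "SR130"),
  Number1 = c(10000L, 6145L, 5875L, 5865L, 7500L, 9450L, 900L, 310L),
  abc     = c(56L, 32L, 87L, 312L, 23L, 145L, 7854L, 124L),
  def     = c(56L, 53L, 26L, 123L, 32L, 615L, 76L, 25L)
)

# Helper copy of Number1 used only for matching (hypothetical name)
dt2[, join_val := Number1]

# Exact match on County and Route, nearest match on the last join column
res <- dt2[dt1, on = .(County, Route, join_val = Number), roll = "nearest"]

# After the join, join_val holds the values of dt1$Number; rename accordingly
setnames(res, "join_val", "Number")
print(res)
```

The result has one row per row of `dt1`, in the order of `dt1`. A County/Route pair that does not occur in `dt2` would yield NA in the `dt2` columns rather than being dropped.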

Data:

dt1 = structure(list(No = 1:4, County = c("Anderson", "Anderson", "Bedford", 
"Bedford"), Route = c("SR009", "SR061", "SR016", "SR130"), Number = c(6150L, 
5880L, 9500L, 320L)), .Names = c("No", "County", "Route", "Number"
), row.names = c(NA, -4L), class = c("data.table", "data.frame"
), # .internal.selfref pointer from dput() omitted so that this call parses
sorted = c("County", "Route"))

dt2 = structure(list(No. = c(3L, 4L, 5L, 7L, 8L, 9L, 14L, 6L, 10L, 
11L, 12L, 13L, 1L, 2L), County = c("Anderson", "Anderson", "Anderson", 
"Anderson", "Anderson", "Anderson", "Anderson", "Bedford", "Bedford", 
"Bedford", "Bedford", "Bedford", "Clay", "Dickson"), Route = c("01421", 
"SR009", "SR009", "SR061", "SR061", "SR061", "SR061", "SR016", 
"SR016", "SR016", "SR130", "SR130", "02264", "01544"), Number1 = c(981L, 
10000L, 6145L, 4400L, 5875L, 15000L, 5865L, 7500L, 22000L, 9450L, 
900L, 310L, 4500L, 5870L), abc = c(70L, 56L, 32L, 12L, 87L, 45L, 
312L, 23L, 71L, 145L, 7854L, 124L, 50L, 45L), def = c(65L, 56L, 
53L, 24L, 26L, 45L, 123L, 32L, 75L, 615L, 76L, 25L, 789L, 33L
)), .Names = c("No.", "County", "Route", "Number1", "abc", "def"
), row.names = c(NA, -14L), class = c("data.table", "data.frame"
), # .internal.selfref pointer from dput() omitted so that this call parses
sorted = c("County", "Route"))