R: Merge two data frames when either of two conditions matches

Asked: 2016-08-03 20:32:00

Tags: r merge data-manipulation

Suppose I have two data frames, as follows:

n = c(2, 3, 5, 5, 6, 7) 
s = c("aa", "bb", "cc", "dd", "ee", "ff") 
b = c(2, 4, 5, 4, 3, 2) 
df = data.frame(n, s, b)
#  n  s b
#1 2 aa 2
#2 3 bb 4
#3 5 cc 5  
#4 5 dd 4
#5 6 ee 3
#6 7 ff 2

n2 = c(5, 6, 7, 6) 
s2 = c("aa", "bb", "cc", "ll") 
b2 = c("hh", "nn", "ff", "dd")  
df2 = data.frame(n2, s2, b2)

 #   n2 s2 b2
 #1  5 aa hh
 #2  6 bb nn
 #3  7 cc ff
 #4  6 ll dd

I want to merge them to get the following result:

 #n s  b n2 s2 b2
 #2 aa 2 5  aa hh
 #3 bb 4 6  bb nn
 #5 cc 5 7  cc ff
 #5 dd 4 6  ll dd

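Basically, I want to merge the two data frames whenever a value in column s of the first data frame is found in either column s2 or column b2 of df2.

I know that merge works when I specify two columns from each data frame, but I'm not sure how to add an OR condition to the merge function, or how to achieve this with commands from other packages such as dplyr.

Also, to clarify: there will be cases where s2 and b2 in the same row both match the s column. In that case, the rows should be merged only once.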

4 answers:

Answer 0 (score: 3)

If you are familiar with SQL, you can use it:

library(sqldf)
res <- sqldf("SELECT l.*, r.*
              FROM df as l
              INNER JOIN df2 as r
              on l.s = r.s2 OR l.s = r.b2")

res
  n  s b n2 s2 b2
1 2 aa 2  5 aa hh
2 3 bb 4  6 bb nn
3 5 cc 5  7 cc ff
4 5 dd 4  6 ll dd
5 7 ff 2  7 cc ff

Data

df<-structure(list(n = c(2, 3, 5, 5, 6, 7), s = structure(1:6, .Label = c("aa", 
"bb", "cc", "dd", "ee", "ff"), class = "factor"), b = c(2, 4, 
5, 4, 3, 2)), .Names = c("n", "s", "b"), row.names = c(NA, -6L
), class = "data.frame")

df2<-structure(list(n2 = c(5, 6, 7, 6), s2 = structure(1:4, .Label = c("aa", 
"bb", "cc", "ll"), class = "factor"), b2 = structure(c(3L, 4L, 
2L, 1L), .Label = c("dd", "ff", "hh", "nn"), class = "factor")), .Names = c("n2", 
"s2", "b2"), row.names = c(NA, -4L), class = "data.frame")

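Answer 1 (score: 3)

There are two issues: 1) You built the data frames with factors, which tend to mess up matching and indexing, so I used stringsAsFactors = FALSE in the data.frame calls. 2) When s2 and b2 both have a match in the s column, the situation is ambiguous and you have not said how to resolve it (this happens in row 3 of your example):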

> df2[c("s")] <- list( c( df$s[pmax( match( df2$s2 , df$s), match(df2$b2, df$s),na.rm=TRUE)]))
> df2
  n2 s2 b2  s
1  5 aa hh aa
2  6 bb nn bb
3  7 cc ff ff
4  6 ll dd dd
> df2[c("s")] <- list( c( df$s[pmin( match( df2$s2 , df$s), match(df2$b2, df$s),na.rm=TRUE)]))
> df2
  n2 s2 b2  s
1  5 aa hh aa
2  6 bb nn bb
3  7 cc ff cc
4  6 ll dd dd
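
To see why the two calls above differ only in row 3, it helps to look at the index vectors that match() returns. This is a small sketch that assumes the character-column versions of the data; the vectors are inlined and the names i_s2, i_b2, first_hit and last_hit are made up for illustration:

```r
s  <- c("aa", "bb", "cc", "dd", "ee", "ff")   # df$s
s2 <- c("aa", "bb", "cc", "ll")               # df2$s2
b2 <- c("hh", "nn", "ff", "dd")               # df2$b2

i_s2 <- match(s2, s)   # position in s of each s2 value, NA if absent
i_b2 <- match(b2, s)   # position in s of each b2 value, NA if absent

i_s2   #  1  2  3 NA
i_b2   # NA NA  6  4

# With na.rm = TRUE an NA is ignored as long as the other index is present.
first_hit <- pmin(i_s2, i_b2, na.rm = TRUE)   # 1 2 3 4 -> row 3 takes "cc"
last_hit  <- pmax(i_s2, i_b2, na.rm = TRUE)   # 1 2 6 4 -> row 3 takes "ff"
```

If neither s2 nor b2 has a match, both indices are NA and the result stays NA, so that row of df2 would simply receive NA.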

Once you have resolved the ambiguity to your satisfaction, use the same method to extract the matching "b" values:

> df2[c("b")] <- list( c( df$b[pmin( match( df2$s2 , df$s), match(df2$b2, df$s),na.rm=TRUE)]))
> df2
  n2 s2 b2  s b
1  5 aa hh aa 2
2  6 bb nn bb 4
3  7 cc ff cc 5
4  6 ll dd dd 4

The modified data frames:

> dput(df)
structure(list(n = c(2, 3, 5, 5, 6, 7), s = c("aa", "bb", "cc", 
"dd", "ee", "ff"), b = c(2, 4, 5, 4, 3, 2)), .Names = c("n", 
"s", "b"), row.names = c(NA, -6L), class = "data.frame")
> dput(df2)
structure(list(n2 = c(5, 6, 7, 6), s2 = c("aa", "bb", "cc", "ll"
), b2 = c("hh", "nn", "ff", "dd"), s = c("aa", "bb", "cc", "dd"
), b = c(2, 4, 5, 4)), row.names = c(NA, -4L), .Names = c("n2", 
"s2", "b2", "s", "b"), class = "data.frame")

One-step solution:

> df2[c("s", "c")] <-  df[pmin( match( df2$s2 , df$s), match(df2$b2, df$s),na.rm=TRUE), c("s", "b")]
> df2
  n2 s2 b2  s c
1  5 aa hh aa 2
2  6 bb nn bb 4
3  7 cc ff cc 5
4  6 ll dd dd 4
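
The data.frame calls mentioned at the start of this answer are not shown. Here is a sketch of what they presumably look like, reusing the vectors from the question; the check on the last line is an addition for illustration:

```r
n  <- c(2, 3, 5, 5, 6, 7)
s  <- c("aa", "bb", "cc", "dd", "ee", "ff")
b  <- c(2, 4, 5, 4, 3, 2)
df <- data.frame(n, s, b, stringsAsFactors = FALSE)

n2  <- c(5, 6, 7, 6)
s2  <- c("aa", "bb", "cc", "ll")
b2  <- c("hh", "nn", "ff", "dd")
df2 <- data.frame(n2, s2, b2, stringsAsFactors = FALSE)

# The key columns are now plain character vectors rather than factors.
sapply(list(df$s, df2$s2, df2$b2), is.character)
```

Since R 4.0.0, stringsAsFactors defaults to FALSE, so the argument only matters on older versions.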

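Answer 2 (score: 0)

A base R approach is to rbind two merges. Each merge consumes the join key from df2 (s2 in the first, b2 in the second), so you need to re-create that column in each result before the two frames can be stacked. Also note that row #5 below does not appear in your expected result: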

t1 <- merge(df, df2, by.x=c("s"), by.y=c("s2"))
t1$s2 <- t1$s

t2 <- merge(df, df2, by.x=c("s"), by.y=c("b2"))
t2$b2 <- t2$s

finaldf <- rbind(t1, t2)

#    s n b n2 b2 s2
# 1 aa 2 2  5 hh aa
# 2 bb 3 4  6 nn bb
# 3 cc 5 5  7 ff cc
# 4 dd 5 4  6 dd ll
# 5 ff 7 2  7 ff cc
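
The question also mentions rows where s2 and b2 both match the same s value. In that case each of the two merges returns the pair, so the stacked result contains it twice. Below is a sketch of one way to handle this, assuming exact duplicate rows are never wanted; the fifth df2 row (8, "ee", "ee") and the name stacked are made up for illustration:

```r
df  <- data.frame(n = c(2, 3, 5, 5, 6, 7),
                  s = c("aa", "bb", "cc", "dd", "ee", "ff"),
                  b = c(2, 4, 5, 4, 3, 2),
                  stringsAsFactors = FALSE)
# Hypothetical extra row: s2 and b2 are both "ee"
df2 <- data.frame(n2 = c(5, 6, 7, 6, 8),
                  s2 = c("aa", "bb", "cc", "ll", "ee"),
                  b2 = c("hh", "nn", "ff", "dd", "ee"),
                  stringsAsFactors = FALSE)

t1 <- merge(df, df2, by.x = "s", by.y = "s2")
t1$s2 <- t1$s          # restore the key column consumed by merge
t2 <- merge(df, df2, by.x = "s", by.y = "b2")
t2$b2 <- t2$s

stacked <- rbind(t1, t2)     # rbind matches columns by name; "ee" pair appears twice
finaldf <- unique(stacked)   # keep a single copy of exact duplicate rows
nrow(stacked)  # 7
nrow(finaldf)  # 6
```

unique() compares whole rows, so it would also collapse pairs that are duplicated because df or df2 themselves contain repeated rows; if those must be kept, a row identifier column would be needed.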

Answer 3 (score: 0)

We can use a fuzzy join. It may not be very efficient if you have big data, but it is certainly readable. Using my package safejoin, which wraps fuzzyjoin in this case:

# devtools::install_github("moodymudskipper/safejoin")
library(safejoin)
safe_inner_join(df, df2, ~ X("s") == Y("s2") | X("s") == Y("b2"))
#   n  s b n2 s2 b2
# 1 2 aa 2  5 aa hh
# 2 3 bb 4  6 bb nn
# 3 5 cc 5  7 cc ff
# 4 5 dd 4  6 ll dd
# 5 7 ff 2  7 cc ff

The fuzzyjoin syntax would be:

library(fuzzyjoin)
fuzzy_inner_join(df, df2, match_fun = NULL, 
                 multi_by = list(x = "s", y= c("s2","b2")), 
                 multi_match_fun = function(x,y) x == y[,"s2"] | x == y[,"b2"])
#   n  s b n2 s2 b2
# 1 2 aa 2  5 aa hh
# 2 3 bb 4  6 bb nn
# 3 5 cc 5  7 cc ff
# 4 5 dd 4  6 ll dd
# 5 7 ff 2  7 cc ff
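
Conceptually, a join on an OR condition compares every row of df with every row of df2 and keeps the pairs that satisfy the predicate. For readers who want to avoid extra packages, the same idea can be sketched in base R with a Cartesian product. This is an illustrative alternative, not a description of what safejoin or fuzzyjoin do internally, and it materializes all nrow(df) * nrow(df2) pairs, so it only suits small inputs. The names all_pairs and res are made up here:

```r
df  <- data.frame(n = c(2, 3, 5, 5, 6, 7),
                  s = c("aa", "bb", "cc", "dd", "ee", "ff"),
                  b = c(2, 4, 5, 4, 3, 2),
                  stringsAsFactors = FALSE)
df2 <- data.frame(n2 = c(5, 6, 7, 6),
                  s2 = c("aa", "bb", "cc", "ll"),
                  b2 = c("hh", "nn", "ff", "dd"),
                  stringsAsFactors = FALSE)

# merge() with by = NULL returns the Cartesian product (6 * 4 = 24 rows)
all_pairs <- merge(df, df2, by = NULL)

# Keep the pairs where s equals s2 OR s equals b2
res <- all_pairs[all_pairs$s == all_pairs$s2 | all_pairs$s == all_pairs$b2, ]
rownames(res) <- NULL
nrow(res)  # 5
```

Because each (df row, df2 row) pair occurs exactly once in the product, a pair that satisfies both conditions is still returned only once. The columns must be character vectors for the == comparison; factors with different level sets would raise an error.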