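Merge on x1, or on x2 if there is no match, or on x3 if there is still no match

Time: 2018-10-03 20:07:08

Tags: r merge

I'm trying to merge two data sets on one key, but if there is no match I want to try another key, and so on.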

df1 <- data.frame(a=c(5,1,7,3),
              b=c("T","T","T","F"),
              c=c("F","T","F","F"))


df2 <- data.frame(x1=c(4,5,3,9), 
                  x2=c(7,8,1,2),
                  x3=c("g","w","t","o"))
df1
   a  b  c
1  5  T  F
2  1  T  T
3  7  T  F
4  3  F  F

df2
   x1 x2 x3 ..
1  4  7  g  ..
2  5  8  w  ..
3  3  1  t  ..
4  9  2  o  ..

The desired output looks like this:

   a  b  c x3  ..
1  5  T  F  w  ..
2  1  T  T  t  ..
3  7  T  F  g  ..
4  3  F  F  t  ..

I tried something like this:

dfm <- merge(df1,df2, by.x = "a", by.y = "x1", all.x = TRUE)
dfm <- merge(dfm,df2, by.x = "a", by.y = "x2", all.x = TRUE)

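But that is not quite right.

4 Answers:

Answer 0: (score: 3)

This isn't really a standard kind of merge. You can make it more standard by reshaping df2 so that you only need to merge on a single field: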

df2long <- rbind(
    data.frame(a = df2$x1, df2[,-(1:2), drop=FALSE]), 
    data.frame(a = df2$x2, df2[,-(1:2), drop=FALSE])
)
dfm <- merge(df1, df2long, by = "a", all.x = TRUE)
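The reshaping idea above can be sketched a little further. This is only an illustration under two assumptions that are not in the answer: the values in df1$a are unique, and when a key value occurs in both x1 and x2 the x1 row should win. The `duplicated()` filter and the final `match()` reordering are additions for illustration, not part of the original answer.

```r
df1 <- data.frame(a = c(5, 1, 7, 3),
                  b = c("T", "T", "T", "F"),
                  c = c("F", "T", "F", "F"))
df2 <- data.frame(x1 = c(4, 5, 3, 9),
                  x2 = c(7, 8, 1, 2),
                  x3 = c("g", "w", "t", "o"))

# Stack the candidate keys: all x1 rows first, then all x2 rows
df2long <- rbind(
  data.frame(a = df2$x1, df2[, -(1:2), drop = FALSE]),
  data.frame(a = df2$x2, df2[, -(1:2), drop = FALSE])
)

# Assumption: keep only the first occurrence of each key value,
# so that a match in x1 takes priority over a match in x2
df2long <- df2long[!duplicated(df2long$a), , drop = FALSE]

dfm <- merge(df1, df2long, by = "a", all.x = TRUE)

# merge() sorts by the key; restore the row order of df1
# (this relies on df1$a being unique)
dfm <- dfm[match(df1$a, dfm$a), ]
dfm
```

With the question's sample data no key value occurs twice, so the `duplicated()` step changes nothing here; it only matters when the same value appears in more than one key column.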

Answer 1: (score: 3)

You can do the following:

matches <- lapply(df2[, c("x1", "x2")], function(x) match(df1$a, x))
# finding matches in df2$x1 and df2$x2
# notice that the code below should work with any number of columns to be matched:
# you just need to add the names here eg. df2[, paste0("x", 1:100)] 
matches
$x1
[1]  2 NA NA  3

$x2
[1] NA  3  1 NA
combo <- Reduce(function(a,b) "[<-"(a, is.na(a), b[is.na(a)]), matches)
# combining the matches on "first come first served" basis
combo
[1] 2 3 1 3
cbind(df1, df2[combo,])
    a b c x1 x2 x3
2   5 T F  5  8  w
3   1 T T  3  1  t
1   7 T F  4  7  g
3.1 3 F F  3  1  t
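The `Reduce()` call with `"[<-"` is compact but not easy to read. A more explicit loop that should give the same indices could look like the sketch below. The helper name `first_match` is hypothetical, and only x3 is attached to the result here (rather than all columns of df2) to keep the output close to what the question asks for.

```r
df1 <- data.frame(a = c(5, 1, 7, 3),
                  b = c("T", "T", "T", "F"),
                  c = c("F", "T", "F", "F"))
df2 <- data.frame(x1 = c(4, 5, 3, 9),
                  x2 = c(7, 8, 1, 2),
                  x3 = c("g", "w", "t", "o"))

# Hypothetical helper: for each key, return the row index of the first
# candidate column (in the order given) that contains the key
first_match <- function(keys, candidates) {
  idx <- rep(NA_integer_, length(keys))
  for (col in candidates) {          # iterates over the columns
    todo <- is.na(idx)               # keys that are still unmatched
    idx[todo] <- match(keys[todo], col)
  }
  idx
}

combo <- first_match(df1$a, df2[c("x1", "x2")])
combo

# Attach only the looked-up column, keeping the row names of df1
result <- df1
result$x3 <- df2$x3[combo]
result
```

Keys that match in none of the candidate columns keep an `NA` index and therefore get `NA` in x3.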

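Answer 2: (score: 2)

If I understand correctly, the OP wants to first try to match a against x1 and then, only if that fails, try to match a against x2. So any match of a with x1 should take precedence over a match of a with x2.

Unfortunately, the sample data set provided by the OP does not contain a case that demonstrates this, so I have modified the sample data set accordingly (see the Data section).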

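The approach suggested here is to reshape df2 from wide to long format (as in MrFlick's answer), but to use a data.table join with the parameter mult = "first".

Which columns of df2 are treated as key columns, and their precedence, can be controlled through the measure.vars parameter of melt(). After reshaping, melt() arranges the rows in the order of the columns given in measure.vars:

library(data.table)
# define cols of df2 to use as key, in order of precedence
key_cols <- c("x1", "x2")
# reshape df2 from wide to long format
long <- melt(setDT(df2), measure.vars = key_cols, value.name = "a")
# join long with df1, pick first matches
result <- long[setDT(df1), on = "a", mult = "first"]
# clean up
setcolorder(result, names(df1))
result[, variable := NULL]
result

   a b c   x3
1: 5 T F    w
2: 1 T T    t
3: 7 T F    g
4: 3 F F    t
5: 0 F F <NA>

Note that the original row order of df1 has been preserved.

Also note that the code works for any number of key columns, and that the precedence of the key columns is easy to change. For example, if the order is reversed, i.e. key_cols <- c("x2", "x1"), matches of a with x2 are picked first.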
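To see the effect of the key column order in isolation, here is a minimal sketch on hypothetical toy data (the names `lookup`, `keys` and `pick_first` and all values are made up for illustration and are not the answer's data set). The value 5 occurs both in x1 and in x2, so the result depends on which column is listed first.

```r
library(data.table)

# Hypothetical toy data: 5 occurs in x1 (row 1) and in x2 (row 2)
lookup <- data.frame(x1 = c(5, 3), x2 = c(1, 5), x3 = c("p", "q"))
keys   <- data.frame(a = c(5, 1))

# Reshape to long format in the given column order, then take the
# first matching row for each key
pick_first <- function(key_cols) {
  long <- melt(as.data.table(lookup), measure.vars = key_cols,
               value.name = "a")
  as.character(long[as.data.table(keys), on = "a", mult = "first"]$x3)
}

pick_first(c("x1", "x2"))  # x1 has priority: 5 is found in row 1
pick_first(c("x2", "x1"))  # x2 has priority: 5 is found in row 2
```

`as.data.table()` is used here instead of `setDT()` so that the toy data frames are not modified by reference; that is a choice made for this sketch only.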

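Data

Enhanced sample data sets:

df1 has an additional row whose value of a has no match in df2.

df1 <- data.frame(a=c(5,1,7,3,0),
                  b=c("T","T","T","F","F"),
                  c=c("F","T","F","F","F"))
df1

   a b c
1: 5 T F
2: 1 T T
3: 7 T F
4: 3 F F
5: 0 F F

df2 also has an additional row, to show that a match in x1 takes precedence over a match in x2. The value 5 appears twice: in row 2 of column x1 and in row 5 of column x2.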

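Answer 3: (score: 0)

I'm not sure I understand your question, but instead of merging repeatedly you could compare the possible merge keys first (if the number of shared values is > 0, you have a match). If you want to merge on the first column that has a match, you can try the following: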

    library(tidyr)
    library(purrr)
    (df1 <- data.frame(a=c(5,1,7,3),
          b=c("T","T","T","F"),
          c=c("F","T","F","F")) )
    (df2 <- data.frame(x1=c(4,5,3,9), 
              x2=c(7,8,1,2),
              x3=c("g","w","t","o")) )

     FirstColMatch<-1:ncol(df2) %>% 
         map(~intersect(df1$a, df2[[.x]])) %>% 
         map(length)  %>%
         detect_index(function(x)x>0)

     NewDF<-merge(df1,df2,by.x="a", by.y =names(df2)[FirstColMatch])