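How do I merge two data frames and keep only the columns whose contents differ?

Time: 2020-10-14 12:16:04

Tags: r dataframe merge

I have two data frames with the same number of rows and (possibly) different numbers of columns. The column names also differ, but some columns may have identical contents.

For example, df1: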

df1<- data.frame("a"=c("0","1","0","1","0","0","0"),
                "b"=c("1","1","1","1","1","0","0"),
                "c"=c("1","1","0","0","1","0","0"),
                "d"=c("1","1","1","1","1","1","1"))

and df2:

df2<- data.frame("e"=c("1","1","0","1","0","0","0"),
                "f"=c("1","1","1","1","1","0","0"),
                "g"=c("0","0","0","0","1","0","0"),
                "h"=c("0","0","0","0","1","1","1"))

As you can see, column "b" of df1 and column "f" of df2 are equal. The result I want is therefore a new data frame like this:

df3 <- data.frame("a"=c("0","1","0","1","0","0","0"),
                  "c"=c("1","1","0","0","1","0","0"),
                  "d"=c("1","1","1","1","1","1","1"),
                  "e"=c("1","1","0","1","0","0","0"),
                  "g"=c("0","0","0","0","1","0","0"),
                  "h"=c("0","0","0","0","1","1","1"))

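Note: columns "b" and "f" (the identical ones) are not in the new df3. I have searched online but could not find an example. I think the main difficulty is that the merge is driven by column contents rather than by column names.

4 answers:

Answer 0 (score: 1)

We can use sapply to compare every column of df1 with every column of df2 and find exact matches.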

mat <- sapply(df1, function(x) sapply(df2, function(y) all(x == y)))
mat

#      a     b     c     d
#e FALSE FALSE FALSE FALSE
#f FALSE  TRUE FALSE FALSE
#g FALSE FALSE FALSE FALSE
#h FALSE FALSE FALSE FALSE

Here we can see that column b of df1 and column f of df2 should be removed. We can do that with:

m2 <- which(mat, arr.ind = TRUE)
cbind(df1[-m2[, 2]], df2[-m2[, 1]])

#  a c d e g h
#1 0 1 1 1 0 0
#2 1 1 1 1 0 0
#3 0 0 1 0 0 0
#4 1 0 1 1 0 0
#5 0 1 1 0 1 1
#6 0 0 1 0 0 1
#7 0 0 1 0 0 1
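
One caveat with the negative indexing above: if no column of df1 matches any column of df2, m2 has zero rows, and indexing with an empty negative vector selects no columns instead of all of them. A minimal sketch of a variant that sidesteps this with logical indexing follows. The helper name drop_shared_columns is hypothetical (it is not part of the answer), and the sketch assumes the columns are character vectors or otherwise comparable with ==.

```r
# Hypothetical helper; the name and signature are assumptions, not from the answer.
drop_shared_columns <- function(x, y) {
  # isTRUE() treats an NA comparison result as "not equal"
  same <- function(u, v) isTRUE(all(u == v))
  # Flag each column of x that equals at least one column of y, and vice versa
  x_dup <- vapply(x, function(cx) any(vapply(y, same, logical(1), v = cx)), logical(1))
  y_dup <- vapply(y, function(cy) any(vapply(x, same, logical(1), v = cy)), logical(1))
  # Logical indexing keeps every column when nothing is flagged
  cbind(x[!x_dup], y[!y_dup])
}

# Small illustrative data (not the question's data)
d1 <- data.frame(a = c("0", "1", "0"), b = c("1", "1", "1"), stringsAsFactors = FALSE)
d2 <- data.frame(e = c("1", "1", "1"), f = c("0", "0", "1"), stringsAsFactors = FALSE)
d_other <- data.frame(g = c("0", "0", "0"), stringsAsFactors = FALSE)

shared <- drop_shared_columns(d1, d2)       # b and e are equal, so both are dropped
none   <- drop_shared_columns(d1, d_other)  # nothing matches, so every column is kept
```

Here shared should contain only the columns a and f, while none should keep a, b and g.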

Answer 1 (score: 1)

This does the job:

df3 <- cbind(df1,df2)
df3 <- t(t(df3)[!(duplicated(t(df3)) | duplicated(t(df3), fromLast = TRUE)),])
df3

#  a c d e g h
#1 0 1 1 1 0 0
#2 1 1 1 1 0 0
#3 0 0 1 0 0 0
#4 1 0 1 1 0 0
#5 0 1 1 0 1 1
#6 0 0 1 0 0 1
#7 0 0 1 0 0 1

This gives you a matrix; if needed, you can convert the result to a data frame.

Answer 2 (score: 1)

We can use outer from base R
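
The result is a matrix because t() coerces the data frame. Calling as.data.frame() on it converts it back. As a variation on the same idea (a sketch, not the answerer's exact code), the duplicate flags can instead be used to index the combined data frame directly, so the result never leaves data frame form. The names combined, tm and dup are illustrative.

```r
# Data from the question
df1 <- data.frame("a" = c("0","1","0","1","0","0","0"),
                  "b" = c("1","1","1","1","1","0","0"),
                  "c" = c("1","1","0","0","1","0","0"),
                  "d" = c("1","1","1","1","1","1","1"))
df2 <- data.frame("e" = c("1","1","0","1","0","0","0"),
                  "f" = c("1","1","1","1","1","0","0"),
                  "g" = c("0","0","0","0","1","0","0"),
                  "h" = c("0","0","0","0","1","1","1"))

combined <- cbind(df1, df2)
tm <- t(combined)  # character matrix with one row per original column

# TRUE for every column whose contents occur more than once
dup <- duplicated(tm) | duplicated(tm, fromLast = TRUE)

# Index the data frame itself, so the result stays a data frame
df3 <- combined[, !dup, drop = FALSE]
str(df3)
```

Note that this approach, like the original, also drops columns that are duplicated within df1 or within df2, not only across the two.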

mat <- outer(df1, df2, FUN = Vectorize(function(x, y) all(x == y)))
mat
#      e     f     g     h
#a FALSE FALSE FALSE FALSE
#b FALSE  TRUE FALSE FALSE
#c FALSE FALSE FALSE FALSE
#d FALSE FALSE FALSE FALSE

Now we can get the row/column names of the matching pairs

m2 <- as.matrix(subset(as.data.frame.table(mat), Freq, select = -Freq))

Now we use 'm2' to drop the matching column names from 'df1' and 'df2', then cbind what is left

cbind(df1[setdiff(names(df1), m2[,1])], df2[setdiff(names(df2), m2[,2])])
#  a c d e g h
#1 0 1 1 1 0 0
#2 1 1 1 1 0 0
#3 0 0 1 0 0 0
#4 1 0 1 1 0 0
#5 0 1 1 0 1 1
#6 0 0 1 0 0 1
#7 0 0 1 0 0 1

Answer 3 (score: 1)

Here is a more tidyverse-style solution.

library(dplyr)
library(tidyr)
library(tibble) # rownames_to_column() comes from tibble
# based on Ronak's sapply approach
matches <- as.data.frame(sapply(df1, function(x) sapply(df2, function(y) identical(x, y)))) %>%
  rownames_to_column(var = "df2") %>%
  pivot_longer(-df2, names_to = "df1") %>% # pivot longer
  filter(value) # keep only the matches

# programmatically build list of names to remove
vars_remove <- c(matches$df1, matches$df2) # will remove var names that are matches
df1 %>% bind_cols(df2) %>%
  select(-any_of(vars_remove))

#  a c d e g h
#1 0 1 1 1 0 0
#2 1 1 1 1 0 0
#3 0 0 1 0 0 0
#4 1 0 1 1 0 0
#5 0 1 1 0 1 1
#6 0 0 1 0 0 1
#7 0 0 1 0 0 1
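
The tidyverse answer above compares columns with identical(x, y) where the earlier answers use all(x == y). The two agree on the question's data, but they are not interchangeable in general: == coerces its operands to a common type, while identical() also requires the same type and attributes. A small sketch with made-up vectors (chr and num are illustrative names):

```r
chr <- c("1", "0", "1")  # character column, as in the question
num <- c(1, 0, 1)        # same values stored as numbers

loose  <- all(chr == num)     # == coerces num to character before comparing
strict <- identical(chr, num) # types differ, so this is FALSE

print(c(loose = loose, strict = strict))
```

Which one is appropriate depends on whether columns of different types holding the same values should count as equal.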