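Merging data frames by exact match on columns

Time: 2018-10-03 20:20:37

Tags: r dataframe merge conditional match

I want to merge two data frames, one of which has more variables (columns) while the other has more observations (rows). Here is a simplified example of how they are set up:

Data frame 1: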

ID      Date         Indicator
12345   01/01/2008   1
54321   12/01/2008   1

Data frame 2:

ID      Date         
12345   01/01/2008   
12345   01/31/2008
12345   02/28/2009
24681   01/01/2008
54321   12/01/2008
54321   12/20/2008

All I want is to keep the rows whose ID matches exactly. For example, I want the following output:

New data frame:

ID      Date         Indicator     
12345   01/01/2008   1
12345   01/31/2008   NA
12345   02/28/2009   NA
54321   12/01/2008   1
54321   12/20/2008   NA

I tried

new <- merge(df1, df2, all=TRUE)

But this merges all of the rows, whereas I only want the rows of df2 whose ID appears in df1.

Thanks for your help!

7 answers:

Answer 0 (score: 2)

You can try a dplyr solution:

library(dplyr)

# a right join when you filter Dataframe2 by ID in Dataframe1
Dataframe1 %>% right_join(Dataframe2[Dataframe2$ID %in% Dataframe1$ID,])
Joining, by = c("ID", "Date")
     ID       Date Indicator
1 12345 01/01/2008         1
2 12345 01/31/2008        NA
3 12345 02/28/2009        NA
4 54321 12/01/2008         1
5 54321 12/20/2008        NA

# clearly you can put it in a data.frame
# (join on both keys, so that Indicator stays NA where the Date differs)
Dataframe3 <- Dataframe1 %>%
  right_join(Dataframe2[Dataframe2$ID %in% Dataframe1$ID,], by = c("ID", "Date")) %>%
  data.frame()

So you will not have ID 24681, and Indicator is NA, as required, wherever the Date has no match.

Your data:
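The data listing for this answer is missing here. As a stand-in, here is a minimal sketch that assumes the frames simply mirror the two tables in the question. The names Dataframe1 and Dataframe2 are taken from the code above; keeping the dates as plain strings is an assumption.

```r
# Assumed reconstruction of the question's two tables (dates kept as strings)
Dataframe1 <- data.frame(
  ID        = c(12345, 54321),
  Date      = c("01/01/2008", "12/01/2008"),
  Indicator = c(1, 1)
)
Dataframe2 <- data.frame(
  ID   = c(12345, 12345, 12345, 24681, 54321, 54321),
  Date = c("01/01/2008", "01/31/2008", "02/28/2009",
           "01/01/2008", "12/01/2008", "12/20/2008")
)

# The filter used inside the join: TRUE for rows whose ID also occurs in Dataframe1
keep <- Dataframe2$ID %in% Dataframe1$ID
Dataframe2[keep, ]
```

Only the row with ID 24681 is dropped by this filter, so five of the six rows remain to be joined.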

Answer 1 (score: 1)

You can try the join() function from the plyr library. You also need one extra step to get exactly the output you want.

library(plyr)

df1

     ID       Date Indicator
1 12345 2020-01-01         1
2 54321 2020-12-01         1

 df2

     ID       Date
1 12345 2020-01-01
2 12345 2020-01-31
3 12345 2020-02-28
4 24681 2020-01-01
5 54321 2020-12-01
6 54321 2020-12-20

# that extra step
df3 <- df2[df2$ID %in% df1$ID,]
df3
     ID       Date
1 12345 2020-01-01
2 12345 2020-01-31
3 12345 2020-02-28
5 54321 2020-12-01
6 54321 2020-12-20

join(df3, df1, by = c("ID", "Date"))
     ID       Date Indicator
1 12345 2020-01-01         1
2 12345 2020-01-31        NA
3 12345 2020-02-28        NA
4 54321 2020-12-01         1
5 54321 2020-12-20        NA

Answer 2 (score: 0)

Edited following s_t's comment:

library(dplyr)
left_join(df2, df1, by=c("ID", "Date")) %>% filter(ID %in% df1$ID)

Answer 3 (score: 0)

Consider merge together with subset:

df3 <- subset(merge(df1, df2, by=c("ID", "Date"), all=TRUE), ID %in% df1$ID)

df3
#      ID       Date Indicator
# 1 12345 01/01/2008         1
# 2 12345 01/31/2008        NA
# 3 12345 02/28/2009        NA
# 5 54321 12/01/2008         1
# 6 54321 12/20/2008        NA

To reset the row.names, wrap the result in the data.frame() constructor and set the row.names argument:

df3 <- data.frame(subset(merge(df1, df2, by=c("ID", "Date"), all=TRUE),
                         ID %in% df1$ID),
                  row.names = NULL)

df3
#      ID       Date Indicator
# 1 12345 01/01/2008         1
# 2 12345 01/31/2008        NA
# 3 12345 02/28/2009        NA
# 4 54321 12/01/2008         1
# 5 54321 12/20/2008        NA
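The same two ingredients can also be applied in the other order. The following is only a sketch, assuming df1 and df2 hold the question's tables with dates as strings. It filters df2 first and then does a left merge; df2_kept is a name introduced here for illustration.

```r
# Assumed reconstruction of the question's tables
df1 <- data.frame(ID = c(12345, 54321),
                  Date = c("01/01/2008", "12/01/2008"),
                  Indicator = c(1, 1))
df2 <- data.frame(ID = c(12345, 12345, 12345, 24681, 54321, 54321),
                  Date = c("01/01/2008", "01/31/2008", "02/28/2009",
                           "01/01/2008", "12/01/2008", "12/20/2008"))

# Drop the df2 rows whose ID never occurs in df1 ...
df2_kept <- subset(df2, ID %in% df1$ID)

# ... then left-join df1's Indicator onto them (all.x = TRUE keeps unmatched rows)
df3 <- merge(df2_kept, df1, by = c("ID", "Date"), all.x = TRUE)
df3
```

Filtering first means the merge never sees ID 24681, which may matter when df2 is much larger than df1.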

Answer 4 (score: 0)

Just try:

library(dplyr)
df2 %>%
  left_join(df1, by = c("ID", "Date")) %>% # or full_join(df1, by = c("ID", "Date"))
  filter(ID %in% df1$ID) 

Or, building on what you started with:

merge(df1, df2, all = TRUE) %>% filter(ID %in% df1$ID)

Answer 5 (score: 0)

If the data is not too large, you can add one line that filters the result by df1$ID.

<code-example>
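One way to read this suggestion, as a sketch rather than the answerer's actual code: keep the full merge from the question and add a single filtering line. The data below is an assumed reconstruction of the question's tables, and merged and result are names introduced for illustration.

```r
# Assumed reconstruction of the question's tables
df1 <- data.frame(ID = c(12345, 54321),
                  Date = c("01/01/2008", "12/01/2008"),
                  Indicator = c(1, 1))
df2 <- data.frame(ID = c(12345, 12345, 12345, 24681, 54321, 54321),
                  Date = c("01/01/2008", "01/31/2008", "02/28/2009",
                           "01/01/2008", "12/01/2008", "12/20/2008"))

# Full outer join on the shared columns (ID and Date), as in the question
merged <- merge(df1, df2, all = TRUE)

# The one extra line: keep only rows whose ID is known to df1
result <- merged[merged$ID %in% df1$ID, ]
result
```

The full merge is built before anything is discarded, which is presumably why the answer limits this to data that is not too large.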

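Answer 6 (score: 0)

A join is what you are looking for. If the table you want to keep as the reference is on the left, it is a left join. Sample code: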

    df1 <- data.frame(ID = c(12345, 54321),
                      Date = c('01/01/2008', '12/01/2008'),
                      Indicator = c(1, 1))

    df2 <- data.frame(ID = c(12345, 12345, 5341),
                      Date = c('01/01/2008', '12/01/2008', '12/1/2008'))

    merge(df1, df2, by.x = 'ID', by.y = 'ID')

         ID     Date.x Indicator     Date.y
    1 12345 01/01/2008         1 01/01/2008
    2 12345 01/01/2008         1 12/01/2008

So only the rows of df1 that are also present in df2 are part of the output.
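Note that merge() without any all argument keeps only the IDs found in both tables. To keep every row of the left-hand table, as the description of a left join suggests, all.x = TRUE is needed. A small sketch of the difference, assuming the same sample data with the stray spaces in the date strings removed; inner and left are names introduced for illustration:

```r
df1 <- data.frame(ID = c(12345, 54321),
                  Date = c("01/01/2008", "12/01/2008"),
                  Indicator = c(1, 1))
df2 <- data.frame(ID = c(12345, 12345, 5341),
                  Date = c("01/01/2008", "12/01/2008", "12/1/2008"))

# Default merge: only IDs present in both tables survive
inner <- merge(df1, df2, by = "ID")

# Left join: every df1 row is kept; Date.y is NA where df2 has no such ID
left <- merge(df1, df2, by = "ID", all.x = TRUE)
left
```

Here the left join additionally keeps ID 54321 from df1, with NA in Date.y, while the default merge drops it.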