Question

I'm an R beginner, so I apologize in advance if this has been asked elsewhere. Here is my problem:

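I have two data frames, df1 and df2, with different numbers of rows and columns. The two frames have only one variable (column) in common, called "customer_no". I want the merged frame to match records based on "customer_no" and to keep only the customers that appear in df2. Each customer_no has multiple rows of data.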

I tried the following:

merged.df <- merge(df1, df2, by="customer_no", all.y=TRUE)

The problem is that this assigns the values of df1 to the df2 rows, where they should be left empty. My questions are:

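1) How do I tell the command to leave the non-matching columns empty?
2) How can I see in the merged file which row came from which df? I suppose that if I solve the first problem, the empty columns should make this easy to see.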

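I'm missing something in my command, but I don't know what. If this question has already been answered elsewhere, could you still rephrase the answer in plain English for an R beginner?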

Thanks!

Sample data:

df1:
customer_no  country  year
  10           UK     2001
  10           UK     2002
  10           UK     2003
  20           US     2007
  30           AU     2006


df2:          
customer_no   income
  10            700
  10            800
  10            900 
  30            1000

The merged file should look like this:

merged.df:
 customer_no   income  country   year
     10                  UK      2001
     10                  UK      2002
     10                  UK      2003
     10         700
     10         800
     10         900
     30                  AU      2006
     30         1000

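So: it puts all the columns together, it adds the df2 rows after the last df1 row with the same customer_no, and it keeps only the customer_no values found in df2 (merged.df has no customer_no 20). It also leaves all other cells empty.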

In Stata I would use append, but I'm not sure about R... maybe join?

Thanks!

Answer 1

Try:

df1$id <- paste(df1$customer_no, 1, sep="_")
df2$id <- paste(df2$customer_no, 2, sep="_")

res <- merge(df1, df2, by=c('id', 'customer_no'),all=TRUE)[,-1]
res1 <- res[res$customer_no %in% df2$customer_no,]
res1
 #  customer_no country year income
 #1          10      UK 2001     NA
 #2          10      UK 2002     NA
 #3          10      UK 2003     NA
 #4          10    <NA>   NA    700
 #5          10    <NA>   NA    800
 #6          10    <NA>   NA    900
 #8          30      AU 2006     NA
 #9          30    <NA>   NA   1000

If you want to change the NA values to '',

 res1[is.na(res1)] <- '' #But, I would leave it as `NA` as there are `numeric` columns.
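
The comment on the line above advises keeping `NA` when there are numeric columns. A small sketch of why, using a made-up data frame (`toy` is a hypothetical name, not part of the question's data): assigning `''` into a numeric column that contains `NA` turns that column into character, so it can no longer be summed or averaged directly.

```r
# Toy data frame (hypothetical) with a numeric column containing one NA
toy <- data.frame(customer_no = c(10, 30), income = c(NA, 1000))
class(toy$income)   # numeric before the replacement

# Same replacement as above: every NA cell becomes an empty string
toy[is.na(toy)] <- ''
class(toy$income)   # character after the replacement
```

If the blanks are only needed for display or export, it is probably safer to keep `NA` in the data and handle the blanks at the output step.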

Or, using rbindlist from data.table (with the original datasets):

 library(data.table)
 indx <- df1$customer_no %in% df2$customer_no
 rbindlist(list(df1[indx,], df2),fill=TRUE)[order(customer_no)]

 #    customer_no country year income
 #1:          10      UK 2001     NA
 #2:          10      UK 2002     NA
 #3:          10      UK 2003     NA
 #4:          10      NA   NA    700
 #5:          10      NA   NA    800
 #6:          10      NA   NA    900
 #7:          30      AU 2006     NA
 #8:          30      NA   NA   1000

Answer 2

You can also use the smartbind function from the gtools package.

require(gtools)
res <- smartbind(df1[df1$customer_no %in% df2$customer_no, ], df2)
res[order(res$customer_no), ]
#      customer_no country year income
#  1:1          10      UK 2001     NA
#  1:2          10      UK 2002     NA
#  1:3          10      UK 2003     NA
#  2:1          10    <NA>   NA    700
#  2:2          10    <NA>   NA    800
#  2:3          10    <NA>   NA    900
#  1:4          30      AU 2006     NA
#  2:4          30    <NA>   NA   1000

Answer 3

Try:

df1$income = df2$country = df2$year = NA
rbind(df1, df2)
  customer_no country year income
1          10      UK 2001     NA
2          10      UK 2002     NA
3          10      UK 2003     NA
4          20      US 2007     NA
5          30      AU 2006     NA
6          10    <NA>   NA    700
7          10    <NA>   NA    800
8          10    <NA>   NA    900
9          30    <NA>   NA   1000
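
Note that the result above still contains customer_no 20, which the question wanted to drop, and it does not say which data frame each row came from. A minimal base-R sketch that addresses both, assuming the sample data from the question; the `source` column and the names `a`, `b` and `stacked` are illustrative assumptions, not part of the original answers:

```r
# Sample data from the question
df1 <- data.frame(customer_no = c(10, 10, 10, 20, 30),
                  country = c("UK", "UK", "UK", "US", "AU"),
                  year = c(2001, 2002, 2003, 2007, 2006),
                  stringsAsFactors = FALSE)
df2 <- data.frame(customer_no = c(10, 10, 10, 30),
                  income = c(700, 800, 900, 1000))

# Keep only the df1 rows whose customer_no also occurs in df2
a <- df1[df1$customer_no %in% df2$customer_no, ]
b <- df2

# 'source' is a hypothetical helper column recording each row's origin
a$source <- "df1"
b$source <- "df2"

# Add the columns each frame lacks, filled with NA
a[setdiff(names(b), names(a))] <- NA
b[setdiff(names(a), names(b))] <- NA

# rbind matches data frame columns by name, so their order may differ
stacked <- rbind(a, b)
stacked <- stacked[order(stacked$customer_no), ]
rownames(stacked) <- NULL
stacked
```

Within each customer_no the df1 rows come first, followed by the df2 rows, which matches the layout requested in the question.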

R: merge / rbind / concatenate two data frames
