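I am trying to merge 4 data frames on 2 columns while keeping track of which data frame each column came from. I am having trouble tracking the columns.
(See the end of the post for the dput() output of the data frames.)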
#df example (df1)
Name Color Freq
banana yellow 3
apple red 1
apple green 4
plum purple 8
#create list of dataframes
list.df <- list(df1, df2, df3, df4)
#merge dfs on column "Name" and "Color"
combo.df <- Reduce(function(x,y) merge(x,y, by = c("Name", "Color"), all = TRUE, accumulate=FALSE, suffixes = c(".df1", ".df2", ".df3", ".df4")), list.df)
This gives the following warning:
Warning message: In merge.data.frame(x, y, by = c("Name", "Color"), all = TRUE, : column names 'Freq.df1', 'Freq.df2' are duplicated in the result
and outputs this data frame:
#combo df example
Name Color Freq.df1 Freq.df2 Freq.df1 Freq.df2
banana yellow 3 3 7 NA
apple red 1 2 9 1
apple green 4 NA 8 2
plum purple 8 1 NA 6
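df1 and df2 are repeated in name only. The values that populate the third and fourth Freq columns of combo.df actually come from df3 and df4, respectively.
What I would really like is: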
Name Color Freq.df1 Freq.df2 Freq.df3 Freq.df4
banana yellow 3 3 7 NA
apple red 1 2 9 1
apple green 4 NA 8 2
plum purple 8 1 NA 6
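How can I do this? I know that the suffixes argument of merge() only accepts a character vector of length 2, but I don't know what the workaround should be. Thanks!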
df1 <-
structure(list(Name = structure(c(2L, 1L, 1L, 3L), .Label = c("apple",
"banana", "plum"), class = "factor"), Color = structure(c(4L,
3L, 1L, 2L), .Label = c("green", "purple", "red", "yellow"), class = "factor"),
Freq = c(3, 1, 4, 8)), .Names = c("Name", "Color", "Freq"
), row.names = c(NA, -4L), class = "data.frame")
df2 <-
structure(list(Name = structure(c(2L, 1L, 3L), .Label = c("apple",
"banana", "plum"), class = "factor"), Color = structure(c(3L,
2L, 1L), .Label = c("purple", "red", "yellow"), class = "factor"),
Freq = c(3, 2, 1)), .Names = c("Name", "Color", "Freq"), row.names = c(NA,
-3L), class = "data.frame")
df3 <-
structure(list(Name = structure(c(2L, 1L, 1L), .Label = c("apple",
"banana"), class = "factor"), Color = structure(c(3L, 2L, 1L), .Label = c("green",
"red", "yellow"), class = "factor"), Freq = c(7, 9, 8)), .Names = c("Name",
"Color", "Freq"), row.names = c(NA, -3L), class = "data.frame")
df4 <-
structure(list(Name = structure(c(1L, 1L, 2L), .Label = c("apple",
"plum"), class = "factor"), Color = structure(c(3L, 1L, 2L), .Label = c("green",
"purple", "red"), class = "factor"), Freq = c(1, 2, 6)), .Names = c("Name",
"Color", "Freq"), row.names = c(NA, -3L), class = "data.frame")
Answer 0: (score: 1)
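A for loop seems easier here, because Reduce (or reduce from purrr) only handles two datasets at a time, so merge can never be given more than two suffixes.
Here we create a vector of suffixes ('sfx') and initialize the output dataset with the first element of the list. We then loop over the sequence of 'list.df' and merge 'res' with the next element of list.df, updating 'res' at each step.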
sfx <- c(".df1", ".df2", ".df3", ".df4")
res <- list.df[[1]]
for(i in head(seq_along(list.df), -1)) {
res <- merge(res, list.df[[i+1]], all = TRUE,
suffixes = sfx[i:(i+1)], by = c("Name", "Color"))
}
res
# Name Color Freq.df1 Freq.df2 Freq.df3 Freq.df4
#1 apple green 4 NA 8 2
#2 apple red 1 2 9 1
#3 banana yellow 3 3 7 NA
#4 plum purple 8 1 NA 6
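One thing to watch with the loop above: merge only applies suffixes to column names that clash, so the loop depends on each newly added Freq column being renamed by a later merge. With an odd number of data frames the last column would, as far as I can tell, stay as plain Freq. A minimal sketch of a variant that sidesteps this by renaming the value columns before merging is shown below. It assumes the list is named (df1 to df4), and the names keys, renamed and combo.df are only illustrative.

```r
# Example data with the same values as the dput() output in the question.
list.df <- list(
  df1 = data.frame(Name = c("banana", "apple", "apple", "plum"),
                   Color = c("yellow", "red", "green", "purple"),
                   Freq = c(3, 1, 4, 8)),
  df2 = data.frame(Name = c("banana", "apple", "plum"),
                   Color = c("yellow", "red", "purple"),
                   Freq = c(3, 2, 1)),
  df3 = data.frame(Name = c("banana", "apple", "apple"),
                   Color = c("yellow", "red", "green"),
                   Freq = c(7, 9, 8)),
  df4 = data.frame(Name = c("apple", "apple", "plum"),
                   Color = c("red", "green", "purple"),
                   Freq = c(1, 2, 6))
)

keys <- c("Name", "Color")

# Append the list name to every non-key column, e.g. Freq -> Freq.df3.
renamed <- Map(function(d, nm) {
  is_value <- !(names(d) %in% keys)
  names(d)[is_value] <- paste(names(d)[is_value], nm, sep = ".")
  d
}, list.df, names(list.df))

# No column names clash any more, so merge never needs suffixes.
combo.df <- Reduce(function(x, y) merge(x, y, by = keys, all = TRUE), renamed)
combo.df
```

Because every value column is already unique before the merge, this works the same way for any number of data frames.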
Answer 1: (score: 1)
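I was finally able to do this with the Reduce function itself. To make it work, I changed the input into a specific format.
Since the function passed to Reduce cannot see the name of the data.frame it receives, I created a list in which each element carries an attribute n holding the name of the data.frame.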
lst=list(list(n="df1",df=df1),list(n="df2",df=df2),list(n="df3",df=df3), list(n="df4",df=df4))
I then built the logic to keep track of the names of the data.frames being processed.
Reduce(function(x,y){
if(ncol(x$df)==3){
#df column names after 1st merge.
namecol=c('Name','Color',paste0("Freq.",x$n),paste0("Freq.",y$n))
}else{
#df column names for remaining merges.
namecol=c(colnames(x$df),paste0("Freq.",y$n))
}
df=merge.data.frame(x = x$df,y = y$df,by = c("Name","Color"),all = TRUE)
colnames(df)=namecol
list(n="df",df=df)},lst)
#$n
#[1] "df"
#$df
# Name Color Freq.df1 Freq.df2 Freq.df3 Freq.df4
#1 apple green 4 NA 8 2
#2 apple red 1 2 9 1
#3 banana yellow 3 3 7 NA
#4 plum purple 8 1 NA 6
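A related way to make the names available inside Reduce, sketched here as an assumption rather than as part of the answer above, is to fold over the names of a named list and look each data.frame up by name. The helper tag_freq and the variables keys, nms and res are hypothetical names used only for this illustration.

```r
# Example data with the same values as the question's data frames.
list.df <- list(
  df1 = data.frame(Name = c("banana", "apple", "apple", "plum"),
                   Color = c("yellow", "red", "green", "purple"),
                   Freq = c(3, 1, 4, 8)),
  df2 = data.frame(Name = c("banana", "apple", "plum"),
                   Color = c("yellow", "red", "purple"),
                   Freq = c(3, 2, 1)),
  df3 = data.frame(Name = c("banana", "apple", "apple"),
                   Color = c("yellow", "red", "green"),
                   Freq = c(7, 9, 8)),
  df4 = data.frame(Name = c("apple", "apple", "plum"),
                   Color = c("red", "green", "purple"),
                   Freq = c(1, 2, 6))
)
keys <- c("Name", "Color")

# Hypothetical helper: fetch one data.frame by name and tag its Freq column.
tag_freq <- function(nm) {
  d <- list.df[[nm]]
  names(d)[names(d) == "Freq"] <- paste0("Freq.", nm)
  d
}

# Fold over the names: the first data.frame is the initial value,
# and each later name is looked up, tagged, and merged in turn.
nms <- names(list.df)
res <- Reduce(function(acc, nm) merge(acc, tag_freq(nm), by = keys, all = TRUE),
              nms[-1], tag_freq(nms[1]))
res
```

This avoids the wrapper list with the n attribute, at the cost of relying on list.df being visible from inside the helper.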
Answer 2: (score: 0)
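The eat function in my package safejoin supports this: if you give it a named list of data.frames as its second input, it joins them recursively to the first input, and the new columns are prefixed with the list names.
The first data.frame has to be renamed separately.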
# devtools::install_github("moodymudskipper/safejoin")
library(safejoin)
library(dplyr)
eat(rename(df1,df1_Freq = Freq), lst(df2,df3,df4),
.by = c("Name","Color"), .mode= "full",.check="")
# Name Color df1_Freq df2_Freq df3_Freq df4_Freq
# 1 banana yellow 3 3 7 NA
# 2 apple red 1 2 9 1
# 3 apple green 4 NA 8 2
# 4 plum purple 8 1 NA 6
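.mode = "full" requests a full outer join, although here the default (a left join) gives the same result.
.check = "" turns off the checks, which would otherwise warn that the join columns have different factor levels.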