dplyr full_join does not work as expected

Time: 2015-05-22 12:04:49

Tags: r dplyr

Here is a toy example (where merge is from the base package and full_join is from dplyr):

require(dplyr)
a = data.frame(Day=Sys.Date()+1:5,x=1:5)
b = data.frame(Day=Sys.Date()-1:5,x=3*(1:5))

x1 = b
x2 = b
for(i in 1:10){
   x1=full_join(x1,a,by="Day")
   x2 = merge(x2,a,by="Day",all=T)
}

x1 and x2 differ. I expected x2, since a is appended at the end on every iteration. Here is x2 (the first few rows):

2015-05-14 15 NA NA NA NA NA NA NA NA NA NA
2015-05-15 12 NA NA NA NA NA NA NA NA NA NA
2015-05-16  9 NA NA NA NA NA NA NA NA NA NA
2015-05-17  6 NA NA NA NA NA NA NA NA NA NA

But x1, from full_join, is:

         Day x.x x.y x.x x.y x.x x.y x.x x.y x.x x.y  x
1 2015-05-18   3  NA   3  NA   3  NA   3  NA   3  NA NA
2 2015-05-17   6  NA   6  NA   6  NA   6  NA   6  NA NA
3 2015-05-16   9  NA   9  NA   9  NA   9  NA   9  NA NA

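Is this a bug, or is it expected? I would think the output of merge (x2) is the logically correct one. I want to get x2 using dplyr's full_join. Is there a way?

1 answer:

Answer 0 (score: 0)

If you rename the column in data frame a, both approaches behave the same: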

require(dplyr)
a = data.frame(Day=Sys.Date()+1:5,y=1:5)
b = data.frame(Day=Sys.Date()-1:5,x=3*(1:5))

x1 = b
x2 = b
for(i in 1:10){
  x1=full_join(x1,a,by="Day")
  x2=merge(x2,a,by="Day",all=T)
}

# fix up the column names...
names(x1) <- sapply(1:ncol(x1), function(x) {paste0("V", x)})
names(x2) <- sapply(1:ncol(x2), function(x) {paste0("V", x)})

x1 %>% arrange(desc(V1))
x2 %>% arrange(desc(V1))

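So I changed this line:

a = data.frame(Day=Sys.Date()+1:5,x=1:5)

to:

a = data.frame(Day=Sys.Date()+1:5,y=1:5)

Why does this happen? When you run the code you provided above, you should actually get warning messages. On my version of R, I get the following: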

Warning messages:
1: In merge.data.frame(x2, a, by = "Day", all = T) :
  column names ‘x.x’, ‘x.y’ are duplicated in the result
2: In merge.data.frame(x2, a, by = "Day", all = T) :
  column names ‘x.x’, ‘x.y’ are duplicated in the result
3: In merge.data.frame(x2, a, by = "Day", all = T) :
  column names ‘x.x’, ‘x.y’, ‘x.x’, ‘x.y’ are duplicated in the result
4: In merge.data.frame(x2, a, by = "Day", all = T) :
  column names ‘x.x’, ‘x.y’, ‘x.x’, ‘x.y’ are duplicated in the result
5: In merge.data.frame(x2, a, by = "Day", all = T) :
  column names ‘x.x’, ‘x.y’, ‘x.x’, ‘x.y’, ‘x.x’, ‘x.y’ are duplicated in the result
6: In merge.data.frame(x2, a, by = "Day", all = T) :
  column names ‘x.x’, ‘x.y’, ‘x.x’, ‘x.y’, ‘x.x’, ‘x.y’ are duplicated in the result
7: In merge.data.frame(x2, a, by = "Day", all = T) :
  column names ‘x.x’, ‘x.y’, ‘x.x’, ‘x.y’, ‘x.x’, ‘x.y’, ‘x.x’, ‘x.y’ are duplicated in the result
8: In merge.data.frame(x2, a, by = "Day", all = T) :
  column names ‘x.x’, ‘x.y’, ‘x.x’, ‘x.y’, ‘x.x’, ‘x.y’, ‘x.x’, ‘x.y’ are duplicated in the result
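
To see where the duplicated names come from, the first three merge steps can be traced by hand. This is only a sketch using base R; the fixed date `base_day` and the names `step1` to `step3` are assumptions added for illustration so the output is reproducible.

```r
# Fixed base date (an assumption; the question uses Sys.Date()).
base_day <- as.Date("2015-05-19")
a <- data.frame(Day = base_day + 1:5, x = 1:5)
b <- data.frame(Day = base_day - 1:5, x = 3 * (1:5))

# Step 1: both inputs have a column "x", so merge adds suffixes.
step1 <- merge(b, a, by = "Day", all = TRUE)
names(step1)  # "Day" "x.x" "x.y"

# Step 2: no clash this time, so the new column keeps the name "x".
step2 <- merge(step1, a, by = "Day", all = TRUE)
names(step2)  # "Day" "x.x" "x.y" "x"

# Step 3: "x" clashes again, and the suffixed names repeat existing ones.
# merge() warns here that column names are duplicated in the result.
step3 <- merge(step2, a, by = "Day", all = TRUE)
names(step3)  # "Day" "x.x" "x.y" "x.x" "x.y"
```

From the third step on, the data frame carries several columns with the same name, which is the ambiguity the warnings point at.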

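So I think the reason the results of full_join and merge do not match in this case is that the column names in the two data frames you provided are ambiguous. Once you remove that ambiguity, the results match as expected, so I don't think this is a bug.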
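
One way to remove the ambiguity for every iteration, sketched here as a variation rather than as the exact approach above, is to give the incoming column a unique name before each join. The helper copy `a_i`, the names `x_1` to `x_10`, and the fixed date `base_day` are assumptions introduced for this example.

```r
library(dplyr)

# Fixed base date (an assumption; the question uses Sys.Date()).
base_day <- as.Date("2015-05-19")
a <- data.frame(Day = base_day + 1:5, x = 1:5)
b <- data.frame(Day = base_day - 1:5, x = 3 * (1:5))

x1 <- b
x2 <- b
for (i in 1:10) {
  # Rename the value column so it can never clash with an earlier one.
  a_i <- a
  names(a_i)[names(a_i) == "x"] <- paste0("x_", i)

  x1 <- full_join(x1, a_i, by = "Day")
  x2 <- merge(x2, a_i, by = "Day", all = TRUE)
}

# The two functions may order rows differently, so sort before comparing.
x1 <- x1[order(x1$Day), ]
x2 <- x2[order(x2$Day), ]
rownames(x1) <- NULL
rownames(x2) <- NULL

names(x1)
all.equal(x1, x2)
```

With unique names neither function has to add suffixes, so no duplicated columns appear and the two results can be compared column by column.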