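Bind two data frames based on the intersection of two columns

Asked: 2014-07-21 21:17:10

Tags: r dataframe

Hi all, I want to bind two data frames in R wherever two of their columns match.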

X.1 <- runif(5)
X.2 <- runif(5)
fruit <- c("apple","apple","banana","orange","orange")
month <- c("January","February","March","April","May")


fruit.second <- c("apple","apple","apple","banana","orange","orange")
month.second <- c("January","February","March","January","April","May")
Y.1 <- runif(6)
Y.2 <- runif(6)

df <- data.frame(X.1,X.2,as.character(fruit),as.character(month))
df


        X.1        X.2 as.character.fruit. as.character.month.
1 0.08694442 0.67541559               apple             January
2 0.50374582 0.04485657               apple            February
3 0.50482380 0.76090011              banana               March
4 0.75920285 0.61077744              orange               April
5 0.95243661 0.18064744              orange                 May  



df2 <- data.frame(as.character(fruit.second),as.character(month.second),Y.1,Y.2)
df2

  as.character.fruit.second. as.character.month.second.       Y.1       Y.2
1                      apple                    January 0.3407055 0.5740400
2                      apple                   February 0.1529912 0.8163872
3                      apple                      March 0.1042926 0.9807348
4                     banana                    January 0.1031409 0.7961291
5                     orange                      April 0.9537869 0.1840729
6                     orange                        May 0.3158263 0.8856582

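I want to create a data frame that binds the two wherever fruit AND month both match. The result should contain the attributes (X.1, X.2, fruit, month, Y.1, Y.2) for each matching fruit/month pair. For example, the first two rows of both data frames are matches. I hope this makes sense.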

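3 answers:

Answer 0 (score: 1)

Here is a data.table approach. If your real dataset is large, this will likely be faster than using merge(...) on data frames.

Note: the important bit is the last four lines. Also note that data.table does not care whether fruit and month are factors.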

set.seed(1)   # for reproducible example
X.1 <- runif(5)
X.2 <- runif(5)
fruit <- c("apple","apple","banana","orange","orange")
month <- c("January","February","March","April","May")

fruit.second <- c("apple","apple","apple","banana","orange","orange")
month.second <- c("January","February","March","January","April","May")
Y.1 <- runif(6)
Y.2 <- runif(6)

df <- data.frame(X.1,X.2,fruit,month)
df2 <- data.frame(fruit.second,month.second,Y.1,Y.2)

## This does the work.
library(data.table)
DT1 <- data.table(df,  key="fruit,month")
DT2 <- data.table(df2, key="fruit.second,month.second")
DT1[DT2,nomatch=0]
#     fruit    month       X.1        X.2       Y.1       Y.2
# 1:  apple February 0.3721239 0.94467527 0.1765568 0.9919061
# 2:  apple  January 0.2655087 0.89838968 0.2059746 0.7176185
# 3: orange    April 0.9082078 0.62911404 0.7698414 0.9347052
# 4: orange      May 0.2016819 0.06178627 0.4976992 0.2121425

Here is another way that is, in theory, slightly more efficient (though the code is a little looser).

setkey(setDT(df),fruit,month)
setkey(setDT(df2),fruit.second,month.second)
df[df2,nomatch=0]

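This approach converts df and df2 to data.tables "by reference", which loosely means without copying. setkey(...) then sorts them and sets the keys accordingly, and df[df2,...] performs the join. Using nomatch=0 excludes rows that have no matching values in the key columns (an inner join, in database terms).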
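To see what nomatch=0 changes, here is a minimal sketch that runs the same join with and without it. It uses data.table's on= argument in place of setkey(...), which is not what the answer above does and assumes a reasonably recent data.table version. The names on_cols, inner and all_i are made up for illustration.

```r
library(data.table)

set.seed(1)  # reproducible values
DT1 <- data.table(
  X.1 = runif(5), X.2 = runif(5),
  fruit = c("apple", "apple", "banana", "orange", "orange"),
  month = c("January", "February", "March", "April", "May")
)
DT2 <- data.table(
  fruit.second = c("apple", "apple", "apple", "banana", "orange", "orange"),
  month.second = c("January", "February", "March", "January", "April", "May"),
  Y.1 = runif(6), Y.2 = runif(6)
)

# Map DT1's column names to the differently named columns in DT2
# (on_cols is an illustrative name, not part of the original answer).
on_cols <- c(fruit = "fruit.second", month = "month.second")

# Inner join: rows of DT2 without a match in DT1 are dropped.
inner <- DT1[DT2, on = on_cols, nomatch = 0]

# Default behaviour: every row of DT2 is kept, and X.1/X.2 are NA
# wherever DT1 has no matching fruit/month pair.
all_i <- DT1[DT2, on = on_cols]

print(inner)
print(all_i)
```

With this data, inner should have the four matching rows, while all_i keeps all six rows of DT2 and fills X.1 and X.2 with NA for the two pairs (apple/March and banana/January) that are missing from DT1.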

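Answer 1 (score: 0)

I renamed fruit.second and month.second in df2 to fruit and month so that the by argument of merge is more convenient to use, but if you don't make that change you can just as easily do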

merge(
  x=df,
  y=df2,
  by.x=c("fruit","month"),
  by.y=c("fruit.second","month.second")
)

instead of the code below.

set.seed(1234)
X.1 <- runif(5)
set.seed(2345)
X.2 <- runif(5)
fruit <- c("apple","apple","banana","orange","orange")
month <- c("January","February","March","April","May")
##
fruit.second <- c("apple","apple","apple","banana","orange","orange")
month.second <- c("January","February","March","January","April","May")
set.seed(3456)
Y.1 <- runif(6)
set.seed(4567)
Y.2 <- runif(6)
##
df <- data.frame(
  X.1,X.2,
  fruit=as.character(fruit),
  month=as.character(month),
  stringsAsFactors=FALSE)
##
df2 <- data.frame(
  fruit=as.character(fruit.second),
  month=as.character(month.second),
  Y.1,Y.2,
  stringsAsFactors=FALSE)
##
merge(
  df,
  df2,
  by=c("fruit","month")
)
##
   fruit    month       X.1       X.2       Y.1       Y.2
1  apple February 0.6222994 0.1950251 0.7618600 0.7412554
2  apple  January 0.1137034 0.1167435 0.7785807 0.2309186
3 orange    April 0.6233794 0.0344546 0.5071998 0.5996399
4 orange      May 0.8609154 0.4751201 0.7980290 0.2773313
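As a quick sanity check that the two merge() calls in this answer are interchangeable, the sketch below builds df2 with the original .second names, merges once with by.x/by.y and once after renaming, and compares the results. The object names by_xy, df2_renamed and by_same are made up for illustration, and the random values differ from those printed above.

```r
set.seed(1234)  # reproducible values
df <- data.frame(
  X.1 = runif(5), X.2 = runif(5),
  fruit = c("apple", "apple", "banana", "orange", "orange"),
  month = c("January", "February", "March", "April", "May"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  fruit.second = c("apple", "apple", "apple", "banana", "orange", "orange"),
  month.second = c("January", "February", "March", "January", "April", "May"),
  Y.1 = runif(6), Y.2 = runif(6),
  stringsAsFactors = FALSE
)

# Variant 1: keep the differing column names and map them explicitly.
by_xy <- merge(
  df, df2,
  by.x = c("fruit", "month"),
  by.y = c("fruit.second", "month.second")
)

# Variant 2: rename the key columns in a copy of df2, then use by=.
df2_renamed <- df2
names(df2_renamed)[1:2] <- c("fruit", "month")
by_same <- merge(df, df2_renamed, by = c("fruit", "month"))

# Both variants should give the same inner join.
print(all.equal(by_xy, by_same))
print(by_xy)
```

By default merge() keeps only rows whose keys appear in both data frames; passing all = TRUE, all.x = TRUE or all.y = TRUE keeps unmatched rows as well.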

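Answer 2 (score: -1)

This sounds like a join operation, which is implemented efficiently in the dplyr package, for example.

There are 4 or 5 types of join operations; check the relevant documentation or vignette. You may have to rename columns in order to use join operations that identify matches by column name.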
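For what it's worth, a minimal sketch of how this might look with dplyr, assuming the package is installed. The answer itself gives no code, so the variable names (key_map, matched, left_all, unmatched) are illustrative. Passing a named vector to by= maps the differing column names, so renaming is optional.

```r
library(dplyr)

set.seed(1)  # reproducible values
df <- data.frame(
  X.1 = runif(5), X.2 = runif(5),
  fruit = c("apple", "apple", "banana", "orange", "orange"),
  month = c("January", "February", "March", "April", "May"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  fruit.second = c("apple", "apple", "apple", "banana", "orange", "orange"),
  month.second = c("January", "February", "March", "January", "April", "May"),
  Y.1 = runif(6), Y.2 = runif(6),
  stringsAsFactors = FALSE
)

# Named vector: names are columns of df, values are columns of df2.
key_map <- c("fruit" = "fruit.second", "month" = "month.second")

# Keep only fruit/month pairs present in both data frames.
matched <- inner_join(df, df2, by = key_map)

# Keep every row of df; Y.1/Y.2 are NA where df2 has no match.
left_all <- left_join(df, df2, by = key_map)

# Rows of df that have no match in df2.
unmatched <- anti_join(df, df2, by = key_map)

print(matched)
```

Here matched should contain the four matching pairs with columns X.1, X.2, fruit, month, Y.1, Y.2, which is the layout the question asks for. right_join(), full_join() and semi_join() cover the remaining join types.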