Hi everyone. I want to bind two data frames in R wherever two of their columns match.
X.1 <- runif(5)
X.2 <- runif(5)
fruit <- c("apple","apple","banana","orange","orange")
month <- c("January","February","March","April","May")
fruit.second <- c("apple","apple","apple","banana","orange","orange")
month.second <- c("January","February","March","January","April","May")
Y.1 <- runif(6)
Y.2 <- runif(6)
df <- data.frame(X.1,X.2,as.character(fruit),as.character(month))
df
X.1 X.2 as.character.fruit. as.character.month.
1 0.08694442 0.67541559 apple January
2 0.50374582 0.04485657 apple February
3 0.50482380 0.76090011 banana March
4 0.75920285 0.61077744 orange April
5 0.95243661 0.18064744 orange May
df2 <- data.frame(as.character(fruit.second),as.character(month.second),Y.1,Y.2)
df2
as.character.fruit.second. as.character.month.second. Y.1 Y.2
1 apple January 0.3407055 0.5740400
2 apple February 0.1529912 0.8163872
3 apple March 0.1042926 0.9807348
4 banana January 0.1031409 0.7961291
5 orange April 0.9537869 0.1840729
6 orange May 0.3158263 0.8856582
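I want to create a data frame that binds the two wherever fruit AND month both match. The result should contain the attributes (X.1, X.2, fruit, month, Y.1, Y.2) for the rows where fruit and month match. For example, the first two rows of both data frames are matches. I hope this makes sense.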
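Answer 0 (score: 1)
Here is a data.table approach. If your real dataset is large, this will be faster than using merge(...) on data frames.
Note: the important bit is the last four lines. Also note that data.table does not care that fruit and month are factors.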
set.seed(1) # for reproducible example
X.1 <- runif(5)
X.2 <- runif(5)
fruit <- c("apple","apple","banana","orange","orange")
month <- c("January","February","March","April","May")
fruit.second <- c("apple","apple","apple","banana","orange","orange")
month.second <- c("January","February","March","January","April","May")
Y.1 <- runif(6)
Y.2 <- runif(6)
df <- data.frame(X.1,X.2,fruit,month)
df2 <- data.frame(fruit.second,month.second,Y.1,Y.2)
## This does the work.
library(data.table)
DT1 <- data.table(df, key="fruit,month")
DT2 <- data.table(df2, key="fruit.second,month.second")
DT1[DT2,nomatch=0]
# fruit month X.1 X.2 Y.1 Y.2
# 1: apple February 0.3721239 0.94467527 0.1765568 0.9919061
# 2: apple January 0.2655087 0.89838968 0.2059746 0.7176185
# 3: orange April 0.9082078 0.62911404 0.7698414 0.9347052
# 4: orange May 0.2016819 0.06178627 0.4976992 0.2121425
Here is another way that is, in theory, slightly more efficient (though the code is a little less explicit).
setkey(setDT(df),fruit,month)
setkey(setDT(df2),fruit.second,month.second)
df[df2,nomatch=0]
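This approach converts df and df2 to data.tables "by reference", which loosely means without copying them. Then setkey(...) sorts them and sets the keys appropriately. Then df[df2,...] performs the join. Using nomatch=0 excludes rows that have no matching values in the key columns (an inner join, in database terminology).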
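To make the effect of nomatch=0 concrete, here is a minimal sketch that compares the default join with the inner join. The X.1 and Y.1 values are made up for illustration, and both tables use the same key column names, which is an assumption that differs from df2 above.

```r
library(data.table)

# Illustrative tables; the numeric values are placeholders, not the question's data
DT1 <- data.table(
  fruit = c("apple", "apple", "banana", "orange", "orange"),
  month = c("January", "February", "March", "April", "May"),
  X.1   = 1:5
)
DT2 <- data.table(
  fruit = c("apple", "apple", "apple", "banana", "orange", "orange"),
  month = c("January", "February", "March", "January", "April", "May"),
  Y.1   = 11:16
)
setkey(DT1, fruit, month)
setkey(DT2, fruit, month)

all_rows <- DT1[DT2]               # default: every row of DT2 is kept, X.1 is NA where unmatched
inner    <- DT1[DT2, nomatch = 0]  # rows of DT2 without a match in DT1 are dropped

nrow(all_rows)             # 6
sum(is.na(all_rows$X.1))   # 2 (apple/March and banana/January have no match)
nrow(inner)                # 4
```

Without nomatch=0 the two unmatched rows of DT2 remain in the result with NA in the columns that come from DT1.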
Answer 1 (score: 0)
I renamed fruit.second and month.second in df2 to fruit and month, for convenience with the by argument of merge, but you could just as easily skip that change and use
merge(
x=df,
y=df2,
by.x=c("fruit","month"),
by.y=c("fruit.second","month.second")
)
instead of the code below.
set.seed(1234)
X.1 <- runif(5)
set.seed(2345)
X.2 <- runif(5)
fruit <- c("apple","apple","banana","orange","orange")
month <- c("January","February","March","April","May")
##
fruit.second <- c("apple","apple","apple","banana","orange","orange")
month.second <- c("January","February","March","January","April","May")
set.seed(3456)
Y.1 <- runif(6)
set.seed(4567)
Y.2 <- runif(6)
##
df <- data.frame(
X.1,X.2,
fruit=as.character(fruit),
month=as.character(month),
stringsAsFactors=FALSE)
##
df2 <- data.frame(
fruit=as.character(fruit.second),
month=as.character(month.second),
Y.1,Y.2,
stringsAsFactors=FALSE)
##
merge(
df,
df2,
by=c("fruit","month")
)
##
fruit month X.1 X.2 Y.1 Y.2
1 apple February 0.6222994 0.1950251 0.7618600 0.7412554
2 apple January 0.1137034 0.1167435 0.7785807 0.2309186
3 orange April 0.6233794 0.0344546 0.5071998 0.5996399
4 orange May 0.8609154 0.4751201 0.7980290 0.2773313
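Answer 2 (score: -1)
This sounds like a join operation, which is implemented efficiently in the dplyr package, for example.
There are four or five types of join operation; check the documentation or vignettes to pick the right one. You may have to rename columns in order to use join operations that identify matches by column name.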
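As a rough illustration of how the join types differ, here is a sketch using base R's merge(...) rather than dplyr, so it runs without extra packages. The X.1 and Y.1 values are placeholders, and the key columns are assumed to have already been renamed to fruit and month in both data frames. dplyr's inner_join, left_join, right_join and full_join correspond to the four cases.

```r
# Illustrative data; the numeric values are placeholders, not the question's data
df <- data.frame(
  fruit = c("apple", "apple", "banana", "orange", "orange"),
  month = c("January", "February", "March", "April", "May"),
  X.1   = 1:5,
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  fruit = c("apple", "apple", "apple", "banana", "orange", "orange"),
  month = c("January", "February", "March", "January", "April", "May"),
  Y.1   = 11:16,
  stringsAsFactors = FALSE
)
keys <- c("fruit", "month")

inner <- merge(df, df2, by = keys)                # only rows matching in both
left  <- merge(df, df2, by = keys, all.x = TRUE)  # every row of df
right <- merge(df, df2, by = keys, all.y = TRUE)  # every row of df2
full  <- merge(df, df2, by = keys, all = TRUE)    # every row of either

sapply(list(inner = inner, left = left, right = right, full = full), nrow)
# inner 4, left 5, right 6, full 7
```

The inner join is the one the question asks for; the other three keep unmatched rows and fill the missing side with NA.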