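Please see the R code below. I am trying to left join on multiple variables, but I am not getting the expected output. Can anyone help me find what is wrong with this code?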
====================
x <- data.frame(a= c(1, 2,3, 4, 5, 6, 7, 8, 9, 10) ,
b = c("a","b","c","d","e","f","g","h","i","j") ,
c =c(12,34,56,776,88, 99, 44, 90, 88, 55),
d =c( "AA","BB","CC", "DD","EE","FF","GG","BB","AA","BB") ,
e = c("as","saa","sxs","xxz", "dcd","cc", "ccd","xx","cdc", "hghg" ),
f = c(12, 23, 455, 44, 34, 66, 44, 55, 44,11) )
y <- data.frame(g= c(1, 2, 3 ,4, 5, 6, 7) ,
h = c("a","b","c","d","e","f","g"),
i =c(12, 88, 99, 88, 55, 44, 66) ,
j =c("AA", "EE", "FF", "AA","BB","GG","ii"),
k = c("12","34","df","56","4","fdd","ff" ))
join_df <- "select
x.* , y.i, y.k*
from x
left join y
on x.a = y.g AND x.b = y.h AND x.d = y.j"
x[order(x$a,x$b,x$d),]
y[order(y$g,y$h,y$j),]
mxy <- sqldf(join_df)
mxy
==========================
Thanks,
Naresh
Answer 0 (score: 0)
We can use left_join from dplyr:
library(dplyr)
left_join(x, y, by = c(a="g", b="h", d="j"))
Answer 1 (score: 0)
Modify the join query as shown below, removing the stray * after y.k:
join_df <- "select
x.* , y.i, y.k
from x left outer join y
on x.a = y.g AND x.b = y.h AND x.d = y.j"
library(sqldf)
sqldf(join_df)
#    a b   c  d    e   f  i    k
#1   1 a  12 AA   as  12 12   12
#2   2 b  34 BB  saa  23 NA <NA>
#3   3 c  56 CC  sxs 455 NA <NA>
#4   4 d 776 DD  xxz  44 NA <NA>
#5   5 e  88 EE  dcd  34 NA <NA>
#6   6 f  99 FF   cc  66 NA <NA>
#7   7 g  44 GG  ccd  44 NA <NA>
#8   8 h  90 BB   xx  55 NA <NA>
#9   9 i  88 AA  cdc  44 NA <NA>
#10 10 j  55 BB hghg  11 NA <NA>
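The NA values in rows 2 to 10 are expected rather than a bug: a left join keeps every row of x, and only the first row of x has a row in y that agrees on all three keys. A minimal base R sketch for checking this, assuming only the key columns from the question (the names key_x, key_y, and matched are made up for illustration):

```r
# Key columns copied from the question's x and y (other columns omitted)
x <- data.frame(a = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
                b = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j"),
                d = c("AA", "BB", "CC", "DD", "EE", "FF", "GG", "BB", "AA", "BB"))
y <- data.frame(g = c(1, 2, 3, 4, 5, 6, 7),
                h = c("a", "b", "c", "d", "e", "f", "g"),
                j = c("AA", "EE", "FF", "AA", "BB", "GG", "ii"))

# Build one composite key per row so the three join conditions
# can be compared in a single step
key_x <- paste(x$a, x$b, x$d, sep = "|")
key_y <- paste(y$g, y$h, y$j, sep = "|")

# TRUE where a row of x has a partner in y on all three keys
matched <- key_x %in% key_y
which(matched)

# Number of x rows that match when only a/g and b/h are compared
n_two_keys <- sum(paste(x$a, x$b, sep = "|") %in% paste(y$g, y$h, sep = "|"))
n_two_keys
```

If more matches were expected, the d/j condition is the place to look: rows 1 to 7 agree on the first two keys, but only row 1 also agrees on the third.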
Using base R, we can use merge:
merge(x, y, by.x=c('a', 'b', 'd'), by.y=c('g', 'h', 'j'), all.x=TRUE)
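Note that merge places the key columns first in its result, so the column order differs from the sqldf output. A hedged sketch, assuming the x and y data frames from the question, that puts the columns back in the original order (res and res_ordered are illustrative names, not part of the original answer):

```r
# Data frames copied from the question
x <- data.frame(a = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
                b = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j"),
                c = c(12, 34, 56, 776, 88, 99, 44, 90, 88, 55),
                d = c("AA", "BB", "CC", "DD", "EE", "FF", "GG", "BB", "AA", "BB"),
                e = c("as", "saa", "sxs", "xxz", "dcd", "cc", "ccd", "xx", "cdc", "hghg"),
                f = c(12, 23, 455, 44, 34, 66, 44, 55, 44, 11))
y <- data.frame(g = c(1, 2, 3, 4, 5, 6, 7),
                h = c("a", "b", "c", "d", "e", "f", "g"),
                i = c(12, 88, 99, 88, 55, 44, 66),
                j = c("AA", "EE", "FF", "AA", "BB", "GG", "ii"),
                k = c("12", "34", "df", "56", "4", "fdd", "ff"))

# Left join on three key pairs; all.x = TRUE keeps unmatched rows of x
res <- merge(x, y, by.x = c("a", "b", "d"), by.y = c("g", "h", "j"),
             all.x = TRUE)
names(res)  # key columns a, b, d come first

# Restore the column order of x, followed by the columns taken from y
res_ordered <- res[, c(names(x), "i", "k")]
res_ordered
```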
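The following benchmark compares the performance of the different join methods on this small example. The dplyr join is the fastest (median of about 0.6 ms, versus about 1.7 ms for merge and about 15 ms for sqldf).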
library('microbenchmark')
microbenchmark(sqldf_join=sqldf(join_df),
merge=merge(x, y, by.x=c('a', 'b', 'd'), by.y=c('g', 'h', 'j'), all.x=TRUE),
dplyr_join=left_join(x, y, by = c("a"="g", "b"="h", "d"="j")), times=100)
#Unit: microseconds
#       expr       min        lq       mean     median        uq       max neval cld
# sqldf_join 14068.131 14593.290 15520.0908 15072.2635 16411.678 19768.767   100   c
#      merge  1590.021  1669.992  1770.4099  1724.3045  1772.201  3834.781   100  b
# dplyr_join   506.343   561.298   608.4422   607.0565   646.187   837.349   100 a
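Timings on a 10-row data frame may not reflect behaviour on larger inputs, so it can be worth repeating the comparison at a more realistic size. A rough base R sketch, assuming synthetic data (big_x, big_y, and n are hypothetical and not from the original post); the same pattern could be wrapped around the sqldf or left_join calls:

```r
# Synthetic data for illustration only; sizes and column contents are assumptions
set.seed(1)
n <- 1e4
big_x <- data.frame(a = seq_len(n),
                    b = sample(letters, n, replace = TRUE))
big_y <- data.frame(g = seq_len(n),
                    h = sample(letters, n, replace = TRUE),
                    i = runif(n))

# Time a two-key left join with base R; res is assigned in the calling frame
elapsed <- system.time(
  res <- merge(big_x, big_y, by.x = c("a", "b"), by.y = c("g", "h"),
               all.x = TRUE)
)["elapsed"]

elapsed    # wall-clock seconds; varies by machine
nrow(res)  # one row per row of big_x, because g is unique in big_y
```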