Question

Given the following two data frames, how can I:

DF1:

cat1        cat2
a           NA
b           NA
c           NA
d           NA
e           NA

DF2:

cat1        cat2
c           1
d           2

produce the following result as efficiently as possible?

cat1        cat2
a           NA
b           NA
c           1
d           2
e           NA

When I do this:

df3 <- rbind(df2,df1[!(df1$cat1 %in% df2$cat1),])
merge(df1,df3,all.y=TRUE)

I get the desired data frame. But is there a tidier, and perhaps more efficient, way to do this? (This is just dummy data; in reality I have 700k rows.)

Answer 1

How about this:

df1 <-read.table(text="cat1        cat2
a           NA
b           NA
c           NA
d           NA
e           NA",header=TRUE,stringsAsFactors=FALSE)

df2<-read.table(text="cat1        cat2
c           1
d           2",header=TRUE, stringsAsFactors=FALSE)

df1[df1$cat1%in%df2$cat1,] <-df2

  cat1 cat2
1    a   NA
2    b   NA
3    c    1
4    d    2
5    e   NA
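
One caveat about the assignment above: `df1[rows, ] <- df2` copies values by position, so it gives the right result only when the rows of df2 appear in the same order as the matching rows of df1. Here is a minimal sketch of an order-independent variant using `match()`. The toy data below is my own, deliberately listing df2 out of order, and it assumes the keys in cat1 are unique within df1.

```r
# Hypothetical toy data mirroring the example above
df1 <- data.frame(cat1 = c("a", "b", "c", "d", "e"), cat2 = NA_integer_,
                  stringsAsFactors = FALSE)
# df2 is deliberately listed in a different order than df1
df2 <- data.frame(cat1 = c("d", "c"), cat2 = c(2L, 1L),
                  stringsAsFactors = FALSE)

# For each row of df2, find the position of its key in df1 (NA if absent)
idx <- match(df2$cat1, df1$cat1)
found <- !is.na(idx)

# Copy values across by position, skipping keys that are not in df1
df1$cat2[idx[found]] <- df2$cat2[found]
df1
```

Because `match()` returns the first position of each key, duplicated keys in df1 would need separate handling.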

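EDIT

I ran microbenchmark on your solution, my solution, and the data.table solution from the other answer, and mine is by far the fastest.

The mean time over 1000 runs of df1[df1$cat1%in%df2$cat1,] <-df2 is 80 microseconds. By comparison, the data.table solution takes 690 microseconds and your solution takes 1062 microseconds in total (rbind plus merge). So my solution is roughly an order of magnitude faster.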

library(microbenchmark)
res <- microbenchmark(
  rbind(df2,df1[!(df1$cat1 %in% df2$cat1),]),
  merge(df1,df3,all.y=TRUE),
  df1[df1$cat1%in%df2$cat1,] <-df2,
  dat1[dat][,1:2,with=T],
  times=1000L)

print(res)
Unit: microseconds
                                         expr     min       lq     mean   median       uq      max neval
 rbind(df2, df1[!(df1$cat1 %in% df2$cat1), ]) 242.395 260.3555 279.3699 268.3550 277.5615 2817.263  1000
                merge(df1, df3, all.y = TRUE) 679.488 724.1640 783.2416 740.1625 761.5940 6756.541  1000
         df1[df1$cat1 %in% df2$cat1, ] <- df2  63.392  72.1450  80.0050  75.1640  80.5975 2017.334  1000
                   dat1[dat][, 1:2, with = T] 602.816 649.6040 690.9846 665.3010 691.2615 3264.319  1000

EDIT 2

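Here is another microbenchmark, this time with 100,000 data points and with the setkeyv steps for data.table included. Base indexing (df[df$cat1 %in% df1$cat1, ] <- df1) is slightly faster (mean 7 ms) than the data.table steps combined (7.4 ms), but not by much. Which one is more efficient will depend on the OP's actual dataset.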

library(data.table)
dat <- data.table(cat1=c(paste0("a",1:100000)),cat2=rep(NA,100000))
dat1 <- data.table(cat1=c(paste0("a",sample(1:100000,10001))),cat2=1:10001)
setkeyv(dat,"cat1")
setkeyv(dat1,"cat1")
df <- data.frame(dat)
df1 <- data.frame(dat1)

library(microbenchmark)
res <- microbenchmark(
  merge(df,df1,all.y=TRUE),
  df[df$cat1 %in% df1$cat1, ] <- df1,
  setkeyv(dat,"cat1"),
  setkeyv(dat1,"cat1"),
  dat1[dat][,1:2,with=T],
  times=100L)

print(res)
Unit: microseconds
                               expr       min        lq       mean    median         uq       max neval cld
       merge(df, df1, all.y = TRUE) 96573.600 98317.435 115509.544 102872.81 130325.979 195910.42   100   d
 df[df$cat1 %in% df1$cat1, ] <- df1  4329.293  4785.601   7059.100   5054.74   5632.501  40521.16   100   c
               setkeyv(dat, "cat1")  1166.073  1568.211   1928.071   1766.36   1913.329  14256.59   100  ab
              setkeyv(dat1, "cat1")   215.253   296.935    434.589    443.05    506.629   1279.54   100  a
         dat1[dat][, 1:2, with = T]  3531.004  4020.242   5024.882   4195.72   4587.026  34787.45   100  bc
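
Since the outcome depends on the real data, it may be worth repeating the comparison at the size mentioned in the question. The following is only a rough sketch in base R: the names big and upd, the seed, and the 10,000-row update table are my own assumptions, and `system.time()` gives a single coarse measurement rather than the repeated runs microbenchmark reports. It times the `%in%` assignment against a `match()`-based update and checks that both produce the same columns.

```r
set.seed(1)        # arbitrary seed, only so the sample is reproducible
n <- 700000L       # row count mentioned in the question

# Hypothetical stand-ins for the real data
big <- data.frame(cat1 = paste0("a", seq_len(n)), cat2 = NA_integer_,
                  stringsAsFactors = FALSE)
# Update rows are kept in the same order as in big, which the %in% form needs
upd <- data.frame(cat1 = paste0("a", sort(sample.int(n, 10000L))),
                  cat2 = seq_len(10000L), stringsAsFactors = FALSE)

# Positional assignment via %in%
by_in <- big
t_in <- system.time(by_in[by_in$cat1 %in% upd$cat1, ] <- upd)

# Key lookup via match(), which does not depend on row order
by_match <- big
t_match <- system.time({
  idx <- match(upd$cat1, by_match$cat1)
  by_match$cat2[idx] <- upd$cat2
})

# Both approaches should fill in the same values
same_result <- identical(by_in$cat2, by_match$cat2) &&
  identical(by_in$cat1, by_match$cat1)
same_result
rbind(t_in, t_match)
```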

Efficiently merging data frames with NAs
