Question

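I have a large dataset.

Using mapply, I timed the code for a single instance (I have 400,000 instances):

   user  system elapsed
   0.49    0.05    0.53

The function takes 2 arguments as input.

I got the idea from this link.

Is there an apply-style approach that would run this code efficiently?

Edit: to give a better idea of the code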

`myfun <- function(x, y)
  return(length(which(x <= gh2$data_start & y >= gh2$data_end)))

A$V1 <- sample(50000)
A$V2 <- sample(50000)
output <- mapply(myfun, A$V1, A$V2)`

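gh2 is a data frame with 1 billion rows. For a single search along this large gh2 data frame, the which function alone takes 0.30 seconds. The intent is to find how many rows satisfy the condition. Is there a more efficient way to do this?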

Answer 1

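You still haven't told us enough to reproduce your problem, but maybe the example below will do. tl;dr I can save about 10% by using sum() instead of length(which()) (I'm surprised it isn't more...) and get a 5x speedup using Rcpp.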

Generate example data:

set.seed(101)
n1 <- 1e4; n2 <- 1e3  
gh2 <- data.frame(data_start=rnorm(n1),data_end=rnorm(n1))

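Try both a regular data frame and a tbl_df from dplyr (also, data_frame, which current releases call tibble(), is more convenient for generating the data because it allows on-the-fly transformations: a later column can refer to an earlier one).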

library("dplyr")
## tibble() is the current name for the deprecated data_frame()
A <- tibble(V1=rnorm(n2),
            V2=V1+runif(n2))
A0 <- as.data.frame(A)

The original function and a base-R alternative using sum():

fun1 <- function(x,y)
    return(length(which(x<=gh2$data_start & y>=gh2$data_end)))
fun2 <- function(x,y)
    return(sum(x<=gh2$data_start & y>=gh2$data_end))

Check:

all.equal(with(A0, mapply(fun1, V1, V2)),
          with(A, mapply(fun2, V1, V2)))  ## TRUE
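
One caveat on this swap, shown as a small sketch with a made-up logical vector x (not part of the original data): length(which()) and sum() agree only when the comparison produces no NA. which() silently drops NA, whereas sum() propagates it unless na.rm = TRUE is set.

```r
# Hypothetical logical vector containing a missing value
x <- c(TRUE, FALSE, NA, TRUE)

n_which  <- length(which(x))      # which() ignores NA
n_sum    <- sum(x)                # NA propagates through sum()
n_sum_rm <- sum(x, na.rm = TRUE)  # matches length(which(x))
```

The simulated rnorm() data above contain no missing values, so the two functions return the same counts there.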

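Now an Rcpp version. This could almost certainly be shortened or made slicker, but I'm not very experienced with this framework (and doing so is unlikely to make a big difference in speed).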

library("Rcpp")
cppFunction("
NumericVector fun3(NumericVector d_start, NumericVector d_end,
                     NumericVector lwr, NumericVector upr) {
   int i, j;
   int n1 = lwr.size();
   int n2 = d_start.size();

   NumericVector res(n1);

   for (i=0; i<n1; i++) {
       res[i]=0;
       for (j=0; j<n2; j++) {
            if (lwr[i]<=d_start[j] && upr[i]>=d_end[j]) res[i]++;
       }
   }
   return res;
}
")

Check:

f3 <- fun3(gh2$data_start,gh2$data_end, A$V1,A$V2)
f1 <- with(A0, mapply(fun1, V1, V2))
all.equal(f1,f3)  ## TRUE

Benchmark:

library(rbenchmark)
benchmark(fun1.0= with(A0, mapply(fun1, V1, V2)),
          fun2.0= with(A0, mapply(fun2, V1, V2)),  ## data.frame
          fun2  = with(A, mapply(fun2, V1, V2)),   ## dplyr-style
          fun3 = fun3(gh2$data_start,gh2$data_end, A$V1,A$V2),
          columns=c("test", "replications", "elapsed", "relative"),
          replications=30
          )
##     test replications elapsed relative
## 1 fun1.0           30   7.813    5.699
## 3   fun2           30   6.834    4.985
## 2 fun2.0           30   6.841    4.990
## 4   fun3           30   1.371    1.000

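There is no appreciable difference between data.frame and tbl_df.
sum() is about 10% faster than length(which()).
Rcpp is about 5 times faster than base R.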
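
For inputs small enough that an n2 x n1 logical matrix fits in memory, the same counts can also be computed without mapply at all. The following is only a sketch of that swapped-in technique (base R outer() plus rowSums()), run on deliberately smaller stand-in data than the benchmark. It was not part of the timings above, and it will not scale to the sizes in the question.

```r
set.seed(101)
n1 <- 1000; n2 <- 100   # deliberately small: this builds n2 x n1 matrices
gh2 <- data.frame(data_start = rnorm(n1), data_end = rnorm(n1))
A0 <- data.frame(V1 = rnorm(n2))
A0$V2 <- A0$V1 + runif(n2)

# Element [i, j] is TRUE when row j of gh2 satisfies the condition for row i of A0
hits <- outer(A0$V1, gh2$data_start, "<=") & outer(A0$V2, gh2$data_end, ">=")
counts_outer <- rowSums(hits)

# Reference: the mapply() + sum() approach from above
fun2 <- function(x, y) sum(x <= gh2$data_start & y >= gh2$data_end)
counts_mapply <- mapply(fun2, A0$V1, A0$V2)

all.equal(counts_outer, counts_mapply)
```

The trade-off is memory for speed: every pairwise comparison is materialised at once, which is why the Rcpp loop remains the better fit for large inputs.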

In principle the Rcpp function could be combined with parallel::mcmapply:

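## iterate over the rows of A only; pass the gh2 columns whole via MoreArgs
parallel::mcmapply(fun3, lwr=A$V1, upr=A$V2,
                   MoreArgs=list(d_start=gh2$data_start, d_end=gh2$data_end),
                   mc.cores=4)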

But for the sizes in the example above, the overhead is too high for this to be worthwhile.
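
If per-call overhead is the concern, one option is to hand each worker a block of rows rather than a single row. A minimal sketch, assuming a Unix-alike (mclapply() forks, so more than one core is not supported on Windows). The helper count_chunk, the choice of 4 chunks and the small stand-in data are all illustrative, and a pure-R counter stands in for fun3 so that the block is self-contained:

```r
library(parallel)

set.seed(101)
gh2 <- data.frame(data_start = rnorm(1e4), data_end = rnorm(1e4))
A <- data.frame(V1 = rnorm(1e3))
A$V2 <- A$V1 + runif(1e3)

# Hypothetical helper: count matching gh2 rows for each row index in idx
count_chunk <- function(idx) {
  vapply(idx, function(i) {
    sum(A$V1[i] <= gh2$data_start & A$V2[i] >= gh2$data_end)
  }, integer(1))
}

# Forking is unavailable on Windows, so fall back to one core there
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Split the row indices of A into 4 contiguous chunks, one task per chunk
rows <- seq_len(nrow(A))
chunks <- split(rows, cut(rows, 4, labels = FALSE))

# Chunks come back in order, so unlist() restores the original row order
res <- unlist(mclapply(chunks, count_chunk, mc.cores = n_cores),
              use.names = FALSE)
```

Whether this pays off depends on the data size and the number of cores, so it is worth benchmarking against the serial version before relying on it.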

Best apply function for large datasets that takes multiple arguments
