Question

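I need to draw 2 million observations from a data set of 23 million. The code below takes a very long time to finish: on a Xeon CPU with 16 GB of RAM it was still running after 12 hours. I also noticed that CPU usage stays at 25% and disk usage at 43%. How can I make the sampling faster? These are the two lines of code I am using: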

prb <- ifelse(dat$target=='1', 1.0, 0.05)
smpl <- dat[sample(nrow(dat), 2000000, prob = prb), ]

Answer 1

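Calling sample with unequal probabilities and replace = FALSE probably does not do exactly what you want: it draws one element, rescales the remaining probabilities so that they sum to one, draws the next element, and so on. This is slow, and the resulting inclusion probabilities no longer match the ones you supplied.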

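In your case, one solution is to split the data set into two parts (target == '1' and target != '1') and draw a separate sample from each. You only need to work out how many elements to select in each group.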
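A minimal sketch of the split-and-sample idea, not code from the original answer. The helper name sample_by_group and the way the group sizes are derived (in proportion to weight times group size) are assumptions made for illustration; the weights 1.0 and 0.05 are taken from the question.

```r
# Hypothetical helper: draw n row indices, sampling each group separately
# with equal probabilities inside the group.
sample_by_group <- function(target, n, w1 = 1.0, w0 = 0.05) {
  idx1 <- which(target == "1")
  idx0 <- which(target != "1")
  # Assumed allocation: share of the sample proportional to weight * group size
  n1 <- round(n * w1 * length(idx1) / (w1 * length(idx1) + w0 * length(idx0)))
  n1 <- min(n1, length(idx1))   # cannot take more rows than the group has
  n0 <- n - n1
  stopifnot(n0 <= length(idx0))
  c(idx1[sample.int(length(idx1), n1)],
    idx0[sample.int(length(idx0), n0)])
}

# Small demo on simulated data
set.seed(1)
target <- sample(c("0", "1"), size = 1e5, replace = TRUE)
idx <- sample_by_group(target, n = 1e4)
table(target[idx])
```

With the real data this would be used as smpl <- dat[sample_by_group(dat$target, 2e6), ]. Equal-probability sampling without replacement inside each group is fast because no probabilities have to be rescaled after every draw.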

Another solution is to use one of the sampling methods from the sampling package, for example systematic sampling:

library(sampling)

nsample <- 2E6

# Scale probabilities: add up to the number of elements we want
prb <- nsample/sum(prb) * prb

# Sample: smpl is a 0/1 indicator vector with one entry per row of dat
smpl <- UPrandomsystematic(prb)

This takes about 3 seconds on my system.

Checking the output:

> t <- table(smpl, prb)
> sum(smpl)
[1] 2e+06
> t[2,2]/t[2,1]
[1] 19.96854

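So we did select 2E6 records, and records with target == 1 were selected about 20 times as often as records with target != 1. That matches the 1.0 versus 0.05 weights when the two groups are of similar size.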
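To show what randomized systematic sampling does, here is a base-R sketch of the idea. It is an illustration only, not the code of sampling::UPrandomsystematic; the function name systematic_sketch is hypothetical, and the sketch assumes that every inclusion probability is at most 1 and that they sum to the desired sample size.

```r
# Hypothetical illustration of randomized systematic sampling.
# pik: inclusion probabilities, each <= 1, summing to the sample size.
systematic_sketch <- function(pik) {
  N <- length(pik)
  ord <- sample.int(N)        # put the population in random order
  cum <- cumsum(pik[ord])     # cumulative inclusion probabilities
  u <- runif(1)               # random start in (0, 1)
  # A unit is selected when one of the points u, u + 1, u + 2, ...
  # falls into its interval (cum[j - 1], cum[j]]
  hits <- floor(cum - u) - floor(c(0, cum[-N]) - u)
  selected <- integer(N)
  selected[ord] <- hits       # map back to the original row order
  selected
}

# Small demo with the weights from the question
set.seed(1)
target <- sample(0:1, size = 1e4, replace = TRUE)
prb <- ifelse(target == 1, 1.0, 0.05)
pik <- 1000 / sum(prb) * prb  # scale so the probabilities sum to 1000
s <- systematic_sketch(pik)
table(s, target)
```

The whole sample is obtained in a single pass over the cumulative sums, which is why this approach does not suffer from the per-draw rescaling that makes sample slow.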

Answer 2

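The bottleneck is sample, as Jan van der Laan mentioned.

When you need to sample without replacement (and the sample size is at least 5 times smaller than the population size), one solution is rejection sampling. You can sample with replacement twice as many values as you need and keep only the first n unique ones.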

N <- 23e6
dat <- data.frame(
  target = sample(0:1, size = N, replace = TRUE),
  x = rnorm(N)
)      
prb <- ifelse(dat$target == 1, 1.0, 0.05)
n <- 2e6

Rcpp::sourceCpp('sample-fast.cpp')
sample_fast <- function(n, prb) {
  N <- length(prb)
  sample_more <- sample.int(N, size = 2 * n, prob = prb, replace = TRUE)
  get_first_unique(sample_more, N, n)
}

where 'sample-fast.cpp' contains

#include <Rcpp.h>
using namespace Rcpp;


// [[Rcpp::export]]
IntegerVector get_first_unique(const IntegerVector& ind_sample, int N, int n) {

  LogicalVector is_chosen(N);
  IntegerVector ind_chosen(n);

  int i, k, ind;

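  const int n_draws = ind_sample.size();

  for (k = 0, i = 0; i < n; i++) {
    do {
      // Fail cleanly instead of reading past the end of ind_sample
      if (k >= n_draws) stop("Not enough unique values; draw a larger sample.");
      ind = ind_sample[k++];
    } while (is_chosen[ind-1]);  // skip indices that were already picked
    is_chosen[ind-1] = true;
    ind_chosen[i] = ind;
  }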

  return ind_chosen;
}

Then you get:

system.time(ind <- sample_fast(n, prb))

in less than 1 second.
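For readers who would rather avoid compiling C++, the same keep-the-first-unique-values idea can be sketched in plain R. This is not the answer's implementation: the name sample_fast_r and the oversample argument are hypothetical, and the loop that keeps drawing when too few unique values turn up is an addition of this sketch. It relies on unique() returning values in order of first appearance.

```r
# Hypothetical pure-R variant of sample_fast()
sample_fast_r <- function(n, prb, oversample = 2) {
  N <- length(prb)
  stopifnot(n <= sum(prb > 0))  # otherwise n distinct indices cannot exist
  chosen <- integer(0)
  while (length(chosen) < n) {
    # Draw with replacement, which is fast even with unequal probabilities
    draws <- sample.int(N, size = oversample * n, prob = prb, replace = TRUE)
    # Append to the stream and keep first appearances only
    chosen <- unique(c(chosen, draws))
  }
  chosen[seq_len(n)]            # the first n distinct indices
}

# Small demo with the weights from the question
set.seed(1)
target <- sample(0:1, size = 1e5, replace = TRUE)
prb <- ifelse(target == 1, 1.0, 0.05)
ind <- sample_fast_r(1e4, prb)
table(target[ind])
```

This is likely slower than the Rcpp version on 23 million rows because unique() has to hash the whole vector of draws, but it needs no compiler.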

Answer 3

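R is built to use only one CPU core at a time. The easiest way to run code multithreaded is Microsoft R Open. I am not sure whether it will improve sampling performance, but it is worth a try. If it does not, a multicore package such as parallel or multicore might help. The catch is that multicore processing only works for certain types of operations.

I cannot say much about your code itself, because it does not include a reproducible example.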

Making sampling run faster
