Question

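I am using sample to generate a vector of data, sampling without replacement.

If the data set I generate is large enough, the vector exceeds R's memory limits.

How can I represent this data so that I can still sample without replacement, but also handle large data sets?

Generating the count vector: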

counts <- vector()
for (i in 1:1024) {
    counts <- c(counts, rep(i, times=data[i,]$readCount))
}

Sampling:

trial_fn <- function(counts) {
   replicate(num_trials, sample(counts, size=trial_size, replace=F), simplify=F)
}

trials <- trial_fn(counts)

Error: cannot allocate vector of size 32.0 Mb

Is there a sparser or compressed way to represent this that still lets me sample without replacement?

Answer 1

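If I understand correctly, your data has 1024 rows with differing readCount values. The vector you build repeats each row index as many times as that row's readCount; for example, with readCount = 1:1024 the first value appears once, the second twice, and so on.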

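You then want to sample from this vector without replacement. With example data where readCount = 1:1024, you are effectively drawing the first value with probability 1 / sum(1:1024), the second with probability 2 / sum(1:1024), and so on, and each time a value is drawn it is removed from the pool.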
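To make those probabilities concrete, here is a small sketch with three hypothetical counts (toy numbers, not the question's data). It checks that "count divided by total" matches what a single draw from the expanded vector would give, and shows how the probabilities shift after one value is removed.

```r
# Hypothetical toy counts: value 1 occurs once, value 2 twice, value 3 three times
toy_counts <- c(1, 2, 3)

# Probability that the first draw is each value: count / total
first_draw_prob <- toy_counts / sum(toy_counts)

# The same probabilities, read off the fully expanded vector
expanded <- rep(seq_along(toy_counts), times = toy_counts)
expanded_prob <- tabulate(expanded, nbins = length(toy_counts)) / length(expanded)
stopifnot(isTRUE(all.equal(first_draw_prob, expanded_prob)))

# Suppose the first draw was value 3: one copy is removed, so the
# probabilities for the second draw are computed from the reduced counts
after_counts <- toy_counts - c(0, 0, 1)
second_draw_prob <- after_counts / sum(after_counts)

print(first_draw_prob)
print(second_draw_prob)
```

Only the counts are needed to compute these probabilities, so the expanded vector never has to be stored.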

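Of course, the fastest and simplest approach is yours, but you can also do this with far less memory at a (significant) cost in speed. The idea is to pass the draw probabilities to the sample function, draw one value at a time, and manually "remove" each drawn value.

Here is an example: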

# an example of your data
data <- data.frame(readCount = 1:1024)

# custom function to sample without replacement from a compressed
# representation: values[i] is available nElementsPerValue[i] times
mySample <- function(values, size, nElementsPerValue) {
  nElementsPerValue <- as.integer(nElementsPerValue)
  if (length(values) != length(nElementsPerValue))
    stop("nElementsPerValue must have the same length as values")
  if (any(nElementsPerValue < 0))
    stop("nElementsPerValue cannot contain negative numbers")
  if (sum(nElementsPerValue) < size)
    stop("Total number of elements per value is lower than the sample size")

  # remove values having zero elements; the mask is computed once so that
  # values and nElementsPerValue stay aligned
  keep <- nElementsPerValue > 0
  values <- values[keep]
  nElementsPerValue <- nElementsPerValue[keep]

  # pre-allocate the result vector
  res <- rep.int(0.0, size)
  for (i in seq_len(size)) {
    # draw one index, weighted by the remaining count of each value
    idx <- sample.int(length(values), size = 1, prob = nElementsPerValue)
    res[i] <- values[idx]
    # remove sampled value from nElementsPerValue
    nElementsPerValue[idx] <- nElementsPerValue[idx] - 1L
    # if zero elements remain, remove it from values as well
    if (nElementsPerValue[idx] == 0) {
      values <- values[-idx]
      nElementsPerValue <- nElementsPerValue[-idx]
    }
  }
  return(res)
}

# just for reproducibility
set.seed(123)

# sample 100k values from readCount
system.time(
  a <- mySample(data$readCount, 100000, 1:1024),
  gcFirst = TRUE)

# on my machine it gives :
#   user  system elapsed 
#  10.63    0.00   10.67
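
As a possible alternative, and not part of the approach above: drawing without replacement from the expanded vector is equivalent to drawing distinct positions in 1..N (N being the total count) and mapping each position back to its value through the cumulative counts. The sketch below assumes that equivalence; `sample_by_position` is a hypothetical helper name. Whether it actually saves memory depends on how `sample.int` handles a large N in your R version, so treat it as something to measure rather than a guaranteed fix.

```r
# Hypothetical helper: sample `size` items without replacement from a
# multiset given as (values, counts), without building the expanded vector.
sample_by_position <- function(values, counts, size) {
  stopifnot(length(values) == length(counts), all(counts >= 0))
  cum <- cumsum(as.numeric(counts))  # numeric, to avoid integer overflow
  total <- cum[length(cum)]
  stopifnot(size <= total)
  # draw distinct positions in the (virtual) expanded vector
  pos <- sample.int(total, size = size, replace = FALSE)
  # position p belongs to the first value whose cumulative count is >= p
  values[findInterval(pos - 1, cum) + 1]
}

set.seed(123)
res <- sample_by_position(values = 1:1024, counts = 1:1024, size = 100000)

# sanity check: no value is drawn more often than its count allows
stopifnot(all(tabulate(res, nbins = 1024) <= 1:1024))
```

Because all positions are drawn in a single vectorised call, this avoids the per-draw loop; the trade-off is that `sample.int` may still need working memory proportional to N for moderate N.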

Representing a vector in R for sampling when it is too large for available memory
