Question

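Are there any rules of thumb for knowing when R will have trouble handling a given dataset in RAM (for a given PC configuration)?

For example, I have heard that one rule of thumb is to allow 8 bytes per cell. So if I have 1,000,000 observations of 1,000 columns, that comes close to 8 GB; hence, on most home computers, we would probably have to store the data on the hard disk and access it in chunks.

Is the above correct? Which rules of thumb for memory size and usage can we apply beforehand? By that I mean enough memory not only to load the object, but also to do some basic operations such as data tidying, data visualisation and some analysis (regression).

PS: it would be nice to explain how the rule of thumb works, so that it is not just a black box.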

Answer 1

The memory footprint of some vectors at different sizes, in bytes.

n <- c(1, 1e3, 1e6)
names(n) <- n
one_hundred_chars <- paste(rep.int(" ", 100), collapse = "")

sapply(
  n,
  function(n)
  {
    strings_of_one_hundred_chars <- replicate(
      n,
      paste(sample(letters, 100, replace = TRUE), collapse = "")
    )
    sapply(
      list(
        Integers                                 = integer(n),
        Floats                                   = numeric(n),
        Logicals                                 = logical(n),
        "Empty strings"                          = character(n),
        "Identical strings, nchar=100"           = rep.int(one_hundred_chars, n),
        "Distinct strings, nchar=100"            = strings_of_one_hundred_chars,
        "Factor of empty strings"                = factor(character(n)),
        "Factor of identical strings, nchar=100" = factor(rep.int(one_hundred_chars, n)),
        "Factor of distinct strings, nchar=100"  = factor(strings_of_one_hundred_chars),
        Raw                                      = raw(n),
        "Empty list"                             = vector("list", n)
      ),
      object.size
    )
  }
)

Some values differ between 64-bit and 32-bit R.

## Under 64-bit R
##                                          1   1000     1e+06
## Integers                                48   4040   4000040
## Floats                                  48   8040   8000040
## Logicals                                48   4040   4000040
## Empty strings                           96   8088   8000088
## Identical strings, nchar=100           216   8208   8000208
## Distinct strings, nchar=100            216 176040 176000040
## Factor of empty strings                464   4456   4000456
## Factor of identical strings, nchar=100 584   4576   4000576
## Factor of distinct strings, nchar=100  584 180400 180000400
## Raw                                     48   1040   1000040
## Empty list                              48   8040   8000040

## Under 32-bit R
##                                          1   1000     1e+06
## Integers                                32   4024   4000024
## Floats                                  32   8024   8000024
## Logicals                                32   4024   4000024
## Empty strings                           64   4056   4000056
## Identical strings, nchar=100           184   4176   4000176
## Distinct strings, nchar=100            184 156024 156000024
## Factor of empty strings                272   4264   4000264
## Factor of identical strings, nchar=100 392   4384   4000384
## Factor of distinct strings, nchar=100  392 160224 160000224
## Raw                                     32   1024   1000024
## Empty list                              32   4024   4000024

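Notice that factors have a smaller memory footprint than character vectors when there are many repetitions of the same string (but not when the strings are all unique).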
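
The factor-versus-character comparison can be checked directly at a smaller scale. This is only a minimal sketch, assuming 64-bit R; the helper `bytes()` and the sample strings are made up for illustration and are not part of the benchmark above.

```r
# Hypothetical helper: object size as a plain number of bytes.
bytes <- function(x) as.numeric(object.size(x))

n <- 1e5

# Many repetitions of one string versus n different strings.
repeated <- rep.int("the same string over and over", n)
distinct <- sprintf("string number %06d", seq_len(n))

sizes <- c(
  "character, repeated" = bytes(repeated),
  "factor, repeated"    = bytes(factor(repeated)),
  "character, distinct" = bytes(distinct),
  "factor, distinct"    = bytes(factor(distinct))
)
print(sizes)
```

A character vector holds one pointer per element (8 bytes on 64-bit R), while a factor holds a 4-byte integer code per element plus its levels. With heavy repetition the factor should come out smaller; with all-distinct strings it carries the codes on top of a full set of levels, so it should come out larger.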

Answer 2

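The rule of thumb is correct for numeric vectors. A numeric vector uses 40 bytes to store information about the vector, plus 8 bytes for each element in the vector. You can use the object.size() function to see this: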

object.size(numeric())  # an empty vector (40 bytes)  
object.size(c(1))       # 48 bytes
object.size(c(1.2, 4))  # 56 bytes
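
The per-element cost can also be measured rather than read off small examples. The sketch below compares two long vectors of the same type, so the fixed header cancels out whatever its size is in your R version; `bytes_per_element()` is a hypothetical helper name, not a base R function.

```r
# Hypothetical helper: bytes added per extra element, measured by
# comparing two long vectors created by the same constructor.
bytes_per_element <- function(make_vector, n = 1e6) {
  small <- as.numeric(object.size(make_vector(n)))
  large <- as.numeric(object.size(make_vector(2 * n)))
  (large - small) / n
}

# Apply the measurement to four base R vector constructors.
per_element <- sapply(
  list(numeric = numeric, integer = integer, logical = logical, raw = raw),
  bytes_per_element
)
print(per_element)
```

Long vectors are used on purpose: very short vectors can be rounded up to an allocation size, which hides the true per-element cost.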

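You probably won't have only numeric vectors in your analysis. Matrices grow similarly to vectors (this is to be expected, since they are just vectors with a dim attribute).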

object.size(matrix())           # Not really empty (208 bytes)
object.size(matrix(1:4, 2, 2))  # 216 bytes
object.size(matrix(1:6, 3, 2))  # 232 bytes (2 * 8 more after adding 2 elements)

Data frames are more complicated (they have more attributes than a simple vector), so they grow faster:

object.size(data.frame())                  # 560 bytes
object.size(data.frame(x = 1))             # 680 bytes
object.size(data.frame(x = 1:5, y = 1:5))  # 840 bytes

A good reference on memory is Hadley Wickham's Advanced R Programming.

All of this said, remember that in order to do analyses in R you need some cushion in memory, so that R can copy the data you are working on.
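
To make the need for a cushion concrete, here is a minimal sketch of how a second full-sized copy comes into existence. It only illustrates R's copy-on-modify behaviour; how many copies a given tidying or modelling function makes will vary.

```r
x <- numeric(1e6)   # about 8 MB of doubles
y <- x              # no data is copied yet; both names share one vector
y[1] <- 1           # modifying y makes R duplicate the vector first

x[1]                # x is unchanged by the assignment to y
y[1]

# Two full-sized vectors now exist side by side.
total_bytes <- as.numeric(object.size(x)) + as.numeric(object.size(y))
print(total_bytes)
```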

Answer 3

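I can't fully answer your question, and I strongly suspect that several factors will affect what works in practice, but if you are only looking at the amount of raw memory that a single copy of a given dataset would occupy, you can have a look at the R internals documentation.

You will see that the amount of memory required depends on the type of data being held. If you are talking about numeric data, it will typically be integer or numeric/real data. These correspond to the R internal types INTSXP and REALSXP respectively, which are described as follows: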

INTSXP

length, truelength followed by a block of C ints (which are 32 bits on all R platforms).

REALSXP

length, truelength followed by a block of C doubles

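A double is 64 bits (8 bytes) long, so your "rule of thumb" appears to be roughly correct for a dataset containing only numeric values. Similarly, with integer data each element will occupy 4 bytes.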
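
As a worked example of these per-element sizes, the sketch below computes the raw payload of one copy of a 1,000,000 x 1,000 dataset for different column types. The helpers `raw_bytes()` and `to_gib()` are hypothetical names, and the calculation deliberately ignores per-vector headers and attributes.

```r
# Hypothetical helper: raw payload of one copy of a rectangular dataset,
# counting 8 bytes per double cell and 4 bytes per integer cell.
raw_bytes <- function(n_rows, n_double_cols = 0, n_integer_cols = 0) {
  n_rows * (8 * n_double_cols + 4 * n_integer_cols)
}

# Convert bytes to GiB (1 GiB = 1024^3 bytes).
to_gib <- function(bytes) bytes / 1024^3

all_double  <- raw_bytes(1e6, n_double_cols = 1000)
all_integer <- raw_bytes(1e6, n_integer_cols = 1000)
mixed       <- raw_bytes(1e6, n_double_cols = 500, n_integer_cols = 500)

print(round(to_gib(c(all_double = all_double,
                     all_integer = all_integer,
                     mixed = mixed)), 2))
```

Storing whole-number columns as integer rather than double halves their footprint, which is worth checking before concluding that a dataset will not fit.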

Answer 4

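Trying to sum up the answers; please correct me if I am wrong.

If we do not want to underestimate the memory needed, and we want a safe estimate in the sense that it will almost surely overestimate, it seems we can allow 40 bytes per column plus 8 bytes per cell, and then multiply the result by a "cushion factor" (which seems to be around 3) to allow for the data being copied while tidying, plotting and analysing.

In a function: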

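howMuchRAM <- function(ncol, nrow, cushion = 3) {
  # 40 bytes of header per column
  colBytes <- ncol * 40

  # 8 bytes per cell (assumes every column is a double vector)
  cellBytes <- ncol * nrow * 8

  # Estimated size of one copy of the dataset, in bytes
  # (not named object.size, to avoid masking the base function)
  objectBytes <- colBytes + cellBytes

  # RAM needed once the cushion for copies is included
  RAM <- objectBytes * cushion

  cat("Your dataset will have up to", format(objectBytes / 1024^2, digits = 1),
      "MB and you will probably need", format(RAM / 1024^3, digits = 1),
      "GB of RAM to deal with it.\n")

  # Return the numbers invisibly so they can be reused
  invisible(list(object.size = objectBytes, RAM = RAM, ncol = ncol, nrow = nrow, cushion = cushion))
}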

So, in the case of a 1,000,000 x 1,000 data frame:

howMuchRAM(ncol=1000,nrow=1000000)

Your dataset will have up to 7629 MB and you will probably need 22 GB of RAM to deal with it.

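But as we can see in the answers, object sizes vary by type, and vectors that are not made up of unique cells take up less space, so this estimate appears to be quite conservative.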
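
The estimate can be sanity-checked against object.size() on a data frame small enough to build. This is a rough sketch: `estimate_bytes()` is a hypothetical restatement of the formula above without the cushion, and the 10,000 x 10 size is an arbitrary choice.

```r
# Hypothetical restatement of the estimate above, without the cushion.
estimate_bytes <- function(n_cols, n_rows) n_cols * 40 + n_cols * n_rows * 8

n_rows <- 1e4
n_cols <- 10

# One data frame of doubles and one of integers, same shape.
doubles  <- as.data.frame(matrix(0,  nrow = n_rows, ncol = n_cols))
integers <- as.data.frame(matrix(0L, nrow = n_rows, ncol = n_cols))

# Ratio of the measured size to the estimated size.
estimate <- estimate_bytes(n_cols, n_rows)
ratios <- c(
  doubles  = as.numeric(object.size(doubles))  / estimate,
  integers = as.numeric(object.size(integers)) / estimate
)
print(round(ratios, 2))
```

One would expect a ratio close to 1 for the double columns and close to 0.5 for the integer columns. The data frame's own attributes add a little on top of the per-column figure, so the safety margin really comes from the cushion factor and from columns that are cheaper than doubles.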

Rule of thumb for the memory size of datasets in R
