Question

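I'm creating a document-term matrix (dtm for short) for a Naive Bayes implementation. (I know there is a function for this, but I have to write it myself for homework.) I wrote a function that successfully creates the dtm; the problem is that the resulting matrix takes up too much memory. For example, a 100 x 32000 matrix (of 0s and 1s) is 24 MB in size! This causes R to crash when I try to work with the full 10k documents. The functions follow, and a toy example is in the last 3 lines. Can anyone spot why the "sparser" function in particular would return such a memory-intensive result?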

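# Build the dictionary: every distinct word in docs that is not a stop word.
# Note: stopWords is not defined in this snippet; it must already exist as a character vector.
listAllWords <- function(docs)
{
  str1 <- strsplit(x=docs, split="\\s", fixed=FALSE)
  dictDupl <- unlist(str1)[!(unlist(str1) %in% stopWords)]
  dictionary <- unique(dictDupl)
}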

#function to create the sparse matrix of words as they appear in each article segment
sparser <- function (docs, dictionary) 
{
  num.docs <- length(docs) #dtm rows
  num.words <- length(dictionary) #dtm columns
  dtm <- mat.or.vec(num.docs,num.words) # Instantiate dtm of zeroes
  for (i in 1:num.docs)
  {
    doc.temp <- unlist(strsplit(x=docs[i], split="\\s", fixed=FALSE)) #vectorize words
    num.words.doc <- length(doc.temp)
    for (j in 1:num.words.doc)
    {
      ind <- which(dictionary == doc.temp[j]) #loop over words and find index in dict.
      dtm[i,ind] <- 1 #indicate this word is in this document
    }
  }
  return(dtm)
}


docs <- c("the first document contains words", "the second document is also made of words", "the third document is words and a number 4")
dictionary <- listAllWords(docs)
dtm <- sparser(docs,dictionary)

If it makes any difference, I'm running this in RStudio on Mac OS X, 64-bit.

Answer 1

Part of your problem is certainly that you are not actually storing integers but doubles. Note:

m <- mat.or.vec(100,32000)
m1 <- matrix(0L,100,32000)

> object.size(m)
25600200 bytes
> object.size(m1)
12800200 bytes

Notice the missing "L" in the code for mat.or.vec:

> mat.or.vec
function (nr, nc) 
if (nc == 1L) numeric(nr) else matrix(0, nr, nc)
<bytecode: 0x1089984d8>
<environment: namespace:base>
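
The two sizes reported above follow from the element width: a double takes 8 bytes and an integer takes 4. The following is a minimal sketch of that arithmetic; the variable names are illustrative, and the small fixed overhead of the matrix object is computed rather than assumed, since it can differ between R versions.

```r
# Expected payload size = rows * cols * bytes per element.
n.rows <- 100
n.cols <- 32000

bytes.double  <- n.rows * n.cols * 8   # doubles are 8 bytes each
bytes.integer <- n.rows * n.cols * 4   # integers are 4 bytes each

m.double  <- matrix(0,  n.rows, n.cols)
m.integer <- matrix(0L, n.rows, n.cols)

# Whatever object.size() reports beyond the payload is the object's
# fixed header; its exact size depends on the R version.
overhead.double  <- as.numeric(object.size(m.double))  - bytes.double
overhead.integer <- as.numeric(object.size(m.integer)) - bytes.integer

print(c(bytes.double, bytes.integer))
print(c(overhead.double, overhead.integer))
```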

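You will also want to assign 1L explicitly; otherwise, I think, R converts the whole matrix to doubles on the first assignment. You can verify this by assigning the value 1 to a single element of m1 above and rechecking the object size.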
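
A small sketch of that check, using a tiny matrix so it runs instantly. The names m, m2 and the type.* variables are illustrative only.

```r
# Start from an integer matrix.
m <- matrix(0L, 3, 4)
type.before <- typeof(m)      # integer storage

# Assigning the double constant 1 coerces the whole matrix to double.
m[1, 1] <- 1
type.after.double <- typeof(m)

# Assigning the integer constant 1L keeps integer storage.
m2 <- matrix(0L, 3, 4)
m2[1, 1] <- 1L
type.after.integer <- typeof(m2)

print(c(type.before, type.after.double, type.after.integer))
```

On the 100 x 32000 matrix from above, the same single assignment of 1 would double the memory footprint.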

I should probably also mention the function storage.mode, which can help you verify that you are using integers.
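
Putting these points together, here is one possible sketch of the question's function with integer storage throughout. The name sparser.int is hypothetical, the use of match() in place of the inner loop is a substitution and not part of the original code, and stop-word filtering is left out to keep the example self-contained.

```r
# Hypothetical integer-storage variant of sparser(); not the asker's code.
sparser.int <- function(docs, dictionary) {
  # 0L (not 0) makes this an integer matrix: 4 bytes per cell instead of 8.
  dtm <- matrix(0L, nrow = length(docs), ncol = length(dictionary))
  for (i in seq_along(docs)) {
    words <- unlist(strsplit(docs[i], split = "\\s"))
    ind <- match(words, dictionary)  # NA for words not in the dictionary
    ind <- ind[!is.na(ind)]
    dtm[i, ind] <- 1L                # 1L keeps the matrix integer
  }
  dtm
}

docs <- c("the first document contains words",
          "the second document is also made of words")
dictionary <- unique(unlist(strsplit(docs, split = "\\s")))
dtm <- sparser.int(docs, dictionary)

# Verify that no assignment silently coerced the matrix to double.
print(storage.mode(dtm))
print(dim(dtm))
```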

Answer 2

If you want to store 0/1 values economically, I'd suggest the raw type.

m8 <- matrix(0,100,32000)
m4 <- matrix(0L,100,32000)
m1 <- matrix(raw(1),100,32000)

The raw type takes only 1 byte per value:

> object.size(m8)
25600200 bytes
> object.size(m4)
12800200 bytes
> object.size(m1)
3200200 bytes

Here is how to use it:

> m1[2,2] = as.raw(1)
> m1[2,2]
[1] 01
> as.integer(m1[2,2])
[1] 1
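
Arithmetic is not defined for raw values, so counting with a raw matrix needs a conversion step. The sketch below shows one way to get per-column counts by comparing against as.raw(0); the names m, present and doc.freq are illustrative. Note that the intermediate logical matrix takes 4 bytes per element, so on a large dtm it may be preferable to do this in chunks of rows or columns.

```r
# 3 x 4 raw matrix, all cells 00.
m <- matrix(raw(1), 3, 4)

# Mark a few cells as "word present".
m[1, 2] <- as.raw(1)
m[3, 2] <- as.raw(1)
m[2, 4] <- as.raw(1)

# Comparison works on raw and keeps the matrix dimensions,
# giving a logical matrix that colSums() can add up.
present  <- m != as.raw(0)
doc.freq <- colSums(present)  # number of rows with a 1 in each column

print(typeof(m))
print(doc.freq)
```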

Answer 3

If you really want to be economical, take a look at the ff and bit packages.

Numeric matrix takes up far more memory than it should in R
