Question

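I have a data frame (myDF) with 700,000+ rows, each with two columns, id and text. The text column holds 140-character texts (tweets), and I want to run a sentiment analysis on them that I found online. However, whatever I try, I run into memory problems on a MacBook with 4 GB of RAM.

I was thinking that maybe I could loop through the rows, e.g. do the first 10, then the next 10, and so on. (I run into problems even with batches of 100.) Would that solve the problem? What is the best way to loop like this?

I am posting my code here: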

library(plyr)
library(stringr)

# function score.sentiment
score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
{
   # Parameters
   # sentences: vector of text to score
   # pos.words: vector of words of positive sentiment
   # neg.words: vector of words of negative sentiment
   # .progress: passed to laply() to control the progress bar

   # create simple array of scores with laply
   scores = laply(sentences,
   function(sentence, pos.words, neg.words)
   {

      # split sentence into words with str_split (stringr package)
      word.list = str_split(sentence, "\\s+")
      words = unlist(word.list)

      # compare words to the dictionaries of positive & negative terms
      pos.matches = match(words, pos.words)
      neg.matches = match(words, neg.words)

      # get the position of the matched term or NA
      # we just want a TRUE/FALSE
      pos.matches = !is.na(pos.matches)
      neg.matches = !is.na(neg.matches)

      # final score
      score = sum(pos.matches) - sum(neg.matches)
      return(score)
      }, pos.words, neg.words, .progress=.progress )

   # data frame with scores for each sentence
   scores.df = data.frame(text=sentences, score=scores)
   return(scores.df)
}

# import positive and negative words
pos = readLines("positive_words.txt")
neg = readLines("negative_words.txt")

# apply function score.sentiment


myDF$scores = score.sentiment(myDF$text, pos, neg, .progress='text')

Answer 1

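4 GB sounds like enough memory for 700,000 sentences of 140 characters each. A different way of calculating your sentiment score might be more memory- and time-efficient and/or easier to break into chunks. Instead of processing each sentence separately, split the entire set of sentences into words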

words <- str_split(sentences, "\\s+")

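then work out how many words are in each sentence, turn those counts into the position of each sentence's last word, and create a single vector of words

len <- cumsum(sapply(words, length))  # index of the last word of each sentence
words <- unlist(words, use.names=FALSE)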

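By re-using the words variable, I release the memory it used before so that it can be reused (there is no need to call the garbage collector explicitly, contrary to the advice from @cryo111!). You can find out whether each word is in pos.words, without worrying about NAs, with words %in% pos.words. But we can be a little clever: calculate the cumulative sum of this logical vector, then subset the cumulative sum at the last word of each sentence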

cumsum(words %in% pos.words)[len]

and calculate the number of matching words per sentence as the difference (the discrete derivative) of this

pos.match <- diff(c(0, cumsum(words %in% pos.words)[len]))

This is the pos.match part of your score. So

scores <- diff(c(0, cumsum(words %in% pos.words)[len])) - 
          diff(c(0, cumsum(words %in% neg.words)[len]))

That's it.
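To make the cumulative-sum trick concrete, here is a small worked sketch on three made-up sentences. The data, the word lists, and the name `ends` are illustrative assumptions, and base R's `strsplit` stands in for `str_split` so that the snippet runs without extra packages. Note that the subset index must be the cumulative end position of each sentence in the flattened word vector.

```r
# Toy data (made up for illustration only)
sentences <- c("good great bad", "awful day", "nice nice nice")
pos.words <- c("good", "great", "nice")
neg.words <- c("bad", "awful")

words <- strsplit(sentences, "\\s+")       # base-R stand-in for str_split
ends  <- cumsum(sapply(words, length))     # position of each sentence's last word: 3 5 8
words <- unlist(words, use.names = FALSE)  # one flat vector of all 8 words

# Running count of positive hits, read off at each sentence end,
# then differenced to get the per-sentence count
pos.match <- diff(c(0, cumsum(words %in% pos.words)[ends]))  # 2 0 3
neg.match <- diff(c(0, cumsum(words %in% neg.words)[ends]))  # 1 1 0

scores <- pos.match - neg.match            # 1 -1 3
print(scores)
```

One caveat for this sketch: `strsplit` returns a zero-length vector for an empty string, so empty tweets would need to be handled (or filtered out) before indexing this way.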

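score_sentiment <-
    function(sentences, pos.words, neg.words)
{
    words <- str_split(sentences, "\\s+")
    ## index of the last word of each sentence in the flattened vector
    len <- cumsum(sapply(words, length))
    words <- unlist(words, use.names=FALSE)
    ## per-sentence positive count minus per-sentence negative count
    diff(c(0, cumsum(words %in% pos.words)[len])) - 
      diff(c(0, cumsum(words %in% neg.words)[len]))
}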

The intention here is to process all of your sentences in one pass

myDF$scores <- score_sentiment(myDF$text, pos, neg)

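This avoids for loops, which, while not inherently inefficient compared to lapply and friends when implemented correctly, are very inefficient compared to vectorized solutions, as @joran indicates. Probably sentences does not get copied here, and returning (just) the scores does not waste memory on returning information we already know (the sentences). The largest objects in memory will be sentences and words.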

If memory is still a problem, then I would create an index that can be used to split the text into smaller groups, and calculate the scores for each group

nGroups <- 10 ## i.e., about 70k sentences / group
idx <- seq_along(myDF$text)
grp <- split(idx, cut(idx, nGroups, labels=FALSE))
scorel <- lapply(grp, function(i) score_sentiment(myDF$text[i], pos, neg))
myDF$scores <- unlist(scorel, use.names=FALSE)

First make sure that myDF$text is actually a character vector (not a factor), e.g., myDF$text <- as.character(myDF$text)

Answer 2

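I think it is difficult to give a definitive answer to your question, but here are some pointers. What helped me was calling the garbage collector gc() frequently and removing objects that are no longer needed from memory with rm(obj_name). You could also consider moving your data into a database such as MySQL. This is fairly easy if you export the data frame as csv and use LOAD DATA INFILE .... It should then be possible to loop over chunks larger than 100 rows (the RODBC package is a nice tool for accessing SQL databases from R). Another option is to keep the data in an external file and read it in chunks, but I don't know how to do that efficiently in R. It is also useful to keep an eye on the resource monitor (Task Manager - Performance - Resource Monitor - Memory).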
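As a rough illustration of the "external file, read in chunks" idea, here is a minimal base-R sketch that keeps a file connection open and reads a fixed number of lines per iteration. The temporary file, the chunk size, and the placeholder scorer `score_chunk` are all assumptions made for the example; whether this is efficient enough for 700k tweets has not been measured here.

```r
# Hypothetical file standing in for an exported copy of the tweets (one per line)
path <- tempfile(fileext = ".txt")
writeLines(sprintf("tweet number %d is good", 1:25), path)

# Placeholder scorer (an assumption): counts the word "good" in each line
score_chunk <- function(lines) {
  sapply(strsplit(lines, "\\s+"), function(w) sum(w == "good"))
}

con <- file(path, open = "r")   # an open connection remembers its read position
chunk_size <- 10
scores <- list()
repeat {
  lines <- readLines(con, n = chunk_size)   # next chunk_size lines, or fewer at EOF
  if (length(lines) == 0) break             # nothing left to read
  scores[[length(scores) + 1]] <- score_chunk(lines)
}
close(con)

scores <- unlist(scores)   # one score per line, in file order
```

Only one chunk of text is held in memory at a time; the accumulated scores are small by comparison.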

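BTW: as far as I know, a single tweet can be up to 560 bytes long (maximum). 700k entries therefore amount to roughly 400 MB of data. Although that is a considerable amount of data, it should be no problem with 4 GB of RAM. Do you have other data in memory? Do you have other programs running?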
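The estimate above can be checked with a couple of lines of R. This is only a sketch: the 560-byte figure is the upper bound quoted above, and the second part measures a random sample of 140-character ASCII strings (an assumption; real tweets will differ) and scales it up.

```r
n_tweets  <- 700000
max_bytes <- 560                          # upper bound per tweet quoted above
upper_mb  <- n_tweets * max_bytes / 1e6   # 392, i.e. roughly 400 MB

# Measure a sample of 1000 random 140-character strings and extrapolate
set.seed(1)
sample_txt <- replicate(
  1000, paste(sample(letters, 140, replace = TRUE), collapse = "")
)
est_mb <- as.numeric(object.size(sample_txt)) / 1000 * n_tweets / 1e6

print(upper_mb)
print(est_mb)
```

Calling `object.size(myDF)` on the real data frame gives the actual figure for your data.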

Answer 3

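If I understand correctly, you want to apply a function to sets of ten rows using a loop. Here is a generic way to do that. I first use split to create a list of sets of ten rows. The rows are not in order, but that should not matter, since you can reorder them at the end if you want. You can then apply your function in a loop and use rbind to append the results to the "out" object.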

x <-matrix(1:100,ncol=1)
parts.start <-split(1:100,1:10) #creates list: divide in 10 sets of 10 lines

out <-NULL
for (i in seq_along(parts.start)){
  res <- x[parts.start[[i]],,drop=FALSE]*2 #your function applied to elements of the list.
  out <-rbind(out,res)
}
head(out)

     [,1]
[1,]    2
[2,]   22
[3,]   42
[4,]   62
[5,]   82
[6,]  102
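If the original row order matters, a variant of the same idea is to split the row indices into contiguous blocks, so no reordering is needed afterwards. The sketch below is an alternative to the loop above, not a replacement the original relies on; the names `idx`, `parts` and `res` are illustrative, and multiplying by 2 again stands in for your own function.

```r
x <- matrix(1:100, ncol = 1)

idx   <- seq_len(nrow(x))
parts <- split(idx, ceiling(idx / 10))   # rows 1-10, 11-20, ..., 91-100

# Apply the function to each block, then bind all results once at the end
res <- lapply(parts, function(i) x[i, , drop = FALSE] * 2)
out <- do.call(rbind, res)

head(out)
```

Binding once with do.call(rbind, ...) also avoids growing "out" inside the loop, which copies the accumulated result on every iteration.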

R: running out of memory, how do I loop over rows?
