Question

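I have to do some text classification work on a corpus of 800K texts. I have been trying to run a practical example that I found at the following link: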

http://garonfolo.dk/herbert/2015/05/r-text-classification-using-a-k-nearest-neighbour-model/

Everything went well until I reached the following instruction:

# Transform dtm to matrix to data frame - df is easier to work with
mat.df <- as.data.frame(data.matrix(dtm), stringsAsfactors = FALSE)

After running for a couple of hours, I got this error message:

Error: cannot allocate vector of size 583.9 Gb
In addition: Warning messages:
1: In vector(typeof(x$v), prod(nr, nc)) :
  Reached total allocation of 8076Mb: see help(memory.size)
2: In vector(typeof(x$v), prod(nr, nc)) :
  Reached total allocation of 8076Mb: see help(memory.size)
3: In vector(typeof(x$v), prod(nr, nc)) :
  Reached total allocation of 8076Mb: see help(memory.size)
4: In vector(typeof(x$v), prod(nr, nc)) :
  Reached total allocation of 8076Mb: see help(memory.size)

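Is there a way to overcome this error?

For example, would it be possible to split data.matrix(dtm), run the job in chunks, and then merge the results somehow? Or would it be better to approach this another way, or to do it in Python?

Thanks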

Answer 1

Before the as.data.frame() call, add the following line of code:

dtm <- removeSparseTerms(dtm, sparse=0.9)

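The sparse argument is a number strictly between 0 and 1. It is the maximum sparsity a term may have and still be kept: a term is dropped if it is absent from more than that fraction of the documents. With sparse=0.9 as above, only terms that occur in more than about 10% of the documents survive. You usually find the right value by trial and error, and in your case you may end up with an odd number such as 0.79333. It depends on what you want to do.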
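A minimal sketch of the effect of sparse on a toy corpus, assuming the tm package is installed. The four example texts and the value 0.6 are made up for illustration and are not tuned for the 800K-document case:

```r
library(tm)

# Hypothetical toy corpus: "text" occurs in all four documents,
# "with" in two of them, every other word in exactly one.
texts <- c("text mining with tools",
           "text classification with knn",
           "sparse text matrix",
           "text data")
dtm <- DocumentTermMatrix(VCorpus(VectorSource(texts)))

# Keep only terms found in more than (1 - 0.6) * 4 = 1.6 documents,
# i.e. in at least two of the four documents.
small <- removeSparseTerms(dtm, sparse = 0.6)

dim(dtm)      # documents x terms before
dim(small)    # same number of documents, fewer terms
Terms(small)  # the surviving terms

# The reduced matrix is now cheap to convert to a data frame.
mat.df <- as.data.frame(as.matrix(small), stringsAsFactors = FALSE)
```

Lowering sparse removes more terms. Comparing dim(small) for a few candidate values is a quick way to do the trial and error described above.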

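removeSparseTerms() removes terms but leaves the number of documents in the smaller resulting matrix unchanged. So you go from a 12165735 * 800000 element matrix to, for example, a 476 * 800000 matrix. Your computer may now be able to handle that.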
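To check whether a reduced matrix is likely to fit in memory before converting it, a rough estimate helps. The sketch below assumes 8 bytes per cell (R doubles; integer counts would need 4) and ignores data frame overhead. dense_gib is a hypothetical helper name, not part of tm:

```r
# Rough size of a dense matrix in GiB.
# bytes_per_cell = 8 assumes the cells are stored as doubles.
dense_gib <- function(n_rows, n_cols, bytes_per_cell = 8) {
  # as.numeric() avoids integer overflow for very large dimensions
  as.numeric(n_rows) * as.numeric(n_cols) * bytes_per_cell / 1024^3
}

reduced <- dense_gib(800000, 476)  # about 2.84 GiB
ram_gib <- 8076 / 1024             # the 8076Mb limit from the warnings
reduced < ram_gib                  # TRUE: the reduced matrix fits on paper
```

For a real matrix, dense_gib(nrow(dtm), ncol(dtm)) gives the same estimate. Intermediate copies made during conversion can need a multiple of it.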

If that is still not enough, try a clever column-wise split-apply-combine trick on your big matrix.
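One possible shape of such a column-wise split-apply-combine, sketched on a toy corpus and assuming the tm package. The chunk size of 3 and the per-chunk step (column sums) are placeholders; the point is only that a single block of columns is dense at any one time:

```r
library(tm)

# Hypothetical toy corpus standing in for the real 800K documents.
texts <- c("text mining with tools",
           "text classification with knn",
           "sparse text matrix",
           "text data")
dtm <- DocumentTermMatrix(VCorpus(VectorSource(texts)))

# Split: group the column (term) indices into blocks of chunk_size.
chunk_size <- 3
col_idx <- seq_len(nTerms(dtm))
col_chunks <- unname(split(col_idx, ceiling(col_idx / chunk_size)))

# Apply: densify one block at a time and compute on it.
per_chunk <- lapply(col_chunks, function(cols) {
  block <- as.matrix(dtm[, cols])  # only this block is dense
  colSums(block)                   # placeholder for the real work
})

# Combine: stitch the per-block results back together.
term_totals <- unlist(per_chunk)
term_totals
```

This only helps when the per-chunk result is much smaller than the chunk itself. Binding all dense blocks back together with cbind() would recreate the original memory problem.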
