Question

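I have a huge corpus, and I am only interested in the occurrences of a handful of terms that I know up front. Is there a way to create a term-document matrix from the corpus with the tm package that uses only the terms I specify in advance?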

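I know I can subset the TermDocumentMatrix built from the corpus, but because of memory constraints I want to avoid building the full term-document matrix in the first place.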

Answer 1

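You can modify the corpus so that it keeps only the terms you want by building a custom transformation function. For details, see the vignette for the tm package and the help page for the content_transformer function: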

library(tm)

# Create a corpus from the text in `doc` (defined at the end of this answer; run that assignment first)
corp = VCorpus(VectorSource(doc))

# Custom function to keep only the terms in "pattern" and remove everything else
(f <- content_transformer(function(x, pattern) 
  regmatches(x, gregexpr(pattern, x, perl=TRUE, ignore.case=TRUE))))

(FYI, the second line of code above is adapted from this SO answer.)

# The pattern we'll search for
keep = "sleep|dream|die"

# Run the transformation function using the pattern above
tm_map(corp, f, keep)[[1]]

Here is the result of running the transformation function:

<<PlainTextDocument (metadata: 7)>>
  c("die", "sleep", "sleep", "die", "sleep", "sleep", "Dream")

And here is the original text I used to create the corpus:

doc = "To be, or not to be, that is the question—
Whether 'tis Nobler in the mind to suffer
The Slings and Arrows of outrageous Fortune,
Or to take Arms against a Sea of troubles,
And by opposing, end them? To die, to sleep—
No more; and by a sleep, to say we end
The Heart-ache, and the thousand Natural shocks
That Flesh is heir to? 'Tis a consummation
Devoutly to be wished. To die, to sleep,
To sleep, perchance to Dream; Aye, there's the rub"
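
As a complement to the transformation above, here is a minimal sketch of a different technique: the dictionary entry of the control list that TermDocumentMatrix passes on to termFreq. As far as I understand the tm documentation, only the terms listed in the dictionary are tabulated, so the other terms never become rows of the matrix. The two short texts and the names texts and keep_terms are hypothetical and are there only to keep the example self-contained.

```r
library(tm)

# Hypothetical two-document corpus, for illustration only
texts <- c("to die to sleep", "to sleep perchance to dream")
corp <- VCorpus(VectorSource(texts))

# Terms fixed in advance; only these are tabulated
keep_terms <- c("sleep", "dream", "die")

# Restrict the matrix to the pre-specified terms via the dictionary option
tdm <- TermDocumentMatrix(corp, control = list(dictionary = keep_terms))

# Rows are the dictionary terms, columns are the documents
inspect(tdm)
```

With real text you will probably also want to add removePunctuation = TRUE to the control list, since a token such as "die," would otherwise not match the dictionary entry "die".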

Answer 2

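Another way to filter the corpus: first assign a value to one of the metadata fields, for example language. Loop over the elements of the corpus with a variable i, check for whatever you want, and then filter on that metadata attribute.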

corpusz[[i]]$meta["language"] <- 'tur'

idx <- meta(corpusz, "language") ==  'tur'
filtered <- corpusz[idx]

Now filtered contains only the corpus elements we want.
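
The loop itself is not shown above, so here is a minimal sketch of how the whole procedure might look. The three texts and the check (whether a document mentions "sleep") are hypothetical, and reusing the language field with the value "tur" simply mirrors the snippet above; any other tag would work the same way.

```r
library(tm)

# Hypothetical three-document corpus, for illustration only
texts <- c("to die to sleep",
           "the slings and arrows",
           "to sleep perchance to dream")
corpusz <- VCorpus(VectorSource(texts))

# Tag every document that passes the check (here: it mentions "sleep")
for (i in seq_len(length(corpusz))) {
  if (grepl("sleep", content(corpusz[[i]]), fixed = TRUE)) {
    meta(corpusz[[i]], "language") <- "tur"
  }
}

# Collect the per-document tag and keep only the tagged documents
tags <- vapply(corpusz, function(d) meta(d, "language"), character(1))
idx <- unname(tags == "tur")
filtered <- corpusz[idx]
length(filtered)
```

Note that this filters whole documents, not terms, so it reduces the size of the term-document matrix only indirectly, by shrinking the corpus it is built from.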

How to create a TermDocumentMatrix in tm using only a selected subset of the corpus terms
