Question

Using the text mining package tm for R, the following works with tm version 0.6.2 and R version 3.4.3:

library(tm)
a = "This is the first document."
b = "This is the second document."
c = "This is the third document."
d = "This is the fourth document."
docs1 = VectorSource(c(a,b))
docs2 = VectorSource(c(c,d))
corpus1 = Corpus(docs1)
corpus2 = Corpus(docs2)
corpus3 = c(corpus1,corpus2)
inspect(corpus3)
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 4

However, the same code under tm version 0.7.3 (R version 3.4.2) gives an error:

Error in UseMethod("inspect", x) :
  no applicable method for 'inspect' applied to an object of class "list"

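According to vignette("tm", package = "tm"), the c() function is overloaded:

Many standard operators and functions ([, [<-, [[, [[<-, c(), lapply()) are available for corpora with semantics similar to standard R routines. E.g., c() concatenates two (or more) corpora. Applied to several text documents it returns a corpus. The metadata is automatically updated, if corpora are concatenated (i.e., merged).

For the new version, however, this is evidently no longer the case. How can two corpora be merged in tm 0.7.3? An obvious solution would be to combine the documents first and create the corpus afterwards, but I am looking for a way to combine two corpora that already exist.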

Answer 1

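I don't have much experience with the `tm` package, so my answer may lack some nuance in understanding `VCorpus` vs. `SimpleCorpus` vs. other `tm` object classes.

The input to your `c` call is of class `SimpleCorpus`; it doesn't look like `tm` exports a `c` method specifically for this class. So method dispatch isn't calling the proper `c` to combine the corpora the way you want, and the result falls back to a plain `list`. However, there is a `c` method for the `VCorpus` class (`tm:::c.VCorpus`).

There are two different ways to get around the issue of `corpus3` being coerced to a `list`, but they appear to result in different structures. I present both below and leave it to you to decide whether they accomplish your end goal.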
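To see the dispatch problem for yourself, a quick check along these lines may help. This is only a diagnostic sketch, assuming tm 0.7.x is installed; the example strings are taken from the question, and the exact classes and methods printed may differ between tm versions.

```r
library(tm)

# Two small corpora, built the same way as in the question.
corpus1 <- Corpus(VectorSource(c("This is the first document.",
                                 "This is the second document.")))
corpus2 <- Corpus(VectorSource(c("This is the third document.",
                                 "This is the fourth document.")))

# Which classes do the corpora carry? S3 dispatch of c() depends on this.
print(class(corpus1))

# Which c() methods are registered? Check whether one matches the class above.
print(methods("c"))

# What does plain c() actually return for these inputs?
merged <- c(corpus1, corpus2)
print(class(merged))
```

If the class printed in the last step is `list`, no corpus-specific `c` method was found for the inputs, which matches the error in the question.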

1) You can call `tm:::c.VCorpus` directly when defining `corpus3`:

> library(tm)
>
> a = "This is the first document."
> b = "This is the second document."
> c = "This is the third document."
> d = "This is the fourth document."
> docs1 = VectorSource(c(a,b))
> docs2 = VectorSource(c(c,d))
> corpus1 = Corpus(docs1)
> corpus2 = Corpus(docs2)
>
> corpus3 = tm:::c.VCorpus(corpus1,corpus2)
>
> inspect(corpus3)
<<VCorpus>>
Metadata:  corpus specific: 2, document level (indexed): 0
Content:  documents: 4

[1] This is the first document.  This is the second document. This is the third document.
[4] This is the fourth document.

2) You can use `VCorpus` when defining `corpus1` & `corpus2`.
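The code for the second option did not survive in this copy of the answer, so what follows is a reconstruction rather than the answerer's original listing. It is a minimal sketch, assuming tm 0.7.x, in which both corpora are built with `VCorpus()` so that the ordinary `c()` call dispatches to the `VCorpus` method. The variable names `doc_a` to `doc_d` are illustrative.

```r
library(tm)

# Example documents (names doc_a ... doc_d are illustrative).
doc_a <- "This is the first document."
doc_b <- "This is the second document."
doc_c <- "This is the third document."
doc_d <- "This is the fourth document."

# Build both corpora explicitly as VCorpus objects instead of via Corpus().
corpus1 <- VCorpus(VectorSource(c(doc_a, doc_b)))
corpus2 <- VCorpus(VectorSource(c(doc_c, doc_d)))

# Plain c() now finds the VCorpus method and returns a corpus, not a list.
corpus3 <- c(corpus1, corpus2)

print(class(corpus3))
print(length(corpus3))
inspect(corpus3)
```

Compared with option 1, this avoids reaching into the package namespace with `:::`, at the cost of changing how the existing corpora are created.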
