Question

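I am using the tm package to apply stemming, and I need to convert the resulting data into a data frame. A solution can be found in "R tm package vcorpus: Error in converting corpus to data frame", but in my case the content of the corpus is shown as: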

[[2195]]
i was very impress

instead of

[[2195]]
"i was very impress"

So if I apply

data.frame(text=unlist(sapply(mycorpus, `[`, "content")), stringsAsFactors=FALSE)

the result is

<NA>.

Any help is much appreciated!

The following code serves as an example:

library(tm)  # stemDocument() also needs the SnowballC package installed

sentence <- c("a small thread was loose on the sandals, otherwise it looked good")
mycorpus <- Corpus(VectorSource(sentence))
mycorpus <- tm_map(mycorpus, stemDocument, language = "english")

inspect(mycorpus)

[[1]]
a small thread was loo on the sandals, otherwi it look good

data.frame(text=unlist(sapply(mycorpus, `[`, "content")), stringsAsFactors=FALSE)

 text
1 <NA>

Answer 1

By applying

gsub("http\\w+", "", mycorpus)

the output has class = character, so it works in my case.

Answer 2

I cannot reproduce the problem with tm_0.6 in R 3.1.0 on a Mac:

> data.frame(text=unlist(sapply(mycorpus, `[`, "content")), stringsAsFactors=FALSE)
                                                                 text
content a small thread was loos on the sandals, otherwis it look good

If I were getting those undesired results, I would immediately try:

 data.frame(text=unlist(sapply(mycorpus, `[[`, "content")), stringsAsFactors=FALSE)

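...reasoning that, because 'content' is a list element name, [['content']] should be able to extract it from each document in turn. It also seems to me that this approach may not need the unlist: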

> data.frame(text=sapply(mycorpus, `[[`, "content"), stringsAsFactors=FALSE)
                                                           text
1 a small thread was loos on the sandals, otherwis it look good
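The difference between `[` and `[[` that this answer relies on can be seen in base R alone. The following is a minimal sketch, not tm's actual document class: `doc` is a hypothetical stand-in (a plain list with a "content" element), and `plain` stands for the case in the question, where each document appears to be a bare character string.

```r
# Hypothetical stand-in for a tm document: a list with a "content" element.
doc  <- list(content = "i was very impress", meta = list(id = "1"))
docs <- list(doc, doc)

# `[` keeps the list wrapper, `[[` returns the element itself.
wrapped   <- doc["content"]     # a list of length 1
extracted <- doc[["content"]]   # a character string

# Applied over all documents, `[[` yields a character vector directly,
# so no unlist() is needed before building the data frame.
texts   <- sapply(docs, `[[`, "content")
text_df <- data.frame(text = texts, stringsAsFactors = FALSE)

# If a document is a bare character string (assumed to be the asker's case),
# it has no element named "content", so name-based `[` indexing gives NA.
plain   <- list("i was very impress")
by_name <- sapply(plain, `[`, "content")
```

Under these assumptions, `by_name` is NA, which is consistent with the `<NA>` row reported in the question.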

Converting a corpus to a data.frame in R
