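I am using the tm package to apply stemming, and I need to convert the resulting data into a data frame. A solution can be found in "R tm package vcorpus: Error in converting corpus to data frame", but in my case the content of the corpus is shown as: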
[[2195]]
i was very impress
instead of
[[2195]]
"i was very impress"
So if I run
data.frame(text=unlist(sapply(mycorpus, `[`, "content")), stringsAsFactors=FALSE)
the result is
<NA>.
Any help is much appreciated!
Take the following code as an example:
sentence <- c("a small thread was loose on the sandals, otherwise it looked good")
mycorpus <- Corpus(VectorSource(sentence))
mycorpus <- tm_map(mycorpus, stemDocument, language = "english")
inspect(mycorpus)
[[1]]
a small thread was loo on the sandals, otherwi it look good
data.frame(text=unlist(sapply(mycorpus, `[`, "content")), stringsAsFactors=FALSE)
text
1 <NA>
Answer 0 (score: 2)
By applying
gsub("http\\w+", "", mycorpus)
the output has class character, so it works in my case.
Answer 1 (score: 1)
I cannot reproduce the problem with tm_0.6 in R 3.1.0 on a Mac:
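A minimal sketch of why this works, assuming the documents behave like a plain list of strings: gsub() coerces a non-character argument with as.character() before matching, so the result is always a character vector. The `docs` list below is a hypothetical stand-in, not a tm corpus; whether as.character() yields the bare text for a real corpus depends on how the installed tm version stores its documents.

```r
# Hypothetical stand-in for the corpus: a plain list of strings (not a tm object).
docs <- list("a small thread was loos on the sandals",
             "i was very impress")

# gsub() coerces its input with as.character(), so a character vector comes
# back even though the pattern matches nothing here.
out <- gsub("http\\w+", "", docs)
class(out)

# A character vector converts to a data frame directly.
df <- data.frame(text = out, stringsAsFactors = FALSE)
```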
> data.frame(text=unlist(sapply(mycorpus, `[`, "content")), stringsAsFactors=FALSE)
text
content a small thread was loos on the sandals, otherwis it look good
If I did get those undesirable results, I would immediately try:
data.frame(text=unlist(sapply(mycorpus, `[[`, "content")), stringsAsFactors=FALSE)
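...reasoning that because 'content' is the name of a list element, [['content']] should succeed in extracting it. It also seems to me that unlist may not be needed with this approach: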
> data.frame(text=sapply(mycorpus, `[[`, "content"), stringsAsFactors=FALSE)
text
1 a small thread was loos on the sandals, otherwis it look good
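The difference between `[` and `[[` that this answer relies on can be sketched in base R. The `docs` list below is a hypothetical stand-in that only mimics the shape of a document with a `content` element; it is not a tm object, and the `meta` fields are made up for illustration.

```r
# Hypothetical stand-in for a corpus: each document is a list with a
# "content" element. This mimics the shape only; it is not a tm object.
docs <- list(
  list(content = "a small thread was loos on the sandals", meta = list(id = "1")),
  list(content = "i was very impress", meta = list(id = "2"))
)

# `[` keeps the list wrapper: each result is a one-element list.
single <- lapply(docs, `[`, "content")
class(single[[1]])

# `[[` extracts the element itself, so sapply() can simplify the results
# into a character vector and unlist() is not needed.
double <- sapply(docs, `[[`, "content")
class(double)

df <- data.frame(text = double, stringsAsFactors = FALSE)
```

With `[[`, each row of `df` holds the text of one document.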