Question

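I have a directory containing a set of text files, each of which is a document with multiple paragraphs. I want to read these documents and run sentiment analysis on them.

For example, I have a text document data/hello.txt with the following text: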

"Hello world.  
 This is an apple.

 That is an orange"

I read the document as follows (there may also be multiple documents):

library(tm)
docs <- VCorpus(DirSource('./data'))  # DirSource() expects a directory, not a single file path

When I inspect the document content with docs[[1]]$content, it appears to be a character vector with one element per line:

[1] "hello  world"        "this is apple."      ""                   
[4] "That is an orange. " ""

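My question is how to read these documents so that the paragraphs in each document are joined into a single string that I can then use for sentiment analysis. (VCorpus is from the tm package.)

Thank you very much.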

Answer 1

You can read the text with the readtext package and then build the VCorpus with VectorSource().

txt <- "Hello world.\nThis is an apple.\n\nThat is an orange"

tf <- tempfile("temp", fileext = ".txt")
cat(txt, file = tf)

library("readtext")
rtxt <- readtext(tf)

cat(rtxt$text)
## Hello world.
## This is an apple.
## 
## That is an orange

library("tm")
## Loading required package: NLP
docs <- VCorpus(VectorSource(rtxt$text))
cat(docs[[1]]$content)
## Hello world.
## This is an apple.
## 
## That is an orange
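
If you would rather stay within tm, the per-line character vector can also be collapsed after reading. The following is a minimal sketch, not part of the original answer: the demo directory and the helper name collapse_lines are made up for illustration, and it assumes that joining lines with "\n" is acceptable for your sentiment tool.

```r
library(tm)

# Hypothetical demo directory holding one multi-paragraph file.
data_dir <- file.path(tempdir(), "tm_demo")
dir.create(data_dir, showWarnings = FALSE)
writeLines(c("Hello world.", "This is an apple.", "", "That is an orange"),
           file.path(data_dir, "hello.txt"))

# DirSource() reads every file in the directory, one document per file.
docs <- VCorpus(DirSource(data_dir))

# Join the per-line character vector of each document into one string.
# collapse_lines is an illustrative name, not a tm function.
collapse_lines <- content_transformer(function(x) paste(x, collapse = "\n"))
docs <- tm_map(docs, collapse_lines)

# Each document's content is now a single string.
n_elements <- length(content(docs[[1]]))
```

Use collapse = " " instead if the downstream tool should not see line breaks.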

The data.frame created by readtext() can also be used directly in the quanteda package (a more fully featured alternative to tm).

# alternative
library("quanteda")
## Package version: 1.4.3
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

corp <- corpus(rtxt)  # works directly
cat(texts(corp))      # simpler? (texts() is deprecated in quanteda >= 3; use as.character(corp) there)
## Hello world.
## This is an apple.
## 
## That is an orange

VCorpus(VectorSource(texts(corp))) # if you must...
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 1
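
Since the question mentions several documents, here is a hedged sketch of the same readtext approach applied to a whole directory. The directory and file names are invented for the demo. It relies on readtext() accepting a glob pattern and returning one row, and therefore one string, per file.

```r
library(readtext)
library(tm)

# Hypothetical demo directory with two multi-paragraph files.
data_dir <- file.path(tempdir(), "readtext_demo")
dir.create(data_dir, showWarnings = FALSE)
writeLines(c("Hello world.", "This is an apple.", "", "That is an orange"),
           file.path(data_dir, "hello.txt"))
writeLines(c("First paragraph.", "", "Second paragraph."),
           file.path(data_dir, "other.txt"))

# A glob pattern reads every matching file; each file becomes one row.
rtxt <- readtext(file.path(data_dir, "*.txt"))

# One string per file, so the corpus gets one document per file.
docs <- VCorpus(VectorSource(rtxt$text))

n_docs <- length(docs)
n_elements <- length(content(docs[[1]]))
```

Here rtxt$doc_id holds the file names, which is useful for matching sentiment scores back to their source files.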
