Question

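I am doing sentiment analysis in R. I created my source file in Excel; it contains about 50 guest reviews, each recorded in a single row of a single column. All the reviews are therefore in column A, with no header. The file was then saved as a csv file and stored in a folder.

My R code is as follows: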

library(tm)
docs <- Corpus(DirSource('E:/Sentiment Analysis'))
# checking a particular review in the document
writeLines(as.character(docs[[20]]))

Running the last line gives me an error message. When I change it to writeLines(as.character(docs[[1]])), R displays all the reviews as one complete paragraph.

How can I fix this?

Answer 1

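When used with tm::Corpus(), the DirSource() function treats each file as a separate document, rather than treating each line within a file as a separate document. With a single csv file in the folder, the corpus therefore contains exactly one document: docs[[20]] fails because there is no 20th document, and docs[[1]] returns the whole file.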

To read each line of a text file as a separate document, use the Corpus(VectorSource()) syntax.
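A minimal sketch of the VectorSource() behavior with an in-memory character vector may help before looking at files. The review texts here are made-up placeholders, not the asker's data, and the variable names are only illustrative.

```r
# Sketch: each element of a character vector becomes one document.
# The review texts are hypothetical placeholders.
library(tm)

reviews <- c("The room was clean and quiet.",
             "Breakfast was disappointing.",
             "Friendly staff, would stay again.")

docs <- Corpus(VectorSource(reviews))

length(docs)                          # number of documents, one per element
writeLines(as.character(docs[[2]]))   # prints only the second review
```

Indexing with docs[[i]] now selects a single review rather than the whole file.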

As an example, we'll create a text file and read it from a directory to illustrate how Corpus() behaves with DirSource(), and then how to read the text with VectorSource().

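# contents of the text file stored in
# ./data/textMining/ExcelFile1.csv
aTextFile <- "This is line one of text.
This is line two of text. This is a second sentence in line two."
# write the file so the example is reproducible (step added for illustration)
dir.create("./data/textMining", recursive = TRUE, showWarnings = FALSE)
writeLines(aTextFile, "./data/textMining/ExcelFile1.csv")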

library(tm)
# read as the OP read it
corpusDir <- "./data/textMining"
aCorpus <- Corpus(DirSource(corpusDir))
length(aCorpus) # shows only one item in list, entire file

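# use pipe as separator because documents include commas;
# quote="" and comment.char="" keep apostrophes and # in the text from breaking the read
aDataFrame <- read.table("./data/textMining/ExcelFile1.csv",header=FALSE,
                         sep="|",quote="",comment.char="",
                         stringsAsFactors=FALSE)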
# use VectorSource to treat each row as a separate document
aCorpus <- Corpus(VectorSource(aDataFrame$V1))
# print the two documents 
aCorpus[1]$content
aCorpus[2]$content

...and the output. First, the length of the corpus that we read with DirSource():

> length(aCorpus) # shows only one item in list, entire file
[1] 1

Second, we print the two lines from the second read, showing that they are treated as separate documents.

> aCorpus <- Corpus(VectorSource(aDataFrame$V1))
> aCorpus[1]$content
[1] "This is line one of text."
> aCorpus[2]$content
[1] "This is line two of text. This is a second sentence in line two. "
>
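As an alternative to the read.table() call above, the lines can be read with readLines(), which returns one string per line and needs no separator. The following is only a sketch under assumptions: the file is written to a temporary directory for illustration, and the name reviews.csv is hypothetical.

```r
# Sketch: read a one-review-per-line file with readLines(), then build the corpus.
# The file and its name ("reviews.csv") are hypothetical, created only for this demo.
library(tm)

reviewFile <- file.path(tempdir(), "reviews.csv")
writeLines(c("This is line one of text.",
             "This is line two of text. This is a second sentence in line two."),
           reviewFile)

# readLines() returns one string per line and does not split on commas
reviewLines <- readLines(reviewFile)
reviewLines <- reviewLines[nzchar(trimws(reviewLines))]  # drop blank lines

reviewCorpus <- Corpus(VectorSource(reviewLines))
length(reviewCorpus)                          # one document per line
writeLines(as.character(reviewCorpus[[2]]))   # second line only
```

One caveat: readLines() does not interpret CSV quoting. If Excel wrapped some reviews in double quotes when saving (it typically does so for cells containing commas or quotes), those quote characters stay in the text, and a review containing a line break inside a cell would be split in two. In that case read.csv() with header=FALSE may be the safer route.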

Why does R combine all the lines of a CSV file into one complete document?
