"Transformation drops documents" error in R

Time: 2018-06-28 11:10:37

Tags: r tm

Whenever I run this code, the tm_map lines give me the following warning message: Warning message: In tm_map.SimpleCorpus(docs, toSpace, "/") : transformation drops documents

texts <- read.csv("./Data/fast food/Domino's/Domino's veg pizza.csv", stringsAsFactors = FALSE)
docs <- Corpus(VectorSource(texts))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removeWords, c("blabla1", "blabla2"))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)

1 answer:

Answer 0 (score: 3)

This warning only appears when you use content_transformer to create your own function, and only when your corpus is based on a VectorSource.

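The reason is that the underlying code checks whether the number of names of the corpus content matches the length of the corpus content. When the text is read in as a vector, there are no document names, so the warning pops up. It is only a warning; no documents are actually dropped.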

See the following example:

library(tm)
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

text <- c("this is my text with a forward slash / and some other text")
text_corpus <- Corpus(VectorSource(text))
inspect(text_corpus)
<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 1

[1] this is my text with a forward slash / and some other text

# warning appears here
text_corpus <- tm_map(text_corpus, toSpace, "/")
inspect(text_corpus)
<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 1

[1] this is my text with a forward slash   and some other text

With the following command you can see that the content of text_corpus has no names:

names(content(text_corpus))
NULL
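
To make the check described above concrete, here is a minimal base-R sketch of the comparison. The helper name would_warn is hypothetical and this is not tm's actual source; it only mimics the "number of names versus length of content" test, on the assumption that the check works the way the answer describes.

```r
# Hypothetical stand-in for the check described in the answer;
# this is NOT copied from tm, it only mimics the comparison.
would_warn <- function(content) {
  n <- names(content)            # NULL when the vector is unnamed
  length(content) != length(n)   # TRUE means the warning would be raised
}

unnamed <- c("this is my text with a forward slash / and some other text")
named   <- c("1" = "this is my text with a forward slash / and some other text")

would_warn(unnamed)  # one document, zero names: mismatch
would_warn(named)    # one document, one name: no mismatch
```

An unnamed vector has names() equal to NULL, whose length is 0, so the lengths differ even though no document has gone missing.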

If you don't want this warning to appear, you need to create a data.frame and use it as the source via DataframeSource.

text <- c("this is my text with a forward slash / and some other text")
doc_ids <- c(1)

df <- data.frame(doc_id = doc_ids, text = text, stringsAsFactors = FALSE)
df_corpus <- Corpus(DataframeSource(df))
inspect(df_corpus)
# no warning appears
df_corpus <- tm_map(df_corpus, toSpace, "/")
inspect(df_corpus)

names(content(df_corpus))
"1"
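
Another option that is often suggested, sketched here on the assumption that the check only lives in the SimpleCorpus method of tm_map, is to build a VCorpus instead of calling Corpus(), which returned a SimpleCorpus in the examples above. The variable name v_corpus is only illustrative, and whether the warning appears can depend on your tm version, so treat this as something to try rather than a guaranteed fix.

```r
library(tm)

toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

text <- c("this is my text with a forward slash / and some other text")

# VCorpus() builds a volatile corpus rather than a SimpleCorpus,
# so tm_map() dispatches to a different method (assumption: that
# method does not perform the names/length check).
v_corpus <- VCorpus(VectorSource(text))
v_corpus <- tm_map(v_corpus, toSpace, "/")

# Look at the transformed text of the first document
content(v_corpus[[1]])
```

The trade-off is that a VCorpus stores each document as a full text document object, so it is somewhat heavier than a SimpleCorpus on large collections.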