After removing stop words, my output is not retained when I clean the tweets further in R

Time: 2018-04-23 16:01:30

Tags: r output tm stop-words

I am doing sentiment analysis. I have two files in my directory: corpus 1 contains the positive tweets and the other contains the negative tweets, but in the comparison wordcloud I still see words that are stop words. That means removeWords with stopwords("english") is not removing them. I also created custom stop words, but that output was not retained either. I then searched and found a stopwords.txt file (a collection of stop words) on GitHub, downloaded it, and used it to remove stop words. To read this file I loaded it as a table (data frame) and then converted the relevant column to an atomic vector. I combined it with the tm library's stop words, and the output was as expected. But when I then tried to remove punctuation and inspected the corpus, the output only reflected removePunctuation, without keeping the stop-word removal.
Then I tried removeNumbers and inspected the corpus: it did not keep the stop-word output either, although it kept the removePunctuation output. So what is going wrong here?

What am I missing here? [Here is the code]
[1] [This is the output after removing stop words from the tweets using R] [2] [This is the output after applying the other cleaning steps, such as removePunctuation, removeNumbers, stripWhitespace, and stemDocument, but it does not keep the output of the stop-word removal]
[3]
    [1]:https://i.stack.imgur.com/RMbvD.png
    [2]:https://i.stack.imgur.com/18H3P.png
    [3]:https://i.stack.imgur.com/SxaJE.png

Here is the code I have used. I placed the two text files in a directory and converted them into a corpus.

library(tm)
tweets_corpus <- Corpus(DirSource(directory = "D:/New-RStudio-Project/tweets"))
summary(tweets_corpus)
##cleaning the tweets_corpus ##
clean_tweets_corpus <- tm_map(tweets_corpus, tolower)
##removing stopwords##
clean_tweets_corpus <- tm_map(tweets_corpus, removeWords, 
stopwords("english"))
inspect(clean_tweets_corpus)
##having stopwords.txt (collection of stopwords) to remove the stopwords##
stop = read.table("stopwords.txt", header = TRUE)
class(stop)
stop
stop_vec = as.vector(stop$CUSTOM_STOP_WORDS)
class(stop_vec)
stop_vec
clean_tweets_corpus <- tm_map(tweets_corpus, removeWords, 
c(stopwords("english"), stop_vec))
inspect(clean_tweets_corpus)
## remove to have single characters ##
remove_multiplechar<-function(x) gsub("\\b[A-z]\\b{1}"," ",x)
clean_tweets_corpus<-tm_map(tweets_corpus, 
content_transformer(remove_multiplechar))
inspect(clean_tweets_corpus)
clean_tweets_corpus <- tm_map(tweets_corpus, removePunctuation)
clean_tweets_corpus <- tm_map(tweets_corpus,removeNumbers)
clean_tweets_corpus <- tm_map(tweets_corpus, stripWhitespace)
clean_tweets_corpus <- tm_map(tweets_corpus, stemDocument)
inspect(clean_tweets_corpus)
str(clean_tweets_corpus)

1 Answer:

Answer 0 (score: 0)

Here is the corrected code, which replaces "tweets_corpus" with "clean_tweets_corpus" in every tm_map call except the first (the tolower step). Each tm_map call in your original code read from the untouched tweets_corpus, so every assignment overwrote the result of the previous cleaning step instead of building on it:

library(tm)
tweets_corpus <- Corpus(DirSource(directory = "D:/New-RStudio-Project/tweets"))
summary(tweets_corpus)

##cleaning the tweets_corpus ##
clean_tweets_corpus <- tm_map(tweets_corpus, tolower)

##removing stopwords##
##having stopwords.txt (collection of stopwords) to remove the stopwords##
stop = read.table("stopwords.txt", header = TRUE)
stop_vec = as.vector(stop$CUSTOM_STOP_WORDS)

clean_tweets_corpus <- tm_map(clean_tweets_corpus, removeWords, 
                              c(stopwords("english"), stop_vec))

## remove to have single characters ##
remove_multiplechar<-function(x) gsub("\\b[A-z]\\b{1}"," ",x)
clean_tweets_corpus<-tm_map(clean_tweets_corpus, 
                            content_transformer(remove_multiplechar))

clean_tweets_corpus <- tm_map(clean_tweets_corpus, removePunctuation)
clean_tweets_corpus <- tm_map(clean_tweets_corpus, removeNumbers)
clean_tweets_corpus <- tm_map(clean_tweets_corpus, stripWhitespace)
clean_tweets_corpus <- tm_map(clean_tweets_corpus, stemDocument)