How to remove odd characters in an R wordcloud

Time: 2019-01-25 09:44:46

Tags: r tm word-cloud

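I'm trying to build a word cloud in R using a corpus and the various tm_map functions. The problem is that I keep getting an odd symbol: an "â" followed by a euro sign and an inverted quotation mark. It is the second most frequent term in my corpus. (There are one or two others, but they are few and far between, so they are less of a problem.)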

[Image: word cloud with rogue €“]

Any ideas on how to get rid of this?

Here is a sample of the text, in .txt format, before it is pulled into R:

  

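The move to Virtual Replication 6 added replication in and out of AWS where that had only previously been one-way, into the Amazon cloud storage platform.

It had taken longer to develop in AWS, said Zerto technology evangelist Gjisbert Janssen van Doorn. “Bi-directional replication to and from Azure was where we started. We try to develop natively via APIs for the clouds we support but that had taken longer with AWS.” Zerto has also added bi-directional replication with IBM Cloud. van Doorn said the company had no plan to add support for Google Cloud Platform. “It’s something we’re keeping an eye on. It’s on the wishlist rather than the roadmap,” he said.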

Here is what it looks like after being pulled into R via Corpus():

  

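The move to Virtual Replication 6 added replication in and out of AWS where that had only previously been one-way, into the Amazon cloud storage platform.\n\nIt had taken longer to develop in AWS, said Zerto technology evangelist Gjisbert Janssen van Doorn. “Bi-directional replication to and from Azure was where we started. We try to develop natively via APIs for the clouds we support but that had taken longer with AWS.” Zerto has also added bi-directional replication with IBM Cloud. van Doorn said the company had no plan to add support for Google Cloud Platform. “It’s something we’re keeping an eye on. It’s on the wishlist rather than the roadmap,” he said.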

Then I run the following code:

# Convert the text to lower case
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove numbers
corpus <- tm_map(corpus, removeNumbers)
# Remove common English stop words
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# Remove punctuation
corpus <- tm_map(corpus, removePunctuation)
# Remove your own stop words,
# specified as a character vector
corpus <- tm_map(corpus, removeWords, c("new", "products", "way", "back",
                                        "can", "need", "also", "â", "look", "will", "one", "right",
                                        "move", "gorge", "mathieu", "like",
                                        "said", "€“", "–", "â", "data",
                                        "use", "storage"))
# Remove punctuation (second pass)
corpus <- tm_map(corpus, removePunctuation)
# Eliminate extra white spaces
corpus <- tm_map(corpus, stripWhitespace)

After that, the same body of text looks like this:

  

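virtual replication added replication aws previously oneway amazon cloud platform taken longer develop aws zerto technology evangelist gjisbert janssen van doorn – bidirectional replication azure started try develop natively via apis clouds support taken longer aws – added bidirectional replication ibm cloud van doorn company plan add support google cloud platform - something keeping eye wishlist rather roadmap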

So those tm_map functions are not removing all of the junk, and the word cloud I build from this text still contains it.

Is there any way to fix this?
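One possible explanation, which the post does not confirm, is an encoding mismatch: the symbol described (an "â", a euro sign and a curly quote) is what the UTF-8 bytes of an en dash look like when they are read as Windows-1252. The sketch below reproduces that effect with base R only; the variable names are illustrative, and the encoding name "CP1252" is assumed to be available in the local iconv.

```r
# An en dash (U+2013) is stored in UTF-8 as the three bytes E2 80 93.
dash <- "\u2013"

# Reinterpreting those bytes as Windows-1252 yields three separate characters.
garbled <- iconv(dash, from = "CP1252", to = "UTF-8")
print(garbled)

# Reversing the mistake: write the characters back out as Windows-1252 bytes,
# then mark those bytes as UTF-8 so R reads them as one character again.
repaired <- iconv(garbled, from = "UTF-8", to = "CP1252")
Encoding(repaired) <- "UTF-8"
print(identical(repaired, dash))
```

If this is what is happening, declaring the file encoding when the text is read (for example through the encoding argument of tm's DirSource) may prevent the characters from appearing in the first place.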

1 Answer:

Answer 0: (score: 1)

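If you don't mind using an extra package, you can use the textclean package, which works well in combination with the tm functions. It contains various useful functions for cleaning text that has weird characters, URLs, emoji and so on. For your example text, you need the function replace_curly_quote to deal with curly quote characters such as ” and ’, and replace_contraction to replace "it's" with "it is". See the working example below. After all that, you can just use the wordcloud package to create the word cloud.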

txt <- "The move to Virtual Replication 6 added replication in and out of AWS where that had only previously been one-way, into the Amazon cloud storage platform. It had taken longer to develop in AWS, said Zerto technology evangelist Gjisbert Janssen van Doorn. “Bi-directional replication to and from Azure was where we started. We try to develop natively via APIs for the clouds we support but that had taken longer with AWS.” Zerto has also added bi-directional replication with IBM Cloud. van Doorn said the company had no plan to add support for Google Cloud Platform. “It’s something we’re keeping an eye on. It’s on the wishlist rather than the roadmap,” he said."

library(tm)
library(textclean)

corpus <- VCorpus(VectorSource(txt))
corpus <- tm_map(corpus, content_transformer(tolower))

# textclean function that handles curly quotes such as ” and ’
corpus <- tm_map(corpus, replace_curly_quote)
# textclean function that replaces "it's" with "it is"
corpus <- tm_map(corpus, replace_contraction)

# Remove punctuation
corpus <- tm_map(corpus, removePunctuation)

# Remove numbers
corpus <- tm_map(corpus, removeNumbers)

# Remove common English stop words
corpus <- tm_map(corpus, removeWords, stopwords("english"))

my_stopwords <- c("new", "products", "way", "back", "can", "need", "also", 
                  "look", "will", "one", "right","move", "gorge", "mathieu", 
                  "like", "said", "data","use", "storage")

corpus <- tm_map(corpus, removeWords, my_stopwords)

# Remove the extra whitespace left behind by the steps above
corpus <- tm_map(corpus, stripWhitespace)

content(corpus)
[[1]]
[1] " virtual replication added replication aws previously oneway amazon cloud platform taken longer develop aws zerto technology evangelist gjisbert janssen van doorn bidirectional replication azure started try develop natively via apis clouds support taken longer aws zerto added bidirectional replication ibm cloud van doorn company plan add support google cloud platform something keeping eye wishlist rather roadmap "
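
The answer mentions the wordcloud package for the final step without showing it. A minimal sketch of that step follows, assuming the cleaned text is available as a single string (a shortened sample is used here). The term counts are computed with base R rather than with tm, and the names cleaned, words and freq are illustrative only.

```r
# Shortened sample of cleaned text, similar in shape to the output above.
cleaned <- " virtual replication added replication aws zerto added bidirectional replication ibm cloud "

# Split on whitespace and count how often each term occurs.
words <- strsplit(trimws(cleaned), "\\s+")[[1]]
freq <- sort(table(words), decreasing = TRUE)
print(freq)

# Draw the cloud only if the wordcloud package is installed.
if (requireNamespace("wordcloud", quietly = TRUE)) {
  set.seed(1234)  # arbitrary seed so the layout is repeatable
  wordcloud::wordcloud(words = names(freq), freq = as.vector(freq),
                       min.freq = 1, random.order = FALSE)
}
```

min.freq is set to 1 here because the sample is tiny; on a real corpus a higher threshold is likely to give a more readable cloud.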