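To make a comparison wordcloud like the one in the figure below, you need a corpus containing 4 texts, "ATT", "Verizon", "T-Mobile" and "MetroPCS", each of which is a character vector. This example comes from the tutorial here: Tutorial Mining Twitter with R.
I am struggling because I am starting from a data frame, like this one: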
library(wordcloud)
library(tm)
element <- c("Adams Pearmain ", "Aia Ilu ", "Airlie Red Flesh", "Akane ", "Åkerö ", "Alkmene", "Allington Pippin ", "Ambrosia ", "Anna ", "Annurca ", "Antonovka ", "Apollo ", "Ariane ", "Arkansas Black ", "Arthur Turner")
qty <- c(2, 1, 4, 3, 6, 2, 1, 4, 3, 6, 2, 1, 4, 3, 6)
category1 <- c("Red", "Green", "Red", "Green", "Yellow", "Orange", "Red", "Red", "Green", "Red", "Green", "Yellow", "Green", "Yellow", "Orange")
category2 <- c("small", "big", "big", "small", "small", "medium", "medium", "medium", "big", "big", "small", "medium", "big", "very big", "medium")
d <- data.frame(element=element, qty=qty, category1=category1, category2=category2)
This gives a data frame like this:
            element qty category1 category2
1    Adams Pearmain   2       Red     small
2           Aia Ilu   1     Green       big
3  Airlie Red Flesh   4       Red     small
4             Akane   3     Green       big
5             Åkerö   6    Yellow     small
6           Alkmene   2    Orange       big
7  Allington Pippin   1       Red     small
8          Ambrosia   4       Red       big
9              Anna   3     Green     small
10          Annurca   6       Red       big
11        Antonovka   2     Green     small
12           Apollo   1    Yellow       big
13           Ariane   4     Green     small
14   Arkansas Black   3    Yellow       big
15    Arthur Turner   6    Orange       big
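I want to make a comparison wordcloud in which the words are the colors (Green, Red, ...) and their sizes reflect their frequencies (the qty column). In my real-world example, some texts in the data frame carry a count (for example 3 = they were expressed three times).
So there should be a way to repeat each row as many times as the number in qty... but I am stuck. What I have so far is: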
## Subsetting two dataframes to category2 values
wordBig <- d[d$category2 == "big",]
wordSmall <- d[d$category2 == "small",]
## Extracting the vectors in the category1 columns
wordSmall <- as.vector(wordSmall$category1)
wordBig <- as.vector(wordBig$category1)
## Building the list for the corpus
wordALL <- list(wordBig, wordSmall)
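The miningtwitter example I used did not work here, because it uses c() rather than list() to create the elements passed to the Corpus() function. With c(), the elements are simply concatenated into a single vector, and the resulting corpus has as many documents as there are elements instead of just 2 documents. So you should use the list() function to build the comparison cloud.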
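A minimal sketch of the difference, in base R only. The two short vectors below are made-up stand-ins for wordBig and wordSmall; the point is just that VectorSource() treats each element of its input as one document, so the length of the object is the number of documents you get.

```r
# Hypothetical stand-ins for the two category1 vectors
wordBig   <- c("Green", "Red", "Green")
wordSmall <- c("Red", "Red")

# c() flattens everything into one character vector:
# each word would become its own document
flat <- c(wordBig, wordSmall)
length(flat)

# list() keeps the two groups separate:
# each group would become one document
nested <- list(wordBig, wordSmall)
length(nested)
```

Comparing length(flat) with length(nested) shows why the corpus built from c() ends up with one document per word.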
## Building the corpus
corpus <- Corpus(VectorSource(wordALL), readerControl = list(language = "en"))
# Cleaning
myStopWords <- c(stopwords("english"), "other", "another")
corpus <- tm_map(corpus, removeWords, myStopWords)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
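It is better to remove punctuation after the stop words, especially for French with stop words such as c("l'", "j'", "d'", "c'", "qu'"). If you do it the other way around, you are left with odd words such as "lesprit".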
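A small sketch of why the order matters, applying tm's removeWords() and removePunctuation() directly to a character string rather than to a corpus. The sample string and the names frStop, good and bad are made up for illustration.

```r
library(tm)

# A few French elided stop words (illustrative, not a complete list)
frStop <- c("l'", "j'", "d'", "c'", "qu'")
txt <- "l'esprit"

# Stop words first, punctuation second: the elided article is removed
good <- removePunctuation(removeWords(txt, frStop))

# Punctuation first: the apostrophe is gone, so "l'" no longer matches
bad <- removeWords(removePunctuation(txt), frStop)

print(c(good = good, bad = bad))
```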
## Matrix
tdm <- TermDocumentMatrix(corpus)
tdm <- as.matrix(tdm)
## Giving readable names
colnames(tdm) <- c("Big Size apples", "Small size apples")
## Plotting
comparison.cloud(tdm, max.words=800, scale=c(4,1), title.size=1.4)
So this works and produces the comparison cloud.
Great! But in my real-world example I am still stuck, because a row in my data frame such as:
element qty category1 category2
4 Akane **3** Green big
should actually be:
element qty category1 category2
4 Akane 1 Green big
4 Akane 1 Green big
4 Akane 1 Green big
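This would change the results in my wordclouds considerably! So my question is: how can I reshape my data frame, or how can I include the frequencies in the wordcloud or in the corpus, to correct this result?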
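A minimal sketch of the two ideas in base R, on a made-up four-row data frame (d_small, d_long and m are names invented for this illustration). The first part repeats each row qty times with rep(). The second part skips the corpus and builds a term-by-document count matrix weighted by qty with xtabs(). This assumes a plain matrix of counts, with words as rows and documents as columns, is acceptable input for comparison.cloud(), so the plotting call is left commented out.

```r
# Hypothetical small data frame with the same columns as d
d_small <- data.frame(
  element   = c("Adams Pearmain", "Aia Ilu", "Airlie Red Flesh", "Akane"),
  qty       = c(2, 1, 4, 3),
  category1 = c("Red", "Green", "Red", "Green"),
  category2 = c("small", "big", "small", "big")
)

# Idea 1: repeat each row as many times as its qty value
d_long <- d_small[rep(seq_len(nrow(d_small)), times = d_small$qty), ]
d_long$qty <- 1          # every repeated row now counts once
rownames(d_long) <- NULL # drop the 4, 4.1, 4.2 style row names

# Idea 2: sum qty per (category1, category2) pair directly
tab <- xtabs(qty ~ category1 + category2, data = d_small)
m <- as.matrix(as.data.frame.matrix(tab)) # plain matrix: rows = colors, columns = sizes

print(m)

# Assumed usage, not run here:
# library(wordcloud)
# comparison.cloud(m, max.words = 800, scale = c(4, 1), title.size = 1.4)
```

Both routes should give the same counts; the xtabs() route avoids building a long data frame when the qty values are large.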