Question

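I am currently trying to build a word cloud like this one:

To build a comparison word cloud like the one in the figure, you need a corpus of 4 texts, "ATT", "Verizon", "T-Mobile" and "MetroPCS", each of which is a character vector. This example comes from the tutorial here: Tutorial Mining Twitter with R.

I am struggling because I start from a data frame instead, like this one: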

library(wordcloud)
library(tm)

element <- c("Adams Pearmain ", "Aia Ilu ", "Airlie Red Flesh", "Akane ", "Åkerö ", "Alkmene", "Allington Pippin ", "Ambrosia ", "Anna ", "Annurca ", "Antonovka ", "Apollo ", "Ariane ", "Arkansas Black ", "Arthur Turner")
qty <- c(2, 1, 4, 3, 6, 2, 1, 4, 3, 6, 2, 1, 4, 3, 6)
category1 <- c("Red", "Green", "Red", "Green", "Yellow", "Orange", "Red", "Red", "Green", "Red", "Green", "Yellow",  "Green", "Yellow", "Orange")
category2 <- c("small", "big", "big", "small", "small", "medium", "medium", "medium", "big", "big", "small", "medium", "big", "very big", "medium")
d <- data.frame(element=element, qty=qty, category1=category1, category2=category2)

This gives a data frame like this:

    element             qty category1   category2
1   Adams Pearmain      2   Red         small
2   Aia Ilu             1   Green       big
3   Airlie Red Flesh    4   Red         small
4   Akane               3   Green       big
5   Åkerö               6   Yellow      small
6   Alkmene             2   Orange      big
7   Allington Pippin    1   Red         small
8   Ambrosia            4   Red         big
9   Anna                3   Green       small
10  Annurca             6   Red         big
11  Antonovka           2   Green       small
12  Apollo              1   Yellow      big
13  Ariane              4   Green       small
14  Arkansas Black      3   Yellow      big
15  Arthur Turner       6   Orange      big

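I want to build a comparison word cloud of the colors (Green, Red, ...) weighted by their counts (the qty column). In my real-world example, some texts in the data frame carry a count (for example, 3 means the text was expressed three times).

So there should be a way to replicate each row as many times as the number in its qty column... but I am stuck. What I have so far is: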

## Subsetting two dataframes to category2 values
wordBig <- d[d$category2 == "big",]
wordSmall <- d[d$category2 == "small",]

## Extracting the vectors in the category1 columns
wordSmall <- as.vector(wordSmall$category1)
wordBig <- as.vector(wordBig$category1)

## Building the list for the corpus
wordALL <- list(wordBig, wordSmall)

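The miningtwitter example I followed did not work here, because it uses c() rather than list() to create the elements fed to the Corpus() function. With c(), the elements were simply appended into a single vector, and the resulting corpus had as many documents as there were elements instead of just 2 documents. So you should use list() to build the input for the comparison cloud.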
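The difference between c() and list() can be checked without building a corpus. The following is a minimal sketch in base R with made-up vectors (the values are illustrative, not taken from the data frame above). Since VectorSource() interprets each element of its input as one document, the length of the object is what decides how many documents you get.

```r
# Two illustrative character vectors, one per intended document
wordBig   <- c("Green", "Red", "Yellow")
wordSmall <- c("Red", "Green")

# c() flattens everything into a single character vector of 5 elements
flat <- c(wordBig, wordSmall)

# list() keeps the two vectors apart as 2 separate elements
nested <- list(wordBig, wordSmall)

length(flat)    # 5
length(nested)  # 2
```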

## Building the corpus
corpus <- Corpus(VectorSource(wordALL), readerControl = list(language = "en"))

## Cleaning
myStopWords <- c(stopwords("english"), "other", "another")
corpus <- tm_map(corpus, removeWords, myStopWords)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)

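It is better to remove punctuation after the stop words, especially for French stop words such as c("l'", "j'", "d'", "c'", "qu'"). If you do it the other way round, you are left with odd words such as "lesprit".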
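The effect of the ordering can be illustrated with plain regular expressions. This is only a sketch: it uses base R gsub() as a stand-in for removeWords() and removePunctuation(), and the sample sentence and the pattern stop_pattern are assumptions made up for the illustration.

```r
x <- "l'esprit d'escalier"

# Hypothetical stand-in for a stop word list containing "l'" and "d'"
stop_pattern <- "\\b(l'|d')"

# Punctuation first: the apostrophes disappear, so the stop words no longer match
punct_first <- gsub(stop_pattern, "", gsub("[[:punct:]]", "", x), perl = TRUE)

# Stop words first: "l'" and "d'" are removed while the apostrophes still exist
words_first <- gsub("[[:punct:]]", "", gsub(stop_pattern, "", x, perl = TRUE))

punct_first  # "lesprit descalier"
words_first  # "esprit escalier"
```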

## Matrix
tdm <- TermDocumentMatrix(corpus)
tdm <- as.matrix(tdm)

## Giving readable names
colnames(tdm) <- c("Big Size apples", "Small size apples")

## Plotting
comparison.cloud(tdm, max.words=800, scale=c(4,1), title.size=1.4)

So this works, and basically gives:

Great! But in my real-world example I am still stuck, for instance where my data frame has:

    element             qty category1   category2
4   Akane               3   Green       big

when it should really be:

    element             qty category1   category2
4   Akane               1   Green       big
4   Akane               1   Green       big
4   Akane               1   Green       big

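This would change the results in my word clouds considerably! So my question is: how can I reshape my data frame, or how can I include the frequencies in the word cloud or the corpus, to correct this result?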
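For reference, the reshaping described above could be sketched in base R as follows. This is not a confirmed solution, only an illustration on a small made-up data frame with the same column names. Option 1 repeats each row qty times, which yields the expanded layout shown for Akane. Option 2 is an alternative assumption: it sums qty per color and size directly, producing a terms-by-groups matrix of the same shape as as.matrix(tdm), which might be passed to comparison.cloud() without going through a corpus.

```r
# Toy data frame mirroring the columns used above (values are illustrative)
d <- data.frame(
  element   = c("Akane", "Anna", "Apollo"),
  qty       = c(3, 1, 2),
  category1 = c("Green", "Green", "Yellow"),
  category2 = c("big", "small", "big"),
  stringsAsFactors = FALSE
)

# Option 1: repeat each row index qty times, then reset qty to 1
expanded <- d[rep(seq_len(nrow(d)), times = d$qty), ]
expanded$qty <- 1
rownames(expanded) <- NULL

# Option 2: sum qty per color (rows) and size (columns) directly
freq <- unclass(xtabs(qty ~ category1 + category2, data = d))

expanded
freq
```

With option 1, the existing subsetting and corpus code can be run on expanded instead of d. With option 2, the terms keep their original capitalization because no corpus cleaning is applied.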

R comparison wordcloud from a data frame with quantities

0 Answers: