Order of ggplot groups when plotting with repeating factors

Time: 2017-07-24 17:53:11

Tags: r ggplot2 tidyverse

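I'm playing around with some text analysis and trying to display the top words per book by tf-idf (term frequency times inverse document frequency, a numeric value). I've been following along with Tidy Text Mining, but using the Harry Potter books.

Some of the top words are shared between books (for example Lupin or Griphook), and when plotting, a shared word is positioned according to its maximum value across all books. For example, griphook is a key word in both Sorcerer's Stone and Deathly Hallows. Its value is .0007 in Deathly Hallows but only .0002 in Sorcerer's Stone, yet it is plotted as the top word for Sorcerer's Stone.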

ggplot output

hp.plot <- hp.words %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word))))

##For correct ordering of books
hp.plot$book <- factor(hp.plot$book, levels = c('Sorcerer\'s Stone', 'Chamber of Secrets',
                                                 'Prisoner of Azkhaban', 'Goblet of Fire',
                                                 'Order of the Phoenix', 'Half-Blood Prince',
                                                 'Deathly Hallows'))

hp.plot %>%
  group_by(book) %>% 
  top_n(10) %>% 
  ungroup %>%
  ggplot(aes(x=word, y=tf_idf, fill = book, group = book)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~book, scales = "free") +
  coord_flip()

Here's an image of the data frame for your reference.

I tried sorting beforehand, but that doesn't seem to work. Any ideas?

Edit: CSV is here

3 Answers:

Answer 0: (score: 2)

The reorder() function reorders the levels of a factor by a specified variable (see ?reorder).
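As a minimal sketch of what reorder() does, assuming a small made-up data frame (the words and values below are illustrative and not taken from the question's data): by default it sorts the levels by the mean of the second argument within each level, so a word that appears in several rows ends up with one pooled position.

```r
# Toy data (hypothetical values): "griphook" appears in two rows.
toy <- data.frame(
  word   = c("griphook", "quirrell", "griphook", "lupin"),
  tf_idf = c(0.0002, 0.0005, 0.0007, 0.0004)
)

# reorder() returns a factor whose levels are sorted (ascending) by a
# summary of tf_idf per word; the default summary function is mean().
by_mean <- reorder(toy$word, toy$tf_idf)
levels(by_mean)  # "lupin" "griphook" "quirrell"

# Pass FUN to use a different summary, e.g. the maximum per word.
by_max <- reorder(toy$word, toy$tf_idf, FUN = max)
levels(by_max)   # "lupin" "quirrell" "griphook"
```

Either way, each word gets exactly one position in the level order, regardless of how many groups it appears in.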

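Inserting mutate(word = reorder(word, tf_idf)) after ungroup() in the last block before plotting should reorder the words by tf_idf. I don't have a sample of your data, but the same approach works with the janeaustenr package: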

library(tidytext)
library(janeaustenr)
library(dplyr)

book_words <- austen_books() %>%
  unnest_tokens(word, text) %>%
  count(book, word, sort = TRUE) %>%
  ungroup()

total_words <- book_words %>% 
  group_by(book) %>% 
  summarize(total = sum(n))

book_words <- left_join(book_words, total_words)

book_words <- book_words %>%
  bind_tf_idf(word, book, n) 


library(ggplot2)
book_words %>% 
  group_by(book) %>%
  top_n(10) %>% 
  ungroup() %>% 
  mutate(word = reorder(word, tf_idf)) %>% 
  ggplot(aes(x = word, y = tf_idf, fill = book, group = book)) + 
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~book, scales = "free") +
  coord_flip()

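Answer 1: (score: 1)

This question has been answered before, but I wasn't familiar enough with the ggplot terminology to find it. It is answered in the SO post below.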

ggplot: Order bars in faceted bar chart per facet
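One common way to get per-facet ordering is sketched below; it may not match the linked post line for line. A factor has a single level order for the whole plot, so a word shared by two books can only hold one position. The workaround is to build a key that is unique per book, order that key by tf_idf, and strip the book suffix from the axis labels. The toy values, the `key` column, and the `___` separator are assumptions made for illustration.

```r
library(ggplot2)

# Toy data (hypothetical values): "griphook" is a top word in two books.
toy <- data.frame(
  book   = c("Sorcerer's Stone", "Sorcerer's Stone", "Sorcerer's Stone",
             "Deathly Hallows", "Deathly Hallows"),
  word   = c("quirrell", "flamel", "griphook", "griphook", "xenophilius"),
  tf_idf = c(0.0009, 0.0005, 0.0002, 0.0007, 0.0004)
)

# Build a key unique to each (word, book) pair, then order its levels
# by tf_idf. Every key occurs once, so no pooling across books happens.
toy$key <- paste(toy$word, toy$book, sep = "___")
toy$key <- reorder(toy$key, toy$tf_idf)

p <- ggplot(toy, aes(x = key, y = tf_idf, fill = book)) +
  geom_col(show.legend = FALSE) +
  # Drop the "___book" suffix so the axis shows only the word.
  scale_x_discrete(labels = function(x) sub("___.*$", "", x)) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~book, scales = "free") +
  coord_flip()

print(p)
```

With free scales, each facet shows only its own keys, so griphook sits at the bottom of the Sorcerer's Stone panel and at the top of the Deathly Hallows panel. Recent versions of tidytext also provide reorder_within() and scale_x_reordered(), which wrap the same idea, if your installed version includes them.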

Answer 2: (score: 0)

If you want to change the order of the factor levels manually, you can try:

word = factor(word, levels = word[c(grep("griphook", word)[1], grep("quirrell", word)[1], ...)]);

If you want to order the factor levels by tf_idf, you can use something like the following:

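## Row indices sorted from highest to lowest tf_idf
## (replaces the loop, which used an undefined l and matched numbers via grep)
level_ordered = order(tf_idf, decreasing = TRUE)
## unique() keeps the first (highest) position of a repeated word,
## because factor() rejects duplicated levels
word = factor(word, levels = unique(as.character(word[level_ordered])))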