将`top_n`和`arrange`传递给ggplot(dplyr)

时间:2018-05-16 16:25:22

标签: r ggplot2 tidytext

我想在我自己的数据集中复制TidyText Mining Section 3.3中的一大块代码。但是,在我的数据中,我无法让ggplot记得'我希望数据按降序排列,我想要一个top_n

我可以从TidyText Mining运行代码,我得到了本书所示的相同图表。但是,当我在自己的数据集上运行时,facet换行不会显示top_n(它们似乎显示随机数量的类别),并且每个facet中的数据不按降序排序。

我可以用一些随机文本数据和完整代码复制这个问题 - 但我也可以用mtcars复制这个问题 - 这让我很困惑。

我希望下面的图表以每个方面的降序显示mpg,并且每个方面只给我顶部的 1 类别。它既不适合我。

require(tidyverse)

mtcars %>%
  arrange (desc(mpg)) %>%
  mutate (gear = factor(gear, levels = rev(unique(gear)))) %>%
  group_by(am) %>%
  top_n(1) %>%
  ungroup %>%
  ggplot (aes (gear, mpg, fill = am)) +
  geom_col (show.legend = FALSE) +
  labs (x = NULL, y = "mpg") +
  facet_wrap(~am, ncol = 2, scales = "free") + 
  coord_flip()

但我真正想要的是像TidyText书中那样排序这样的图表(仅作为数据)。

require(tidyverse)
require(tidytext)

starwars <- tibble (film = c("ANH", "ESB", "ROJ"),
                  text = c("It is a period of civil war. Rebel spaceships, striking from a hidden base, have won their first victory against the evil Galactic Empire. During the battle, Rebel spies managed to steal secret plans to the Empire's ultimate weapon, the DEATH STAR, an armored space station with enough power to destroy an entire planet. Pursued by the Empire's sinister agents, Princess Leia races home aboard her starship, custodian of the stolen plans that can save her people and restore freedom to the galaxy.....",
                           "It is a dark time for the Rebellion. Although the Death Star has been destroyed, Imperial troops have driven the Rebel forces from their hidden base and pursued them across the galaxy. Evading the dreaded Imperial Starfleet, a group of freedom fighters led by Luke Skywalker has established a new secret base on the remote ice world of Hoth. The evil lord Darth Vader, obsessed with finding young Skywalker, has dispatched thousands of remote probes into the far reaches of space....",
                           "Luke Skywalker has returned to his home planet of Tatooine in an attempt to rescue his friend Han Solo from the clutches of the vile gangster Jabba the Hutt. Little does Luke know that the GALACTIC EMPIRE has secretly begun construction on a new armored space station even more powerful than the first dreaded Death Star. When completed, this ultimate weapon will spell certain doom for the small band of rebels struggling to restore freedom to the galaxy...")) %>%
  unnest_tokens(word, text) %>%
  mutate(film = as.factor(film)) %>%
  count(film, word, sort = TRUE) %>%
  ungroup()

total_wars <- starwars %>%
  group_by(film) %>%
  summarize(total = sum(n))

starwars <- left_join(starwars, total_wars)

starwars <- starwars %>%
  bind_tf_idf(word, film, n)

starwars %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  group_by(film) %>%
  top_n(10) %>%
  ungroup %>%
  ggplot(aes(word, tf_idf, fill = film)) +
  geom_col(show.legend = FALSE) +
  labs (x = NULL, y = "tf-idf") +
  facet_wrap(~film, ncol = 2, scales = "free") +
  coord_flip()

1 个答案:

答案 0 :(得分:2)

我相信你在这里绊倒的是top_n()默认为表中的最后一个变量,除非你告诉它用于排序的变量。在本书的示例中,数据框中的最后一个变量是tf_idf,因此用于排序。在mtcars示例中,top_n()正在使用数据框中的最后一列进行排序;恰好是carb

您可以通过将其作为参数传递,始终告诉top_n()您要用于排序的变量。例如,使用钻石数据集查看此类似工作流程。

library(tidyverse)

diamonds %>%
  arrange(desc(price)) %>%
  group_by(clarity) %>%
  top_n(10, price) %>%
  ungroup %>%
  ggplot(aes(cut, price, fill = clarity)) +
  geom_col(show.legend = FALSE, ) +
  facet_wrap(~clarity, scales = "free") + 
  scale_x_discrete(drop=FALSE) +
  coord_flip()

reprex package(v0.2.0)创建于2018-05-17。

这些示例数据集并不是完美的平行线,因为它们不像整齐的文本数据框那样,每个特征组合都有一行。我很确定top_n()的问题是问题所在。