Select the top n groups with dplyr, then plot against other variables

Time: 2019-01-18 20:09:11

Tags: r ggplot2 dplyr

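I have a dataset in which I want to keep only the top n categories by count, and then plot using the other variables in the dataset. In other words, the top n is determined at one level of aggregation, but I need to go back to the full data to plot it in ggplot.

So in the example below, I want the two most common examName values, and then plot the count by year, with one facet_wrap panel per exam.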

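library(dplyr)    # count(), inner_join(), %>%
library(tibble)   # tribble()
library(ggplot2)  # ggplot(), geom_line(), facet_wrap()

ap <- 
      tribble(
        ~year, ~examName,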
        2014, "Statistics",
        2015, "Statistics",
        2016, "Statistics",
        2016, "Statistics",
        2016, "Statistics",
        2016, "Statistics",
        2017, "Statistics",
        2017, "Statistics",
        2017, "Statistics",
        2017, "Statistics",
        2017, "Statistics",
        2013, "Macroeconomics",
        2013, "Macroeconomics",
        2014, "Macroeconomics",
        2015, "Macroeconomics",
        2016, "Macroeconomics",
        2016, "Macroeconomics",
        2016, "Macroeconomics",
        2016, "Macroeconomics",
        2016, "Macroeconomics",
        2017, "Macroeconomics",
        2017, "Macroeconomics",
        2017, "Macroeconomics",
        2017, "Macroeconomics",
        2017, "Macroeconomics",
        2017, "Macroeconomics",
        2013, "Calculus",
        2014, "Calculus",
        2015, "Calculus",
        2016, "Calculus",
        2017, "Calculus",
        2017, "Psychology",
        2017, "Psychology",
        2017, "Psychology",
        2017, "Psychology",
        2017, "Psychology",
        2018, "Psychology",
        2018, "Psychology")


ap_top <- ap %>% 
    count(examName, sort = TRUE) %>% 
    head(2) %>% 
    inner_join(ap, by = "examName") %>% 
    select(-n)

ap_top %>% 
    count(examName, year) %>% 
    ggplot(aes(x = year, y = n, group = examName)) +
    geom_line() +
    facet_wrap(~ examName)

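My idea was to get my top n and then inner_join back to the original dataset, then use the result for plotting; essentially, the inner join acts as a filter.

I'm sure there is a better way to do this, and I'm hoping for a more elegant solution. I'm all ears! The example dataset is given above (sorry it's so long).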
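
The join-as-filter idea described above can also be sketched with dplyr's filtering join, semi_join(), which keeps the matching rows without bringing over the n column, so the select(-n) step is no longer needed. This is only a sketch: ap_small is a hypothetical, shortened stand-in for the ap data, and top2 and ap_top_small are illustrative names.

```r
library(dplyr)

# Hypothetical, shortened stand-in for the question's `ap` data
ap_small <- tibble(
  year = c(2016, 2017, 2017, 2016, 2017, 2017, 2017, 2015),
  examName = c("Statistics", "Statistics", "Statistics",
               "Macroeconomics", "Macroeconomics", "Macroeconomics",
               "Macroeconomics", "Calculus")
)

# The two most frequent exams, with their counts in column n
top2 <- ap_small %>%
  count(examName, sort = TRUE) %>%
  head(2)

# semi_join() keeps the rows of ap_small that have a match in top2,
# but adds no columns from top2
ap_top_small <- semi_join(ap_small, top2, by = "examName")
```

The result has the same columns as the input and can be piped into count(examName, year) and ggplot() exactly as in the question.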

2 answers:

Answer 0 (score: 5)

You don't need inner_join(). I would just determine the top two exams in a separate statement and then filter on those exams.

top_exams <- count(ap, examName) %>% 
  top_n(2, n) %>% pull(examName)

ap %>% 
  filter(examName %in% top_exams) %>% 
  count(year, examName) %>% 
  ggplot(aes(x = year, y = n, group = examName)) +
  geom_line() +
  facet_wrap(~ examName)
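
A side note on the first step: in dplyr 1.0.0 and later, top_n() is marked as superseded in favor of slice_max(). Assuming such a version is installed, the selection of the top two exams might look like the sketch below. The exam_counts table is a stand-in for count(ap, examName), with the count column named total (an illustrative name) so it does not collide with the n argument of slice_max().

```r
library(dplyr)

# Stand-in for count(ap, examName, name = "total") on the question's data
exam_counts <- tibble(
  examName = c("Statistics", "Macroeconomics", "Calculus", "Psychology"),
  total    = c(11, 15, 5, 7)
)

# Keep the two rows with the largest totals, then extract the exam names
top_exams <- exam_counts %>%
  slice_max(total, n = 2) %>%
  pull(examName)
```

Like top_n(), slice_max() keeps tied rows by default (with_ties = TRUE), so more than two names can come back when counts are equal.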

Answer 1 (score: 2)

Another possibility:

ap %>% 
 group_by(examName) %>%
 mutate(temp = n()) %>%
 ungroup() %>%
 mutate(temp = dense_rank(desc(temp))) %>%
 filter(temp %in% c(1,2)) %>%
 select(-temp) %>%
 count(year, examName) %>% 
 ggplot(aes(x = year, y = n, group = examName)) +
 geom_line() +
 facet_wrap(~ examName)

It counts the cases per examName and ranks those counts. Then it keeps the cases whose exam has the largest or second-largest count.
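
One thing worth knowing about the ranking step: dense_rank() gives tied counts the same rank and leaves no gaps, so filtering on ranks 1 and 2 can keep more than two exams. A small illustration with made-up group sizes (the sizes vector is hypothetical, not taken from the question's data):

```r
library(dplyr)

# Hypothetical group sizes with a tie for second place
sizes <- c(15, 11, 11, 7)

dense_ranks <- dense_rank(desc(sizes))  # 1 2 2 3
min_ranks   <- min_rank(desc(sizes))    # 1 2 2 4

# Filtering on ranks 1 and 2 keeps three groups here, not two
kept <- sum(dense_ranks %in% c(1, 2))   # 3
```

In the question's data the two largest counts are distinct, so this does not change the plot there.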