Gather function in R dropping columns

Time: 2019-10-31 15:57:19

Tags: r ggplot2 tidytext

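I am comparing the language used by several authors, using data downloaded from the Project Gutenberg site, but I am having trouble with some tibble manipulation. My end goal is to plot a graph comparing the word frequencies of Herman Melville and Lewis Carroll against those of Washington Irving. However, my tibble does not have an Irving column, which is a problem when I try to refer to it in ggplot.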

I want my frequency tibble to look like

# A tibble: 72,984 x 4
   word             Irving author   proportion
   <chr>             <dbl> <chr>         <dbl>
 1 a'dale       0.00000907 Melville NA
 2 aa          NA          Melville  0.0000246
 3 ab          NA          Melville NA
 4 aback       NA          Melville  0.0000369
 5 abana       NA          Melville  0.0000123
 6 abandon      0.0000363  Melville  0.0000861
 7 abandoned    0.000163   Melville  0.000172
 8 abandoning   0.0000181  Melville NA
 9 abandonment  0.00000907 Melville  0.0000123
10 abasement    0.0000181  Melville  0.0000123
# ... with 72,974 more rows

but it looks like

# A tibble: 72,984 x 3
   word        author   proportion
   <chr>       <chr>         <dbl>
 1 a'dale      Melville NA
 2 aa          Melville  0.0000246
 3 ab          Melville NA
 4 aback       Melville  0.0000369
 5 abana       Melville  0.0000123
 6 abandon     Melville  0.0000861
 7 abandoned   Melville  0.000172
 8 abandoning  Melville NA
 9 abandonment Melville  0.0000123
10 abasement   Melville  0.0000123
# ... with 72,974 more rows

and I am not sure what I am doing wrong when I gather the columns to build frequency.

1 answer:

Answer 0 (score: 1)

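The issue is how you are using gather(). The two columns you want to gather, Carroll and Melville, are not next to each other, so you don't want to select them as a range: a range would also pick up the Irving column that sits between them. Name the two columns explicitly instead: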

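library(tidyr)    # gather(), %>%
library(ggplot2)
library(scales)   # percent_format()

# frequency_by_word_across_authors is assumed to be the wide table from the
# question: one row per word, with Carroll, Irving and Melville columns.
frequency <- frequency_by_word_across_authors %>%
  gather(author, proportion, Carroll, Melville)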


ggplot(frequency,
       aes(x = proportion,
           y = Irving,
           color = abs(Irving - proportion))) +
  geom_abline(color = "gray40", 
              lty = 2) +
  geom_jitter(alpha = 0.1, 
              size = 2.5,
              width = 0.3, 
              height = 0.3) +
  geom_text(aes(label = word),
            check_overlap = TRUE, 
            vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001),
                       low = "darkslategray4",
                       high = "gray75") +
  facet_wrap(~author, ncol = 2) +
  theme(legend.position="none") +
  labs(y = "Washington Irving", x = NULL)

Created on 2019-11-01 by the reprex package (v0.3.0)
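
To see the difference on something small, here is a minimal sketch with a toy table. The table name wide and all of its values are made up for illustration; only the column names Carroll, Irving and Melville come from the question, and the column order is an assumption.

```r
library(tidyr)
library(tibble)

# Toy stand-in for the wide word-frequency table (values are invented).
# Irving is assumed to sit between Carroll and Melville.
wide <- tibble(
  word     = c("abandon", "whale"),
  Carroll  = c(0.1, NA),
  Irving   = c(0.2, 0.3),
  Melville = c(0.4, 0.5)
)

# Range selection: every column from Carroll through Melville is gathered,
# so Irving is folded into author/proportion and disappears as a column.
by_range <- gather(wide, author, proportion, Carroll:Melville)

# Explicit selection: only Carroll and Melville are gathered,
# and Irving survives as its own column.
by_name <- gather(wide, author, proportion, Carroll, Melville)

names(by_range)  # "word" "author" "proportion"
names(by_name)   # "word" "Irving" "author" "proportion"
```

With the explicit selection, the result has the same four-column shape as the desired output in the question, so aes(y = Irving) can find the column it needs.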