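I am comparing the language used by several authors, using data downloaded from Project Gutenberg, but I am having trouble manipulating my tibble. My end goal is to plot a graph comparing the word frequencies of Herman Melville and Lewis Carroll against those of Washington Irving. However, my tibble has no Irving column, which causes a problem when I try to reference it in ggplot.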
I want my frequency tibble to look like this, with Irving in a column of its own:
# A tibble: 72,984 x 4
word Irving author proportion
<chr> <dbl> <chr> <dbl>
1 a'dale 0.00000907 Melville NA
2 aa NA Melville 0.0000246
3 ab NA Melville NA
4 aback NA Melville 0.0000369
5 abana NA Melville 0.0000123
6 abandon 0.0000363 Melville 0.0000861
7 abandoned 0.000163 Melville 0.000172
8 abandoning 0.0000181 Melville NA
9 abandonment 0.00000907 Melville 0.0000123
10 abasement 0.0000181 Melville 0.0000123
# ... with 72,974 more rows
But the result of my code looks like the following instead, and I'm not sure what I did wrong when gathering the columns to compute the frequencies.
# A tibble: 72,984 x 3
word author proportion
<chr> <chr> <dbl>
1 a'dale Melville NA
2 aa Melville 0.0000246
3 ab Melville NA
4 aback Melville 0.0000369
5 abana Melville 0.0000123
6 abandon Melville 0.0000861
7 abandoned Melville 0.000172
8 abandoning Melville NA
9 abandonment Melville 0.0000123
10 abasement Melville 0.0000123
# ... with 72,974 more rows
Answer 0 (score: 1)
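The problem is in how you are using gather(). The two columns you want to gather are not next to each other, so you don't want to select them with a : range, which also gathers every column that sits between them. Name the two columns explicitly instead: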
library(tidyr)    # gather() and the %>% pipe
library(ggplot2)
library(scales)   # percent_format()

# frequency_by_word_across_authors is the wide tibble from the question,
# with one proportion column per author (Carroll, Irving, Melville).
frequency <- frequency_by_word_across_authors %>%
  gather(author, proportion, Carroll, Melville)

# Plot each author's proportions against Irving's, one facet per author.
ggplot(frequency,
       aes(x = proportion,
           y = Irving,
           color = abs(Irving - proportion))) +
  geom_abline(color = "gray40",
              lty = 2) +
  geom_jitter(alpha = 0.1,
              size = 2.5,
              width = 0.3,
              height = 0.3) +
  geom_text(aes(label = word),
            check_overlap = TRUE,
            vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001),
                       low = "darkslategray4",
                       high = "gray75") +
  facet_wrap(~author, ncol = 2) +
  theme(legend.position = "none") +
  labs(y = "Washington Irving", x = NULL)
Created on 2019-11-01 by the reprex package (v0.3.0)
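To see the difference on something small, here is a minimal sketch using a made-up toy data frame. The name toy and its values are hypothetical, not the asker's data, and the sketch assumes the wide table has its author columns in the order Carroll, Irving, Melville. It compares gathering with a : range against naming the two columns explicitly.

```r
library(tidyr)

# Hypothetical toy data (not from the question): one row per word,
# one proportion column per author, with Irving between the other two.
toy <- data.frame(
  word     = c("abandon", "whale"),
  Carroll  = c(0.00002, NA),
  Irving   = c(0.00004, 0.00001),
  Melville = c(0.00009, 0.00600),
  stringsAsFactors = FALSE
)

# A range selects Carroll, Irving and Melville, so Irving is gathered
# into the author/proportion pair and stops being a column of its own.
by_range <- gather(toy, author, proportion, Carroll:Melville)
names(by_range)  # "word" "author" "proportion"

# Naming the two columns leaves Irving untouched as a separate column.
by_name <- gather(toy, author, proportion, Carroll, Melville)
names(by_name)   # "word" "Irving" "author" "proportion"
```

Only the second form keeps an Irving column that ggplot can map to the y axis.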