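So I'm using the 'babynames' package in RStudio and trying to get the 35 most common unisex names. I'm trying to rank the names by their mean squared error from the 50-50 line (however, I'm not sure how to do this). Any help would be greatly appreciated! (Also, below my code I'll put the "reference code" I was given, which contains the top 35 unisex names.)
Reference code: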
actual_names <- c("Jessie", "Marion", "Jackie", "Alva", "Ollie",
"Jody", "Cleo", "Kerry", "Frankie", "Guadalupe",
"Carey", "Tommie", "Angel", "Hollis", "Sammie",
"Jamie", "Kris", "Robbie", "Tracy", "Merrill",
"Noel", "Rene", "Johnnie", "Ariel", "Jan",
"Devon", "Cruz", "Michel", "Gale", "Robin",
"Dorian", "Casey", "Dana", "Kim", "Shannon")
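Answer 0 (score: 0)
I think there are several ways to answer the question as posed, because there is a trade-off between "most popular" and "most unisex".
Here is one way to prepare the data and collect some statistics for each name.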
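library(babynames)
library(tidyverse)
babynames_share <-
  babynames %>%
  filter(year >= 1930, year <= 2012) %>%
  # total births per name and sex across the selected years
  count(name, sex, wt = n) %>%
  # one row per name, with columns F and M (0 where a sex is absent)
  spread(sex, n, fill = 0) %>%
  mutate(Total = F + M,
         F_share = F / Total,
         # mean squared deviation of the female and male shares from 0.5;
         # the male share is 1 - F_share, so both terms are equal
         MS_50 = ((F_share-0.5)^2 +
                    (0.5-F_share)^2) / 2)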
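To make the MS_50 formula concrete, here is a minimal base R sketch on a toy data frame. The names and counts are hypothetical and are not taken from babynames. It shows that, because the male share is 1 - F_share, the two-term mean reduces to the single squared deviation (F_share - 0.5)^2.

```r
# Hypothetical counts, for illustration only (not taken from babynames)
toy <- data.frame(
  name = c("Alpha", "Beta", "Gamma"),
  F = c(50, 80, 0),
  M = c(50, 20, 100),
  stringsAsFactors = FALSE
)
toy$Total <- toy$F + toy$M
toy$F_share <- toy$F / toy$Total
toy$M_share <- toy$M / toy$Total

# Mean of the squared deviations of both shares from 0.5
toy$MS_50 <- ((toy$F_share - 0.5)^2 + (toy$M_share - 0.5)^2) / 2

# Because M_share = 1 - F_share, this equals the single squared deviation
toy$MS_50_short <- (toy$F_share - 0.5)^2

print(toy)
```

An evenly split name scores 0, an 80/20 name scores 0.09, and a name used by only one sex scores the maximum of 0.25.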
It looks like about 100 names have perfect gender parity, but they are all rare:
babynames_share %>%
filter(F == M) %>%
arrange(-Total)
# A tibble: 100 x 6
   name         F     M Total F_share MS_50
   <chr>    <dbl> <dbl> <dbl>   <dbl> <dbl>
 1 Tyjae      157   157   314     0.5     0
 2 Callaway   128   128   256     0.5     0
 3 Avyn       100   100   200     0.5     0
 4 Zarin       92    92   184     0.5     0
 5 Tkai        72    72   144     0.5     0
 6 Rayen       57    57   114     0.5     0
 7 Meco        43    43    86     0.5     0
 8 Pele        40    40    80     0.5     0
 9 Nijay       35    35    70     0.5     0
10 Mako        27    27    54     0.5     0
# … with 90 more rows
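Or we could pick an arbitrary threshold for what counts as unisex. In the example above, I calculated the mean squared error of the male and female percentage shares relative to 50%. We can plot this so that strongly gendered names appear at the top (by this measure, MS_50 peaks at 0.25) and unisex names appear at the bottom. But it isn't clear to me where we should draw the line for calling a name unisex. Is Casey, which is 58.9% male and therefore has a mean squared error of (8.9%)^2 = 0.79%, unisex? Or do we need to go as far as Jessie, which is 50.8% male?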
babynames_share %>%
  ggplot(data = .,
         aes(Total, MS_50, label = name)) +
  geom_point(size = 0.2, alpha = 0.1, color = "gray30") +
  geom_text(data = . %>% filter(Total > 10000),
            check_overlap = TRUE, size = 3) +
  scale_x_log10(breaks = c(10^(1:7)),
                labels = scales::comma)
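As a quick check on the threshold arithmetic for Casey and Jessie, the sketch below converts a majority-sex share into the corresponding MS_50 value. The helper name `ms_50_from_share` is hypothetical and introduced only for illustration; the shares are the ones quoted in this answer.

```r
# Hypothetical helper: MS_50 implied by one sex's share of a name
ms_50_from_share <- function(share) (share - 0.5)^2

casey_cutoff  <- ms_50_from_share(0.589)  # 58.9% male, as quoted in the answer
jessie_cutoff <- ms_50_from_share(0.508)  # 50.8% male, as quoted in the answer
max_possible  <- ms_50_from_share(1)      # a name used by one sex only

print(c(casey = casey_cutoff, jessie = jessie_cutoff, max = max_possible))
```

Casey works out to about 0.0079 and Jessie to about 0.00006, so a cutoff at the Jessie level would be more than a hundred times stricter than one at the Casey level.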
At the "Casey" level of gender parity, here are the top 35:
unisex_names <- babynames_share %>%
  filter(MS_50 <= 0.00796) %>%
  arrange(-Total) %>%
  top_n(35, wt = Total)
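The filter above fixes a parity cutoff and then ranks by popularity. The opposite trade-off, which is closer to the question's wording of ranking by mean squared error, is to fix a popularity floor and then rank by MS_50. The following is a minimal base R sketch of that alternative, not the method used in this answer. The function `rank_unisex`, its `min_total` default, and the toy counts are all assumptions made for illustration; it expects a data frame with `name`, `Total` and `MS_50` columns, such as `babynames_share`.

```r
# Hypothetical helper: keep names with at least `min_total` births,
# then return the `n` names closest to a 50-50 split (smallest MS_50).
# Ties on MS_50 are broken in favour of the more popular name.
rank_unisex <- function(shares, min_total = 10000, n = 35) {
  popular <- shares[shares$Total >= min_total, ]
  ranked <- popular[order(popular$MS_50, -popular$Total), ]
  head(ranked, n)
}

# Toy data with made-up counts (not taken from babynames)
toy <- data.frame(
  name = c("A", "B", "C", "D"),
  F = c(500, 9000, 60, 5200),
  M = c(500, 1000, 40, 4800),
  stringsAsFactors = FALSE
)
toy$Total <- toy$F + toy$M
toy$F_share <- toy$F / toy$Total
toy$MS_50 <- (toy$F_share - 0.5)^2

# "C" is dropped by the popularity floor; "A" and "D" are the most balanced
result <- rank_unisex(toy, min_total = 1000, n = 2)
print(result$name)
```

On the real data, a call along the lines of `rank_unisex(babynames_share)` would apply the same idea, and the choice of `min_total` then plays the role that the MS_50 cutoff plays in the filter above.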
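It's also interesting to see the whole range of names, with mostly male names at the bottom, mostly female names at the top, and unisex names in the middle: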
babynames_share %>%
  ggplot(data = .,
         aes(Total, F_share, label = name)) +
  geom_point(size = 0.2, alpha = 0.1, color = "gray30") +
  geom_text(data = . %>% filter(Total > 10000),
            check_overlap = TRUE, size = 2) +
  scale_x_log10(breaks = c(10^(1:7)),
                labels = scales::comma)