The dataset:
library(dplyr)
library(ggplot2)
gender <- c('Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Female', 'Female', 'Male', 'Female', 'Female', 'Male', 'Female', 'Female')
answer <- c('Yes', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes')
df <- data.frame(gender, answer)
It is skewed toward females:
df %>% ggplot(aes(gender, fill = gender)) + geom_bar()
My task is to build a chart that makes it easy to see which of the two genders is more likely to say 'Yes'.
However, given the skew, I cannot simply do
df %>% ggplot(aes(x = answer, fill = gender)) + geom_bar(position = 'dodge')
or even
df %>% ggplot(aes(x = answer, y = ..count../sum(..count..), fill = gender)) +
geom_bar(position = 'dodge')
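To correct for the skew, I need to divide each count by the total number of males or females respectively, so that the 'Female' bars sum to 1 and the 'Male' bars do as well. Like this: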
df.total <- df %>% count(gender)
male.total <- (df.total %>% filter(gender == 'Male'))$n
female.total <- (df.total %>% filter(gender == 'Female'))$n
df %>% count(answer, gender) %>%
mutate(freq = n/if_else(gender == 'Male', male.total, female.total)) %>%
ggplot(aes(x = answer, y = freq, fill = gender)) +
geom_bar(stat="identity", position = 'dodge')
This paints a completely different picture.
Question:
Can the previous code be simplified with dplyr and ggplot2? Thanks.
Answer 0 (score: 2)
Question 1:
df %>%
count(gender, answer) %>%
group_by(gender) %>%
mutate(freq = n/sum(n)) %>%
ggplot(aes(x = answer, y = freq, fill = gender)) +
geom_bar(stat="identity", position = 'dodge')
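As a sanity check on the frequencies plotted above, the same within-gender proportions can be computed in base R. This is only a sketch that assumes the `gender` and `answer` vectors from the question; the names `tab` and `freq` are introduced here for illustration and are not part of the original answer.

```r
# Data copied from the question
gender <- c('Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Male', 'Male', 'Male', 'Female',
            'Female', 'Female', 'Female', 'Female', 'Male', 'Female', 'Female', 'Male', 'Female', 'Female')
answer <- c('Yes', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'No', 'No', 'No',
            'No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes')

# 2 x 2 contingency table: rows are genders, columns are answers
tab <- table(gender, answer)

# margin = 1 divides each cell by its row total, so each gender's row sums to 1
freq <- prop.table(tab, margin = 1)

print(tab)
print(freq)
print(rowSums(freq))
```

The rows of `freq` correspond to what `group_by(gender) %>% mutate(freq = n/sum(n))` produces: 5 of 12 females and 4 of 8 males answered 'Yes'.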
Question 2:
You could do it in fewer lines using other packages.
Question 3:
A relative frequency bar chart.
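Answer 1 (score: 2)
Given the data, the most effective way to determine whether males or females are more likely to answer "Yes" to the question asked is to convert the answers to a binary variable and run a test for a difference in proportions.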
gender <- c('Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Female', 'Female', 'Male', 'Female', 'Female', 'Male', 'Female', 'Female')
answer <- c('Yes', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes')
isYes <- ifelse(answer=="Yes",1,0)
t.test(isYes ~ gender)
...and the output:
> t.test(isYes ~ gender)
Welch Two Sample t-test
data: isYes by gender
t = -0.34659, df = 14.749, p-value = 0.7338
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.5965761 0.4299094
sample estimates:
mean in group Female mean in group Male
0.4166667 0.5000000
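The t.test() output gives the same 'Yes' percentages as the weighted frequency chart, but the p-value of the test statistic (0.7338) indicates that we cannot reject the null hypothesis that males and females are equally likely to answer yes.
Another way to interpret the t.test() output is that, since 0 lies within the 95% confidence interval for the difference in means, we cannot reject the null hypothesis that the two group means are equal.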
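With only 20 observations, an exact test on the 2 x 2 table is one possible alternative to a t-test on a 0/1 variable. The sketch below swaps in base R's `fisher.test()`; it is not what the answer above uses, and the names `tab` and `res` are introduced here only for illustration.

```r
# Data copied from the question
gender <- c('Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Male', 'Male', 'Male', 'Female',
            'Female', 'Female', 'Female', 'Female', 'Male', 'Female', 'Female', 'Male', 'Female', 'Female')
answer <- c('Yes', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'No', 'No', 'No',
            'No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes')

# Cross-tabulate gender against answer
tab <- table(gender, answer)

# Fisher's exact test of independence between gender and answer
res <- fisher.test(tab)

print(res$p.value)   # two-sided p-value
print(res$conf.int)  # 95% confidence interval for the odds ratio
```

A large p-value here leads to the same conclusion as the t-test: the data give no evidence that the two genders differ in how likely they are to answer 'Yes'.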
Answer 2 (score: 2)
position = "fill" in geom_bar is very useful for looking at relative proportions:
library(ggplot2)
df <- data.frame(gender = c("Male", "Male", "Male", "Female", "Female", "Female", "Male", "Male", "Male", "Female", "Female", "Female", "Female", "Female", "Male", "Female", "Female", "Male", "Female", "Female"),
answer = c("Yes", "No", "Yes", "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes", "Yes", "Yes", "Yes", "No", "Yes"),
stringsAsFactors = FALSE)
ggplot(df, aes(gender, fill = answer)) + geom_bar(position = 'fill')