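The goal is to get the distribution of answers for males and females for each genre of book. I am using the data frame below to illustrate the problem. This is a survey dataset in which X1, X2, X3 and X4 are the answers to a given question. The data have been transformed so that each answer appears as a dummy variable.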
book_id,user_id,rate,X1,X2,X3,X4,Gender,genre
40,1,4.5,0,1,0,0,male,fiction
48,1,3.5,1,0,0,1,male,fiction
54,1,4,1,0,0,0,male,fiction
79,1,2.5,1,0,1,0,male,non-fiction
80,1,4.5,0,0,1,0,male,non-fiction
95,1,5,1,0,1,0,male,non-fiction
95,2,3,0,0,0,1,Female,non-fiction
99,2,4.5,0,0,1,0,Female,non-fiction
2,2,0.5,0,0,0,0,Female,non-fiction
5,2,4.5,1,0,1,0,Female,non-fiction
54,2,4,0,1,0,0,Female,fiction
79,2,2.5,1,0,1,0,Female,non-fiction
80,2,4.5,0,0,1,0,Female,non-fiction
7,2,4.5,1,0,1,0,Female,fiction
7,3,5,1,0,1,0,Female,fiction
9,3,4,0,0,1,0,Female,auto-bio
54,3,4,1,0,0,0,Female,fiction
79,3,2.5,1,0,1,0,Female,non-fiction
80,3,4.5,0,0,1,0,Female,non-fiction
17,4,3.5,1,0,0,0,male,auto-bio
21,4,5,1,0,1,0,male,auto-bio
21,5,5,0,1,1,0,male,auto-bio
17,5,0.5,0,0,0,1,male,auto-bio
20,5,5,0,0,1,0,male,fiction
20,6,1.5,0,0,0,1,male,fiction
21,6,5,0,0,1,0,male,auto-bio
21,7,2,1,0,0,0,male,auto-bio
21,8,4.5,1,0,1,0,Female,auto-bio
20,8,4.5,1,0,1,0,Female,fiction
7,8,4.5,1,0,1,0,Female,fiction
22,9,5,0,0,1,0,male,fiction
54,9,4,1,0,0,0,male,fiction
79,9,2.5,1,0,1,0,male,non-fiction
80,10,4.5,1,0,1,0,male,non-fiction
22,10,4.5,0,1,1,0,male,fiction
22,11,0.5,0,0,1,0,Female,fiction
28,11,3.5,1,0,0,0,Female,auto-bio
I used dplyr to group by genre and Gender, and then used summarize to aggregate the answers.
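library(dplyr)
library(reshape2)  # provides melt()

# Percentage of rows (one row per user/book rating) with each answer, per Gender and genre
df <- books %>%
  group_by(Gender, genre) %>%
  summarize(x1 = round(100 * sum(X1) / length(X1)),
            x2 = round(100 * sum(X2) / length(X2)),
            x3 = round(100 * sum(X3) / length(X3)),
            x4 = round(100 * sum(X4) / length(X4))) %>%
  as.data.frame() %>%
  melt(id.vars = c("Gender", "genre"))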
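My concern is that this does not actually give the unique number/percentage of books or users per genre for each answer. Or does it? Also, since I sum up the 1s inside summarize, I am not sure whether the result reflects the percentage of users or the percentage of books.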
library(ggplot2)

# Dodged bars of the answer percentages per genre, one panel per Gender
ggplot(df, aes(as.factor(genre), value, color = variable, fill = variable)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~ Gender)
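One way to check what the summarized percentage measures is to put the row count next to the distinct user and book counts for a single group. The sketch below assumes only the eight male/fiction rows from the sample data; `books_mf` and `check` are names made up for this illustration.

```r
library(dplyr)

# The eight male/fiction rows from the sample data (only the columns needed here)
books_mf <- data.frame(
  book_id = c(40, 48, 54, 20, 20, 22, 54, 22),
  user_id = c(1, 1, 1, 5, 6, 9, 9, 10),
  X1      = c(0, 1, 1, 0, 0, 0, 1, 0),
  Gender  = "male",
  genre   = "fiction"
)

check <- books_mf %>%
  group_by(Gender, genre) %>%
  summarise(
    n_rows      = n(),                  # one row per user/book rating
    n_users     = n_distinct(user_id),  # unique users in the group
    n_books     = n_distinct(book_id),  # unique books in the group
    pct_rows_x1 = 100 * mean(X1),       # same as 100 * sum(X1) / length(X1)
    .groups = "drop"
  )

print(check)
```

For this group there are 8 rows but only 5 distinct users and 5 distinct books, so `sum(X1) / length(X1)` appears to be a share of ratings (user/book pairs), not a share of users or of books.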
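How can I get the unique number of users and books per genre for each answer? Should I use a conditional inside summarize, something like select 'book_id' where X1 == 1, to get the unique IDs?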
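A possible approach, sketched here under the same assumptions (only the eight male/fiction rows, with the hypothetical names `books_mf` and `unique_counts`), is to subset the ID column inside `summarise` and count distinct values with `n_distinct`. Only X1 and X3 are shown; X2 and X4 would follow the same pattern.

```r
library(dplyr)

# The eight male/fiction rows from the sample data (rate column omitted)
books_mf <- data.frame(
  book_id = c(40, 48, 54, 20, 20, 22, 54, 22),
  user_id = c(1, 1, 1, 5, 6, 9, 9, 10),
  X1      = c(0, 1, 1, 0, 0, 0, 1, 0),
  X2      = c(1, 0, 0, 0, 0, 0, 0, 1),
  X3      = c(0, 0, 0, 1, 0, 1, 0, 1),
  X4      = c(0, 1, 0, 0, 1, 0, 0, 0),
  Gender  = "male",
  genre   = "fiction"
)

unique_counts <- books_mf %>%
  group_by(Gender, genre) %>%
  summarise(
    total_users  = n_distinct(user_id),
    total_books  = n_distinct(book_id),
    # Conditional subsetting: keep only the IDs on rows where the answer is 1
    x1_users     = n_distinct(user_id[X1 == 1]),
    x1_books     = n_distinct(book_id[X1 == 1]),
    x3_users     = n_distinct(user_id[X3 == 1]),
    x3_books     = n_distinct(book_id[X3 == 1]),
    # Share of the group's unique users who gave answer X1 at least once
    x1_pct_users = 100 * x1_users / total_users,
    .groups = "drop"
  )

print(unique_counts)
```

With the full data, replacing `books_mf` by `books` should give one row per Gender and genre. Note that a user who gave answer X1 for several books is counted once here, which is the difference from summing the dummy column.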