I'm trying to do some analysis and have hit a roadblock (more of a mental block)...
I have 3 different factor variables:
Cohort: Analyst, Associate, Manager, Sr. Manager, Director, ED, VP
Gender: Male, Female
Timeframe: Mid-Year, Year-End, Beyond
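I want to check whether Timeframe differs across Gender and Cohort. For example, are female Analysts more likely than male Analysts to end up with Timeframe = "Beyond"?
My initial idea was to do something like this: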
library(dplyr)

x <- df %>%
  filter(Gender %in% c("Male", "Female")) %>%
  filter(!is.na(Timeframe)) %>%   # unquoted: is.na("Timeframe") is always FALSE
  group_by(Timeframe, Cohort, Gender) %>%
  summarise(n = n()) %>%
  mutate(freq = 100 * (n / sum(n)))
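But this gives me percentages that don't make much sense. Ideally, I'd like to be able to conclude something like: "Within the Analyst cohort, there is (or is not) a large difference between genders across the Mid-Year, Year-End and Beyond timeframes."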
dput(head(df1,30))
structure(list(V1 = c("Female", "Male", "Male", "Male", "Male",
"Female", "Male", "Female", "Male", "Female", "Male", "Female",
"Male", "Female", "Female", "Female", "Male", "Female", "Female",
"Male", "Female", "Female", "Male", "Male", "Female", "Female",
"Male", "Male", "Female", "Female"), V2 = c("Executive Director",
"Executive", "Vice President", "Manager", "Director", "Executive Director",
"Manager", "Senior Manager", "Senior Manager", "Vice President",
"Director", "Senior Manager", "Manager", "Senior Manager", "Senior Manager",
"Senior Manager", "Executive Director", "Senior Manager", "Manager",
"Director", "Senior Manager", "Associate", "Vice President",
"Senior Manager", "Executive Director", "Manager", "Executive Director",
"Director", "Associate", "Senior Manager"), V3 = c("Beyond",
"Beyond", "Beyond", "Beyond", "Beyond", "Mid-Year Promotion",
"Beyond", "Year End Promotion", "Beyond", "Year End Promotion",
"Beyond", "Beyond", "Beyond", "Beyond", "Beyond", "Year End Promotion",
"Beyond", "Beyond", "Beyond", "Beyond", "Beyond", "Year End Promotion",
"Beyond", "Beyond", "Beyond", "Year End Promotion", "Beyond",
"Beyond", "Beyond", "Beyond")), row.names = c("1", "2", "4",
"5", "6", "7", "8", "10", "11", "12", "13", "14", "15", "16",
"17", "19", "21", "22", "23", "24", "25", "27", "28", "29", "30",
"31", "32", "33", "34", "35"), class = "data.frame")
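Answer 0 (score: 1)
EJJ is right in his comment: you need to ungroup after the summarise call. Otherwise the percentages are computed within each remaining group rather than over the overall total.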
x <- df %>%
  filter(Gender %in% c("Male", "Female")) %>%
  filter(!is.na(`Promotion Timeframe`)) %>%
  group_by(`Promotion Timeframe`, Management_Level, Gender) %>%
  dplyr::summarise(n = n()) %>%
  ungroup() %>%                        # drop the grouping left over after summarise()
  mutate(freq = 100 * (n / sum(n)))    # percentage of the overall total
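If the goal is instead to compare timeframes within each cohort and gender, rather than against the overall total, one option is to regroup before computing the percentage. The sketch below is an assumption about what the question is after, not part of EJJ's suggestion. It uses a small made-up data frame (`toy`, hypothetical) with the column names from the question.

```r
library(dplyr)

# Hypothetical toy data using the question's column names
toy <- data.frame(
  Gender    = c("Female", "Female", "Female", "Male", "Male", "Male", "Male"),
  Cohort    = "Analyst",
  Timeframe = c("Beyond", "Beyond", "Year-End",
                "Beyond", "Year-End", "Year-End", "Mid-Year")
)

within_group <- toy %>%
  count(Cohort, Gender, Timeframe) %>%   # one row per combination, ungrouped
  group_by(Cohort, Gender) %>%           # percentages are taken within these groups
  mutate(freq = 100 * n / sum(n)) %>%
  ungroup()

within_group
```

With this grouping, the `freq` values add up to 100 within every Cohort and Gender combination, so the "Beyond" share for female and male Analysts can be compared directly.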
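Answer 1 (score: 1)
I'm a real fan of 1 picture == 1000 words, so here are two ways to visualize this in R.
This approach uses the gganimate and ggplot2 packages together with cumulative sums and cumulative percentages. You can tweak parameters (e.g. nframes) to suit your taste.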
# Requires the libraries, df and dfcount defined further down
g <- ggplot(dfcount, aes(x = gender, y = c, fill = timeframe)) +
  geom_col(position = "identity") +
  labs(title = "Gender and Promotion at Goliath National Bank",
       subtitle = "Career level: {closest_state}",
       x = "Gender",
       y = "Number of employees",
       fill = "Time of promotion") +
  geom_label(aes(y = c, label = text)) +
  scale_fill_manual(values = c("#ABE188", "#F7EF99", "#F1BB87"),
                    guide = guide_legend(reverse = TRUE)) +
  transition_states(cohort, transition_length = 1, state_length = 3)

animate(g, nframes = 300)
set.seed(1701)

g <- c("Female", "Male")
c <- c("Analyst", "Associate", "Manager", "Senior Manager", "Director",
       "Executive Director", "Vice President")
t <- c("Mid-Year", "Year-End", "Beyond")

# Simulated data; prob gives the sampling weight of each level
df <- data.frame(
  gender = factor(sample(g, 1000, replace = TRUE, prob = c(0.39, 0.61)),
                  levels = g),
  cohort = factor(sample(c, 1000, replace = TRUE,
                         prob = c(0.29, 0.34, 0.14, 0.11, 0.07, 0.04, 0.01)),
                  levels = c),
  timeframe = factor(sample(t, 1000, replace = TRUE, prob = c(0.05, 0.35, 0.6)),
                     levels = t))
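library(dplyr)
library(ggplot2)
library(gganimate)

dfcount <- df %>%
  group_by(gender, cohort, timeframe) %>%
  summarize(n = n()) %>%               # result stays grouped by gender and cohort
  mutate(c = cumsum(n)) %>%            # cumulative count, used as the bar height
  mutate(perc = n / sum(n)) %>%        # share within each gender x cohort group
  mutate(cperc = cumsum(perc)) %>%     # cumulative share
  mutate(text = paste(round(perc * 100, 1), "%"))

# Put the tallest cumulative bar first so the shorter ones are drawn on top of it
dfcount <- dfcount[order(dfcount$cohort, dfcount$gender, desc(dfcount$c)), ]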
so that
> head(dfcount)
# A tibble: 6 x 8
# Groups:   gender, cohort [2]
  gender cohort  timeframe     n     c   perc  cperc text
  <fct>  <fct>   <fct>     <int> <int>  <dbl>  <dbl> <chr>
1 Female Analyst Beyond       73   126 0.579  1      57.9 %
2 Female Analyst Year-End     48    53 0.381  0.421  38.1 %
3 Female Analyst Mid-Year      5     5 0.0397 0.0397 4 %
4 Male   Analyst Beyond       95   172 0.552  1      55.2 %
5 Male   Analyst Year-End     70    77 0.407  0.448  40.7 %
6 Male   Analyst Mid-Year      7     7 0.0407 0.0407 4.1 %
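For a static picture without gganimate, a faceted bar chart may serve the same purpose. This is a minimal sketch rather than the answer's own method: `demo` is a hypothetical data frame with an arbitrary seed, and `position = "fill"` is swapped in for the cumulative-sum approach above so that every bar is scaled to 100 %.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed, for this illustration only
lv <- c("Mid-Year", "Year-End", "Beyond")

# Hypothetical data with the same column names as the simulated df
demo <- data.frame(
  gender    = sample(c("Female", "Male"), 200, replace = TRUE),
  cohort    = sample(c("Analyst", "Associate", "Manager"), 200, replace = TRUE),
  timeframe = factor(sample(lv, 200, replace = TRUE), levels = lv)
)

p <- ggplot(demo, aes(x = gender, fill = timeframe)) +
  geom_bar(position = "fill") +                   # stack each bar to 100 %
  facet_wrap(~ cohort) +                          # one panel per career level
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Gender", y = "Share of employees", fill = "Time of promotion")

print(p)
```

Each panel then shows the two genders side by side for one cohort, which makes differences in the "Beyond" share visible without animation.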
It can also be quite simple:
plot(table(df$gender, df$timeframe),
     main = "Gender vs. Timeframe",
     sub = paste("A comparison of the careers of",
                 nrow(subset(df, gender == "Female")), "women and",
                 nrow(subset(df, gender == "Male")), "men"),
     ylab = "Time of promotion")
Everything after the first line is optional. Obviously, you can make this plot prettier with ggplot2, waffle, or similar.
The simulated data (the same df built above) look like this:
> head(df)
gender cohort timeframe
1 Male Associate Year-End
2 Female Analyst Year-End
3 Male Manager Beyond
4 Male Associate Beyond
5 Female Associate Year-End
6 Male Manager Beyond
Answer 2 (score: 0)
Maybe you could check the frequency matrices like this:
table(df1[df1$V1 == "Male", 2:3])
table(df1[df1$V1 == "Female", 2:3])
This gives you a first impression of how the data are distributed. To investigate further, you could state the null hypothesis more precisely so that you can set up the right test. Take a look at, for example, Pearson's chi-squared test.
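A test of independence between gender and timeframe could be run roughly as follows. This is a sketch under assumptions: `sim` is hypothetical simulated data standing in for a single cohort (with the question's `df1` you would first subset on `V2`), and the seed and probabilities are arbitrary.

```r
set.seed(42)  # arbitrary seed, for this illustration only

# Hypothetical data for one cohort
sim <- data.frame(
  gender    = sample(c("Female", "Male"), 300, replace = TRUE),
  timeframe = sample(c("Mid-Year", "Year-End", "Beyond"), 300,
                     replace = TRUE, prob = c(0.1, 0.3, 0.6))
)

tab <- table(sim$gender, sim$timeframe)

# H0: within this cohort, timeframe is independent of gender
res <- chisq.test(tab)
res$expected   # check that the expected counts are not too small
res$p.value

# With sparse cells, a Monte Carlo p-value or Fisher's exact test may be safer
chisq.test(tab, simulate.p.value = TRUE, B = 2000)$p.value
fisher.test(tab)$p.value
```

In the 30-row sample shown in the question, most cohort-by-timeframe cells are nearly empty, so the exact or simulated variants are probably the more appropriate choice for data of that size.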