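Best way to analyze the correlation between 3 different categorical variables

Time: 2018-10-16 14:07:41

Tags: r statistics

I am trying to run some analysis and have hit a roadblock (more of a mental block)...

Goal

I have 3 different factor variables:

  • Cohort: Analyst, Associate, Manager, Sr. Manager, Director, ED, VP
  • Gender: Male, Female
  • Timeframe: Mid-Year, Year-End, Beyond

I want to check whether there are any differences across Gender, Cohort, and Timeframe. For example, are female analysts more likely than male analysts to end up with Timeframe = "Beyond"?

Code

My initial thought was to do something like this: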

library(dplyr)
x <- df %>% 
    filter(Gender %in% c("Male","Female")) %>% 
    filter(!is.na(Timeframe)) %>% 
    group_by(Timeframe, Cohort, Gender) %>% 
    summarise(n = n()) %>% 
    mutate(freq = 100 * (n / sum(n)))

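But this gives me percentages that do not make much sense. Ideally, I would like to reach a conclusion such as: "Within the Analyst cohort, there is (or is not) a large difference between genders across the Mid-Year, Year-End, and Beyond timeframes."

Data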

dput(head(df1,30))
structure(list(V1 = c("Female", "Male", "Male", "Male", "Male", 
"Female", "Male", "Female", "Male", "Female", "Male", "Female", 
"Male", "Female", "Female", "Female", "Male", "Female", "Female", 
"Male", "Female", "Female", "Male", "Male", "Female", "Female", 
"Male", "Male", "Female", "Female"), V2 = c("Executive Director", 
"Executive", "Vice President", "Manager", "Director", "Executive Director", 
"Manager", "Senior Manager", "Senior Manager", "Vice President", 
"Director", "Senior Manager", "Manager", "Senior Manager", "Senior Manager", 
"Senior Manager", "Executive Director", "Senior Manager", "Manager", 
"Director", "Senior Manager", "Associate", "Vice President", 
"Senior Manager", "Executive Director", "Manager", "Executive Director", 
"Director", "Associate", "Senior Manager"), V3 = c("Beyond", 
"Beyond", "Beyond", "Beyond", "Beyond", "Mid-Year Promotion", 
"Beyond", "Year End Promotion", "Beyond", "Year End Promotion", 
"Beyond", "Beyond", "Beyond", "Beyond", "Beyond", "Year End Promotion", 
"Beyond", "Beyond", "Beyond", "Beyond", "Beyond", "Year End Promotion", 
"Beyond", "Beyond", "Beyond", "Year End Promotion", "Beyond", 
"Beyond", "Beyond", "Beyond")), row.names = c("1", "2", "4", 
"5", "6", "7", "8", "10", "11", "12", "13", "14", "15", "16", 
"17", "19", "21", "22", "23", "24", "25", "27", "28", "29", "30", 
"31", "32", "33", "34", "35"), class = "data.frame")

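3 Answers:

Answer 0 (score: 1)

EJJ is right in his comment: you need to ungroup after the summarise function. Otherwise, you compute percentages within each group rather than percentages of the total.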

x=df %>% filter(Gender %in% c('Male',"Female")) %>% 
filter(!is.na(`Promotion Timeframe`)) %>% 
group_by(`Promotion Timeframe`,Management_Level,Gender) %>% 
dplyr::summarise(n=n()) %>% 
ungroup() %>%
mutate(freq = 100* (n/sum(n)))
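
To make the difference between the two kinds of percentage concrete, here is a minimal base-R sketch. The toy data frame and its values are invented for illustration and are not the asker's data. It contrasts the share of all rows (what the ungrouped version computes) with the share within each gender, which is probably closer to the comparison the question asks for.

```r
# Toy data, invented purely for illustration
toy <- data.frame(
  Gender    = c("Female", "Female", "Female",
                "Male", "Male", "Male", "Male", "Male"),
  Timeframe = c("Beyond", "Beyond", "Year-End",
                "Beyond", "Year-End", "Year-End", "Year-End", "Beyond")
)

# Contingency table: rows are genders, columns are timeframes
tab <- table(toy$Gender, toy$Timeframe)

# Percent of all rows: all cells together sum to 100
overall <- 100 * prop.table(tab)

# Percent within each gender: every row sums to 100
by_gender <- 100 * prop.table(tab, margin = 1)

print(round(overall, 1))
print(round(by_gender, 1))
```

With real data, the same two calls could be applied to one cohort at a time, for example after `subset()`.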

Answer 1 (score: 1)

I am a big fan of "1 picture == 1000 words", so here are two ways to visualize this in R.

1. The advanced approach

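This approach uses the gganimate and ggplot2 packages together with cumulative sums and cumulative percentages. You can adjust parameters (e.g. nframes) to your liking.

Code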

g <- ggplot(dfcount, aes(x = gender, y = c, fill = timeframe)) +
    geom_col(position = "identity") +
    labs(title = "Gender and Promotion at Goliath National Bank",
         subtitle = "Career level: {closest_state}", 
         x = "Gender",
         y = "Number of employees",
         fill = "Time of promotion") +
    geom_label(aes(y = c, label = text)) +
    scale_fill_manual(values = c("#ABE188", "#F7EF99", "#F1BB87"), 
                      guide = guide_legend(reverse = TRUE)) + 
    transition_states(cohort, transition_length = 1, state_length = 3)
animate(g, nframes = 300)

Data

set.seed(1701)

g <- c("Female", "Male")
c <- c("Analyst", "Associate", "Manager", "Senior Manager", "Director",
    "Executive Director", "Vice President")
t <- c("Mid-Year", "Year-End", "Beyond")

df <- data.frame(
    gender = factor(sample(g, 1000, c(0.39, 0.61),
        replace = TRUE), levels = g), 
    cohort = factor(sample(c, 1000, c(0.29, 0.34, 0.14, 0.11, 0.07, 0.04, 0.01), 
        replace = TRUE), levels = c),
    timeframe = factor(sample(t, 1000, c(0.05, 0.35, 0.6), 
        replace = TRUE), levels = t))

library(dplyr)
library(ggplot2)
library(gganimate)
dfcount <- df %>% 
    group_by(gender, cohort, timeframe) %>%           
    summarize(n = n()) %>% 
    mutate(c = cumsum(n)) %>%          # cumulative count, used as y in the plot
    mutate(perc = n / sum(n)) %>%
    mutate(cperc = cumsum(perc)) %>%   # cumulative share within gender and cohort
    mutate(text = paste(round(perc*100, 1), "%"))

dfcount <- dfcount[order(dfcount$cohort, dfcount$gender, desc(dfcount$c)), ]

which gives

> head(dfcount)
# A tibble: 6 x 8
# Groups:   gender, cohort [2]
  gender cohort  timeframe     n     c   perc  cperc text  
  <fct>  <fct>   <fct>     <int> <int>  <dbl>  <dbl> <chr> 
1 Female Analyst Beyond       73   126 0.579  1      57.9 %
2 Female Analyst Year-End     48    53 0.381  0.421  38.1 %
3 Female Analyst Mid-Year      5     5 0.0397 0.0397 4 %   
4 Male   Analyst Beyond       95   172 0.552  1      55.2 %
5 Male   Analyst Year-End     70    77 0.407  0.448  40.7 %
6 Male   Analyst Mid-Year      7     7 0.0407 0.0407 4.1 % 

2. The simple approach

It can also be done very simply:

Code

plot(table(df$gender, df$timeframe), 
     main = "Gender vs. Timeframe",
     sub = paste("A comparison of the careers of",
         sum(df$gender == "Female"), "women and",
         sum(df$gender == "Male"), "men"), 
     ylab = "Time of promotion")

Everything after the first line is optional. Obviously, you can use ggplot2, waffle, or something similar to make this plot prettier.

Data

set.seed(1701)

g <- c("Female", "Male")
c <- c("Analyst", "Associate", "Manager", "Senior Manager", "Director",
    "Executive Director", "Vice President")
t <- c("Mid-Year", "Year-End", "Beyond")

df <- data.frame(
    gender = factor(sample(g, 1000, c(0.39, 0.61),
        replace = TRUE), levels = g), 
    cohort = factor(sample(c, 1000, c(0.29, 0.34, 0.14, 0.11, 0.07, 0.04, 0.01), 
        replace = TRUE), levels = c),
    timeframe = factor(sample(t, 1000, c(0.05, 0.35, 0.6), 
        replace = TRUE), levels = t))

which gives

> head(df)
  gender    cohort timeframe
1   Male Associate  Year-End
2 Female   Analyst  Year-End
3   Male   Manager    Beyond
4   Male Associate    Beyond
5 Female Associate  Year-End
6   Male   Manager    Beyond
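
Since the question asks about differences within a single cohort, the same kind of mosaic plot could be restricted to one cohort. The following is only a sketch: it simulates its own data in the same shape as above (the sample size and probabilities are assumptions) so that it runs on its own.

```r
set.seed(1701)

# Simulated data in the same shape as above (values are assumptions)
g  <- c("Female", "Male")
co <- c("Analyst", "Associate", "Manager")
tf <- c("Mid-Year", "Year-End", "Beyond")

df <- data.frame(
  gender    = factor(sample(g, 600, replace = TRUE, prob = c(0.4, 0.6)),
                     levels = g),
  cohort    = factor(sample(co, 600, replace = TRUE), levels = co),
  timeframe = factor(sample(tf, 600, replace = TRUE, prob = c(0.1, 0.3, 0.6)),
                     levels = tf)
)

# Keep one cohort only, then draw gender against timeframe
analysts <- subset(df, cohort == "Analyst")
mosaicplot(~ gender + timeframe, data = analysts,
           main = "Analysts: Gender vs. Timeframe", color = TRUE)
```

Looping over `levels(df$cohort)` would produce one such plot per cohort.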

Answer 2 (score: 0)

Maybe you could inspect the frequency tables like this:

 table(df1[df1$V1=="Male",2:3])
 table(df1[df1$V1=="Female",2:3])

This gives you a first impression of how the data are distributed. For further investigation, you could state your null hypothesis more precisely in order to set up the right test. Have a look at, for example, Pearson's chi-squared test.
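
As a rough sketch of how such a test could be set up: the snippet below simulates data (the column names, sample size, and probabilities are assumptions, not the asker's data), restricts it to one cohort, and runs Pearson's chi-squared test of independence between gender and timeframe.

```r
set.seed(1701)

# Simulated stand-in for the real data (all values are assumptions)
sim <- data.frame(
  gender    = sample(c("Female", "Male"), 500, replace = TRUE,
                     prob = c(0.4, 0.6)),
  cohort    = sample(c("Analyst", "Associate"), 500, replace = TRUE),
  timeframe = sample(c("Mid-Year", "Year-End", "Beyond"), 500, replace = TRUE,
                     prob = c(0.2, 0.3, 0.5))
)

# Null hypothesis: within the Analyst cohort, timeframe is independent of gender
analysts <- subset(sim, cohort == "Analyst")
tab <- table(analysts$gender, analysts$timeframe)

res <- chisq.test(tab)
print(tab)
print(res$expected)  # expected counts under independence
print(res$p.value)   # a small p-value would speak against independence
```

If some expected counts are very small, as can happen in the senior cohorts, `fisher.test(tab)` may be the safer choice. Repeating the test for every cohort also raises a multiple-testing issue that is worth keeping in mind.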