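Preserving the order of input variables and factor levels in a summary table with dplyr and tidyr

Asked: 2016-08-26 01:38:41

Tags: r dplyr tidyr

I love how dplyr and tidyr make it easy to create a single summary table with multiple predictor and outcome variables. One thing that stumps me is the final step: preserving or defining the order of the predictor variables and their factor levels in the output table.

I came up with a solution (below) that uses mutate to manually create a factor variable combining the predictor variable and the predictor value (e.g. "gender_female"), with its levels in the desired output order. But my solution gets long if there are many variables, and I am wondering whether there is a better way?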

library(dplyr)
library(tidyr)
levels_eth <- c("Maori", "Pacific", "Asian", "Other", "European", "Unknown")
levels_gnd <- c("Female", "Male", "Unknown")

set.seed(1234)

dat <- data.frame(
  gender    = factor(sample(levels_gnd, 100, replace = TRUE), levels = levels_gnd),
  ethnicity = factor(sample(levels_eth, 100, replace = TRUE), levels = levels_eth),
  outcome1  = sample(c(TRUE, FALSE), 100, replace = TRUE),
  outcome2  = sample(c(TRUE, FALSE), 100, replace = TRUE)
)

dat %>% 
  gather(key = outcome, value = outcome_value, contains("outcome")) %>%
  gather(key = predictor, value = pred_value, gender, ethnicity) %>%
  # Statement below creates variable for ordering output
  mutate(
    pred_ord = factor(interaction(predictor, addNA(pred_value), sep = "_"),
                      levels = c(paste("gender", levels(addNA(dat$gender)), sep = "_"),
                                 paste("ethnicity", levels(addNA(dat$ethnicity)), sep = "_")))
  ) %>%
  group_by(pred_ord, outcome) %>%
  summarise(n = sum(outcome_value, na.rm = TRUE)) %>%
  ungroup() %>%
  spread(key = outcome, value = n) %>%
  separate(pred_ord, c("Predictor", "Pred_value"))

Source: local data frame [9 x 4]

  Predictor Pred_value outcome1 outcome2
      (chr)      (chr)    (int)    (int)
1    gender     Female       25       27
2    gender       Male       11       10
3    gender    Unknown       12       15
4 ethnicity      Maori       10        9
5 ethnicity    Pacific        7        7
6 ethnicity      Asian        6       12
7 ethnicity      Other       10        9
8 ethnicity   European        5        4
9 ethnicity    Unknown       10       11
Warning message:
attributes are not identical across measure variables; they will be dropped 

The table above is correct, in that neither Predictor nor the predictor values are sorted alphabetically.
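With many predictors, the levels vector in the mutate call above does not have to be typed out by hand. The following is a minimal base R sketch, assuming a hypothetical helper named make_pred_levels (not part of dplyr or tidyr) and a small toy data frame rather than the simulated data used in the question:

```r
# Toy data, for illustration only (not the simulated data from the question)
toy <- data.frame(
  gender    = factor(c("Male", "Female", NA), levels = c("Female", "Male", "Unknown")),
  ethnicity = factor(c("Asian", "Maori", "Asian"), levels = c("Maori", "Asian"))
)

# Hypothetical helper: build "<predictor>_<level>" labels, keeping the order
# of `predictors` and, within each predictor, the order of its factor levels.
# addNA() appends an NA level, mirroring the mutate() call in the question.
make_pred_levels <- function(data, predictors) {
  unlist(lapply(predictors, function(p) {
    paste(p, levels(addNA(data[[p]])), sep = "_")
  }))
}

pred_levels <- make_pred_levels(toy, c("gender", "ethnicity"))
print(pred_levels)
```

The resulting vector could then be passed as levels = pred_levels in place of the two paste() calls.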

Edit

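As requested, this is what gets produced with the default ordering (alphabetical). That makes sense: when the factors are combined, they are converted to character variables and all attributes are dropped.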

dat %>% 
  gather(key = outcome, value = outcome_value, contains("outcome")) %>%
  gather(key = predictor, value = pred_value, gender, ethnicity) %>%
  group_by(predictor, pred_value, outcome) %>%
  summarise(n = sum(outcome_value, na.rm = TRUE)) %>%
  spread(key = outcome, value = n)

Source: local data frame [9 x 4]

  predictor pred_value outcome1 outcome2
      (chr)      (chr)    (int)    (int)
1 ethnicity      Asian        6       12
2 ethnicity   European        5        4
3 ethnicity      Maori       10        9
4 ethnicity      Other       10        9
5 ethnicity    Pacific        7        7
6 ethnicity    Unknown       10       11
7    gender     Female       25       27
8    gender       Male       11       10
9    gender    Unknown       12       15
Warning message:
attributes are not identical across measure variables; they will be dropped 
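The alphabetical order follows from that coercion: a factor sorts by its level order, while a character vector sorts alphabetically. A small base R sketch of the difference, using a made-up vector of ethnicity values:

```r
levels_eth <- c("Maori", "Pacific", "Asian", "Other", "European", "Unknown")

# Illustrative values only
eth_fct <- factor(c("Other", "Asian", "Maori", "European"), levels = levels_eth)

# Sorting the factor respects the level order
sorted_fct <- as.character(sort(eth_fct))

# Sorting after coercion to character is alphabetical
sorted_chr <- sort(as.character(eth_fct))

print(sorted_fct)  # "Maori" "Asian" "Other" "European"
print(sorted_chr)  # "Asian" "European" "Maori" "Other"
```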

3 answers:

Answer 0 (score: 10)

If you want the data arranged in factor order like this, you need to convert the columns back to factors, because gather coerces them to character (and warns you about it). You can use gather's factor_key parameter to take care of predictor, but you need to assemble the levels for pred_value yourself, since it now combines the two original factors. Simplifying a bit:

library(tidyr)
library(dplyr)

dat %>% 
  gather(key = predictor, value = pred_value, gender, ethnicity, factor_key = TRUE) %>% 
  group_by(predictor, pred_value) %>% 
  summarise_all(sum) %>% 
  ungroup() %>% 
  mutate(pred_value = factor(pred_value, 
                             levels = unique(c(levels_eth, levels_gnd), 
                                             fromLast = TRUE))) %>% 
  arrange(predictor, pred_value)

## # A tibble: 9 × 4
##   predictor pred_value outcome1 outcome2
##      <fctr>     <fctr>    <int>    <int>
## 1    gender     Female       25       27
## 2    gender       Male       11       10
## 3    gender    Unknown       12       15
## 4 ethnicity      Maori       10        9
## 5 ethnicity    Pacific        7        7
## 6 ethnicity      Asian        6       12
## 7 ethnicity      Other       10        9
## 8 ethnicity   European        5        4
## 9 ethnicity    Unknown       10       11

Note that you need unique with fromLast = TRUE to put the duplicated "Unknown" value in the right place; union would put it earlier.
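To see why fromLast = TRUE matters here, the two ways of merging the level vectors can be compared directly in base R. This is only a sketch of the level ordering, not of the full pipeline:

```r
levels_eth <- c("Maori", "Pacific", "Asian", "Other", "European", "Unknown")
levels_gnd <- c("Female", "Male", "Unknown")

# Keeps the LAST occurrence of "Unknown", so it lands after "Male"
lv_from_last <- unique(c(levels_eth, levels_gnd), fromLast = TRUE)

# Keeps the FIRST occurrence of "Unknown", so it lands before "Female"
lv_union <- union(levels_eth, levels_gnd)

print(lv_from_last)
# "Maori" "Pacific" "Asian" "Other" "European" "Female" "Male" "Unknown"
print(lv_union)
# "Maori" "Pacific" "Asian" "Other" "European" "Unknown" "Female" "Male"
```

With the union ordering, the gender rows would sort as Unknown, Female, Male, which is not the order defined in levels_gnd.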

Answer 1 (score: 4)

You can do this more concisely and efficiently without any special packages:

rbind(aggregate(dat[,colnames(dat) %in% c("outcome1", "outcome2")], 
                by = list(dat$gender), sum),
      aggregate(dat[,colnames(dat) %in% c("outcome1", "outcome2")], 
                by = list(dat$ethnicity), sum))

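It aggregates multiple predictor and outcome variables in a simple, direct way, and it also avoids having to create the extra variable that is part of the more complicated solution you mentioned.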

   Group.1 outcome1 outcome2
1   Female       25       27
2     Male       11       10
3  Unknown       12       15
4    Maori       10        9
5  Pacific        7        7
6    Asian        6       12
7    Other       10        9
8 European        5        4
9  Unknown       10       11

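If you want to rename the columns above, just assign the result to an object (e.g. mytable <-) and rename them (i.e. colnames(mytable) <- c("Pred_value", "outcome1", "outcome2")). If there are too many variables to type out, you can also scale this up with apply.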
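One way the scaling-up might look is sketched below with lapply, assuming small toy data and illustrative names (predictors, outcomes, per_predictor, result) that are not from the answer. It also adds a Predictor column so the two "Unknown"-style rows stay distinguishable:

```r
# Toy data, for illustration only
toy <- data.frame(
  gender    = factor(c("Male", "Female", "Unknown", "Female"),
                     levels = c("Female", "Male", "Unknown")),
  ethnicity = factor(c("Other", "Maori", "Maori", "Asian"),
                     levels = c("Maori", "Asian", "Other")),
  outcome1  = c(TRUE, TRUE, FALSE, TRUE),
  outcome2  = c(FALSE, TRUE, TRUE, TRUE)
)

predictors <- c("gender", "ethnicity")   # order here sets the row-block order
outcomes   <- c("outcome1", "outcome2")

# One aggregate() call per predictor; groups come out in factor level order
per_predictor <- lapply(predictors, function(p) {
  agg <- aggregate(toy[outcomes], by = list(Pred_value = toy[[p]]), FUN = sum)
  cbind(Predictor = p, agg)
})

result <- do.call(rbind, per_predictor)
print(result)
```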

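Answer 2 (score: 0)

You can add a prefix to the variable names to force the variables to appear in the right order, e.g. "X1_gender", "X2_ethnicity". The prefix can be stripped at the end with mutate. This may not be a "tidy" solution, but it worked for my purposes on the problem that led me to this post.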

library(dplyr)
library(tidyr)
levels_eth <- c("Maori", "Pacific", "Asian", "Other", "European", "Unknown")
levels_gnd <- c("Female", "Male", "Unknown")

set.seed(1234)

dat <- data.frame(
  X1_gender    = factor(sample(levels_gnd, 100, replace = TRUE), levels = levels_gnd),
  X2_ethnicity = factor(sample(levels_eth, 100, replace = TRUE), levels = levels_eth),
  outcome1  = sample(c(TRUE, FALSE), 100, replace = TRUE),
  outcome2  = sample(c(TRUE, FALSE), 100, replace = TRUE)
)

dat %>% 
  gather(key = outcome, value = outcome_value, contains("outcome")) %>%
  gather(key = predictor, value = pred_value, X1_gender, X2_ethnicity) %>%
  group_by(predictor, pred_value, outcome) %>%
  summarise(n = sum(outcome_value, na.rm = TRUE)) %>%
  spread(key = outcome, value = n) %>%
  mutate(predictor=gsub("^X[0-9]_","", predictor))
 

Result:

`summarise()` regrouping output by 'predictor', 'pred_value' (override with 
`.groups` argument)
# A tibble: 9 x 4
# Groups:   predictor, pred_value [9]
  predictor pred_value outcome1 outcome2
  <chr>     <chr>         <int>    <int>
1 gender    Female           16       21
2 gender    Male             12       15
3 gender    Unknown          18       16
4 ethnicity Asian             4        6
5 ethnicity European         13       13
6 ethnicity Maori             4        6
7 ethnicity Other             7       11
8 ethnicity Pacific          10        9
9 ethnicity Unknown           8        7
Warning message:
attributes are not identical across measure variables;
they will be dropped
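With ten or more predictors, a single-digit prefix such as "X10_" would sort before "X2_", and the pattern "^X[0-9]_" would no longer match it. A possible variation, sketched here in base R with zero-padded prefixes and a hypothetical third predictor age_band that is not in the original data:

```r
# "age_band" is a made-up predictor name, for illustration only
predictors <- c("gender", "ethnicity", "age_band")

# Zero-padded prefixes keep alphabetical order equal to the intended order
prefixed <- sprintf("X%02d_%s", seq_along(predictors), predictors)

# Strip the prefix again; [0-9]+ allows any number of digits
stripped <- sub("^X[0-9]+_", "", prefixed)

print(prefixed)  # "X01_gender" "X02_ethnicity" "X03_age_band"
print(stripped)  # "gender" "ethnicity" "age_band"
```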