使用dplyr从R中的多个列创建多个二维表

时间:2019-03-28 22:09:40

标签: r dplyr tidyr

我正在寻找一种有效的方法,用于从卡方统计量的R数据框中创建多个二维表。以下代码建立在this answer的基础上,是我的上一个关于按组获取卡方统计量的问题。现在,我想按组从输出中创建表。到目前为止,这是我使用UCLA R网站的hsbdemo数据框所得到的:

ml <- foreign::read.dta("https://stats.idre.ucla.edu/stat/data/hsbdemo.dta")

str(ml)

'data.frame':   200 obs. of  13 variables:
 $ id     : num  45 108 15 67 153 51 164 133 2 53 ...
 $ female : Factor w/ 2 levels "male","female": 2 1 1 1 1 2 1 1 2 1 ...
 $ ses    : Factor w/ 3 levels "low","middle",..: 1 2 3 1 2 3 2 2 2 2 ...
 $ schtyp : Factor w/ 2 levels "public","private": 1 1 1 1 1 1 1 1 1 1 ...
 $ prog   : Factor w/ 3 levels "general","academic",..: 3 1 3 3 3 1 3 3 3 3 ...

ml %>% 
  dplyr::select(prog, ses, schtyp) %>%
  table() %>%
  apply(3, chisq.test, simulate.p.value = TRUE) %>%
  lapply(`[`, c(6,7,9)) %>%
  reshape2::melt() %>%
  tidyr::spread(key = L2, value = value) %>%
  dplyr::rename(SchoolType = L1) %>%
  dplyr::arrange(SchoolType, prog) %>%
  dplyr::select(-observed, -expected) %>%
  reshape2::acast(., prog ~ ses ~ SchoolType ) %>%
  tbl_df()  

最后一个range语句之后的输出将产生此小标题(仅显示前五行):

       prog    ses SchoolType expected observed     stdres
1   general    low    private  0.37500        2  3.0404678
2   general middle    private  3.56250        3 -0.5187244
3   general   high    private  2.06250        1 -1.0131777
4  academic    low    private  1.50000        0 -2.5298221
5  academic middle    private 14.25000       14 -0.2078097

很容易选择一列,例如stdres,并将其传递给acast和tbl_df,这几乎是我想要的:

# A tibble: 3 x 6
  low.private middle.private high.private low.public middle.public high.public
        <dbl>          <dbl>        <dbl>      <dbl>         <dbl>       <dbl>
1       3.04          -0.519        -1.01      1.47         -0.236       -1.18
2      -2.53          -0.208         1.50     -0.940        -2.06         3.21
3      -0.377          1.21         -1.06     -0.331         2.50        -2.45

现在,我可以针对观察到的频率和预期的频率重复这些步骤,并按行绑定它们,但这似乎效率很低。输出将观察到堆叠在期望值上,堆叠在标准化残差上的频率。像这样:

 low.private middle.private high.private low.public middle.public high.public
        <dbl>          <dbl>        <dbl>      <dbl>         <dbl>       <dbl>
1       2              3            1         14            17            8   
2       0             14           10         19            30           32   
3       0              2            0         12            29            7   
4       0.375          3.56         2.06      10.4          17.6         10.9 
5       1.5           14.2          8.25      21.7          36.6         22.7 
6       0.125          1.19         0.688     12.9          21.7         13.4 
7       3.04          -0.519       -1.01       1.47         -0.236       -1.18
8      -2.53          -0.208        1.50      -0.940        -2.06         3.21
9      -0.377          1.21        -1.06      -0.331         2.50        -2.45

似乎应该有一种方法,而不必通过创建和处理列表来重复每一列的代码。提前致谢。

1 个答案:

答案 0 :(得分:1)

这可能是答案吗?

   ml1 <- ml %>% 
      dplyr::select(prog, ses, schtyp) %>%
      table() %>%
      apply(3, chisq.test, simulate.p.value = TRUE) %>%
      lapply(`[`, c(6,7,9)) %>%
      reshape2::melt()
   ml2 <- ml1 %>%
      dplyr::mutate(type=paste(ses, L1, sep=".")) %>%
      dplyr::select(-ses, -L1) %>%
      tidyr::spread(type, value)

这给你

      prog       L2 high.private high.public low.private low.public middle.private middle.public
1  general expected     2.062500   10.910714   0.3750000 10.4464286      3.5625000    17.6428571
2  general observed     1.000000    8.000000   2.0000000 14.0000000      3.0000000    17.0000000
3  general   stdres    -1.013178   -1.184936   3.0404678  1.4663681     -0.5187244    -0.2360209
4 academic expected     8.250000   22.660714   1.5000000 21.6964286     14.2500000    36.6428571
5 academic observed    10.000000   32.000000   0.0000000 19.0000000     14.0000000    30.0000000
6 academic   stdres     1.504203    3.212431  -2.5298221 -0.9401386     -0.2078097    -2.0607058
7 vocation expected     0.687500   13.428571   0.1250000 12.8571429      1.1875000    21.7142857
8 vocation observed     0.000000    7.000000   0.0000000 12.0000000      2.0000000    29.0000000
9 vocation   stdres    -1.057100   -2.445826  -0.3771236 -0.3305575      1.2081594     2.4999085

我不确定我完全了解您之后的情况。但是,基本上,创建一个新的SES和学校类型变量,然后基于该变量进行收集。显然,可以根据需要重新排序:-)