Summary table of numeric and categorical data in R

Asked: 2019-12-06 13:52:48

Tags: r dplyr

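I am trying to build a table of descriptive statistics for a dataset that contains both numeric and categorical data. I would like my table to look like this: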

Summary table

The NA cells can be blank or simply not shown.

My data looks like this:

df <- data.frame(
      id =  c(1:6),
      country = c("United Kingdom", "United Kingdom", "United Kingdom",
                  "Canada", "Canada", "Germany"),
      gender = c("Male", "Female", "Male", "Female", "Female", "Male"),
      height = c(1.9, 1.8, 2.0, 1.7, 1.9, 2.1),
      play_basketball = c("Yes", "Yes", "No", "Yes", "No", "Yes"),
      stringsAsFactors = TRUE
)

Things I have tried include:

ftable and prop.table can handle the categorical data, but I am not sure how to drop the "No" column and add (freq / total):

table1 <- ftable(df$country, df$gender, df$play_basketball)
prop.table(table1, 1)
                        No Yes                           
Canada         Female  0.5 0.5
               Male    NaN NaN
Germany        Female  NaN NaN
               Male    0.0 1.0
United Kingdom Female  0.0 1.0
               Male    0.5 0.5

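For the numeric data, I know how to compute each mean and sd by hand, but not how to do it so that the results are added to the table automatically: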

mean(subset(df, country == "United Kingdom" & 
                gender == "Male")$height, na.rm = TRUE)
sd(subset(df, country == "United Kingdom" & 
                gender == "Male")$height, na.rm = TRUE)

I am tagging dplyr because it has gotten me out of trouble before, but I am not looking for a dplyr-only solution.

2 answers:

Answer 0: (score: 1)

Here is an option using data.table:

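library(data.table)
setDT(df)  # convert df to a data.table by reference

# One row per country/gender group: mean and sd of height, share of "Yes"
df[, list(
  heightMean = mean(height),
  heightSd = sd(height),
  basketballPlayers = sum(play_basketball == "Yes") / .N),
  by = list(country, gender)]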
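
The result above reports the basketball share as a bare proportion. As a rough sketch, assuming the target cell format is "percent (count / total)" as the question's "(freq / total)" remark suggests, the strings could be built in the same grouped call with sprintf. The names dt, res and yes are illustrative only, and the data frame is rebuilt here so the snippet runs on its own.

```r
library(data.table)

# Same data as in the question, rebuilt so this snippet is self-contained
df <- data.frame(
  country = c("United Kingdom", "United Kingdom", "United Kingdom",
              "Canada", "Canada", "Germany"),
  gender = c("Male", "Female", "Male", "Female", "Female", "Male"),
  height = c(1.9, 1.8, 2.0, 1.7, 1.9, 2.1),
  play_basketball = c("Yes", "Yes", "No", "Yes", "No", "Yes")
)
dt <- as.data.table(df)

# Build one formatted string per statistic inside each country/gender group
res <- dt[, {
  yes <- sum(play_basketball == "Yes")  # number of players in the group
  list(
    # sd() is NA for single-row groups, so those cells read "... ± NA"
    height = sprintf("%.2f ± %.2f", mean(height), sd(height)),
    basketball = sprintf("%.0f%% (%d / %d)", 100 * yes / .N, yes, .N)
  )
}, by = .(country, gender)]

print(res)
```

Only groups that occur in the data appear in res; empty combinations such as Canada / Male would still need to be added separately if they should show up in the table.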

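Answer 1: (score: 1)

You can use dplyr::summarise to get all the summary statistics, then use stringr::str_glue to build the formatted strings easily.

If you break down the calculations the table needs, each group has a mean and standard deviation of height, a count of basketball players, a total row count, and the share of basketball players out of that total.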

library(dplyr)

calcs <- df %>%
  # reorder factor levels so groups sort as Male/Female and UK/Canada/Germany
  mutate(gender = forcats::fct_relevel(gender, "Male"),
         country = forcats::fct_relevel(country, "United Kingdom", "Canada")) %>%
  group_by(country, gender) %>%
  summarise(mean_height = round(mean(height, na.rm = TRUE), digits = 2),
            sd_height = round(sd(height, na.rm = TRUE), digits = 2),
            count_bball = sum(play_basketball == "Yes"),
            n = n(),
            share_bball = count_bball / n) %>%
  ungroup() %>%
  # sd() is NA for groups with a single row; show those as 0
  tidyr::replace_na(list(sd_height = 0))

calcs
#> # A tibble: 4 x 7
#>   country        gender mean_height sd_height count_bball     n share_bball
#>   <fct>          <fct>        <dbl>     <dbl>       <int> <int>       <dbl>
#> 1 United Kingdom Male          1.95      0.07           1     2         0.5
#> 2 United Kingdom Female        1.8       0              1     1         1  
#> 3 Canada         Female        1.8       0.14           1     2         0.5
#> 4 Germany        Male          2.1       0              1     1         1
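
The replace_na step matters because sd() returns NA for a group with a single row, which is why the United Kingdom / Female and Germany / Male rows show 0 above. The sketch below illustrates this in base R; single, pair and label are hypothetical names, and whether such cells should show 0 or a placeholder is a judgment call rather than something the question specifies.

```r
# Heights from the one-row United Kingdom / Female group
# and the two-row Canada / Female group
single <- c(1.8)
pair <- c(1.7, 1.9)

sd_single <- sd(single)  # NA: the n - 1 denominator is zero
sd_pair <- sd(pair)      # defined, since there are two observations

# Possible alternative to replacing NA with 0: print a dash instead
label <- ifelse(is.na(sd_single), "-", format(round(sd_single, 2)))

print(sd_single)
print(sd_pair)
print(label)
```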

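You can then glue the formatted strings together, drop the columns you no longer need, and optionally put the result into a print-ready format. tidyr::complete gives you NA values for the combinations of groups that do not occur in the data.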

formatted <- calcs %>%
  # sd_height is in the same units as height, so print it as a plain number
  mutate(height = stringr::str_glue("{mean_height} ± {sd_height}"),
         bball = stringr::str_glue("{scales::percent(share_bball, accuracy = 1)} ({count_bball} / {n})")) %>%
  tidyr::complete(country, gender) %>%
  select(country, gender, height, bball)

knitr::kable(formatted)

|country        |gender |height      |bball        |
|:--------------|:------|:-----------|:------------|
|United Kingdom |Male   |1.95 ± 0.07 |50% (1 / 2)  |
|United Kingdom |Female |1.8 ± 0     |100% (1 / 1) |
|Canada         |Male   |NA          |NA           |
|Canada         |Female |1.8 ± 0.14  |50% (1 / 2)  |
|Germany        |Male   |2.1 ± 0     |100% (1 / 1) |
|Germany        |Female |NA          |NA           |
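
The question notes that NA cells may be left blank. One way to get that with knitr::kable is the global option knitr.kable.NA. This is shown as a sketch on a small hypothetical table tbl rather than on the full formatted result.

```r
library(knitr)

# Small stand-in table with one empty group (tbl is a hypothetical name)
tbl <- data.frame(
  country = c("Canada", "Canada"),
  gender = c("Male", "Female"),
  bball = c(NA, "50% (1 / 2)")
)

# Ask kable to print missing values as empty strings, then restore the old setting
old <- options(knitr.kable.NA = "")
out <- kable(tbl)
options(old)

print(out)
```

Because the option is global, saving and restoring the previous value keeps other tables in the same session unaffected.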