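I am trying to build a table of descriptive statistics from a dataset that contains both numeric and categorical data. I would like my table to look like this:
The NA cells can be blank or left out.
My data looks like this: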
df <- data.frame(
id = c(1:6),
country = c("United Kingdom", "United Kingdom", "United Kingdom",
"Canada", "Canada", "Germany"),
gender = c("Male", "Female", "Male", "Female", "Female", "Male"),
height = c(1.9, 1.8, 2.0, 1.7, 1.9, 2.1),
play_basketball = c("Yes", "Yes", "No", "Yes", "No", "Yes"),
stringsAsFactors = TRUE
)
Things I have tried:
ftable and prop.table can handle the categorical data, but I am not sure how to drop the "No" column and add (freq / total):
table1 <- ftable(df$country, df$gender, df$play_basketball)
prop.table(table1, 1)
No Yes
Canada Female 0.5 0.5
Male NaN NaN
Germany Female NaN NaN
Male 0.0 1.0
United Kingdom Female 0.0 1.0
Male 0.5 0.5
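For the numeric data, I know how to compute each mean and sd by hand, but not how to do it so that the values are added to the table automatically: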
mean(subset(df, country == "United Kingdom" &
gender == "Male")$height, na.rm = TRUE)
sd(subset(df, country == "United Kingdom" &
gender == "Male")$height, na.rm = TRUE)
I am tagging dplyr because it has gotten me out of trouble before, but I am not looking for a dplyr-only solution.
Answer 0 (score: 1)
Here is one way using data.table:
library(data.table)
setDT(df)
df[,list(
heightMean = mean(height),
heightSd = sd(height),
basketballPlayers = sum(play_basketball == "Yes")/.N),
by = list(country,gender)]
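The result above holds raw numbers only. As a minimal sketch of how those could be turned into the "mean ± sd" and "pct (freq / total)" strings the question asks for, the following uses sprintf. The column names (mean_height, sd_height, n_yes, n, height, bball) and the choice to show only the mean when a group has a single row are assumptions made for illustration, not part of the original answer.

```r
library(data.table)

# Same sample data as in the question (id column omitted)
df <- data.frame(
  country = c("United Kingdom", "United Kingdom", "United Kingdom",
              "Canada", "Canada", "Germany"),
  gender = c("Male", "Female", "Male", "Female", "Female", "Male"),
  height = c(1.9, 1.8, 2.0, 1.7, 1.9, 2.1),
  play_basketball = c("Yes", "Yes", "No", "Yes", "No", "Yes"),
  stringsAsFactors = TRUE
)
dt <- as.data.table(df)

# Compute the per-group numbers first
stats <- dt[, .(
  mean_height = mean(height),
  sd_height   = sd(height),                        # NA when a group has one row
  n_yes       = sum(play_basketball == "Yes"),
  n           = .N
), by = .(country, gender)]

# Build the display strings; assumption: show only the mean when sd is NA
stats[, height := ifelse(is.na(sd_height),
                         sprintf("%.2f", mean_height),
                         sprintf("%.2f ± %.2f", mean_height, sd_height))]
stats[, bball := sprintf("%.0f%% (%d / %d)", 100 * n_yes / n, n_yes, n)]

result <- stats[, .(country, gender, height, bball)]
print(result)
```

Keeping the numeric columns in stats and formatting only at the end means the raw values stay available for further calculations.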
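Answer 1 (score: 1)
You can use dplyr::summarise to get all the summary statistics, then use stringr::str_glue to build the formatted strings easily.
If you break down the calculations the table needs, each group has the mean and standard deviation of height, the number of basketball players, the total number of rows, and the share of basketball players out of the total.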
library(dplyr)
calcs <- df %>%
mutate(gender = forcats::fct_relevel(gender, "Male"),
country = forcats::fct_relevel(country, "United Kingdom", "Canada")) %>%
group_by(country, gender) %>%
summarise(mean_height = round(mean(height, na.rm = TRUE), digits = 2),
sd_height = round(sd(height, na.rm = TRUE), digits = 2),
count_bball = sum(play_basketball == "Yes"),
n = n(),
share_bball = count_bball / n) %>%
ungroup() %>%
tidyr::replace_na(list(sd_height = 0))
calcs
#> # A tibble: 4 x 7
#> country gender mean_height sd_height count_bball n share_bball
#> <fct> <fct> <dbl> <dbl> <int> <int> <dbl>
#> 1 United Kingdom Male 1.95 0.07 1 2 0.5
#> 2 United Kingdom Female 1.8 0 1 1 1
#> 3 Canada Female 1.8 0.14 1 2 0.5
#> 4 Germany Male 2.1 0 1 1 1
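Then you can glue the formatted strings together, drop the columns you do not need, and optionally put the result into a print format. tidyr::complete gives you NA values for the combinations of groups that do not appear in the data.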
formatted <- calcs %>%
mutate(height = stringr::str_glue("{mean_height} ± {scales::percent(sd_height)}"),
bball = stringr::str_glue("{scales::percent(share_bball, accuracy = 1)} ({count_bball} / {n})")) %>%
tidyr::complete(country, gender) %>%
select(country, gender, height, bball)
knitr::kable(formatted)
|country |gender |height |bball |
|:--------------|:------|:---------|:------------|
|United Kingdom |Male |1.95 ± 7% |50% (1 / 2) |
|United Kingdom |Female |1.8 ± 0% |100% (1 / 1) |
|Canada |Male |NA |NA |
|Canada |Female |1.8 ± 14% |50% (1 / 2) |
|Germany |Male |2.1 ± 0% |100% (1 / 1) |
|Germany |Female |NA |NA |
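The question mentions that NA cells may be blank. One possible way to get that, sketched here in base R, is to replace the NA values with empty strings before printing. The small data frame below is a hypothetical stand-in for two rows of the formatted table above, and blank is an illustrative name.

```r
# Hypothetical stand-in for two rows of the formatted table
formatted <- data.frame(
  country = c("Canada", "Canada"),
  gender  = c("Male", "Female"),
  height  = c(NA, "1.8 ± 14%"),
  bball   = c(NA, "50% (1 / 2)"),
  stringsAsFactors = FALSE
)

# Replace every NA cell with an empty string so it prints as blank
blank <- formatted
blank[is.na(blank)] <- ""

print(blank)
```

The same data frame can then be passed to knitr::kable in place of formatted.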