Creating a table with counts and percentages when some data is missing

Time: 2019-07-15 14:07:29

Tags: r datatable count percentage kableextra

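I am trying to create a table with preset dimensions and fill it in with counts and percentages using R. This is for an R Markdown report.

Here is the code for my sample data.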

#This is the most realistic data I could produce.
Maj <- rep("Major A", times=50)
set.seed(24601) 
Race <- sample(c("Asian","Black", "Am Indian","Hawiian" ,"Hispanic","White","Two or More Races","Not Reported"),
                 prob=c(.01,.1,.01,.01,.02,.80,.05,.01),size=50, replace = T)
Sex <- sample(c("Female","Male"), prob=c(.98,.02),size=50,replace=T)

Enroll_MajorA <- cbind(Maj,Sex,Race)

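I need the table to show counts and percentages for every race and sex combination, whether or not that combination occurs in the data set. This is what I need it to look like.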

Table Format

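I tried computing each value of the table individually, and R Markdown gave me an "out of memory" error. I also tried creating a table of counts and a table of percentages and combining them, but that did not give me all of the race/sex combinations I need for the report. I am not sure where to go from here. Please help!

2 answers:

Answer 0 (score: 3)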

You can use aggregate. With as.data.frame you can leave your matrix as it is, since its columns will automatically be converted into countable factors. NROW (in capital letters) does not distinguish between a matrix and a vector.

m.agg <- do.call(data.frame, 
                 aggregate(. ~ Sex + Race, as.data.frame(Enroll_MajorA), 
                           function(x) c(count=as.integer(NROW(x)), 
                                         share=NROW(x) / NROW(Enroll_MajorA))))

To get the full set of combinations we can merge with expand.grid, and we may want to clean the result up a little.

res <- merge(as.data.frame(m.agg), 
             expand.grid(Sex=c("Female", "Male"), 
                         Race=relevant.races), all=TRUE)  # `relevant.races` below
res[, 3:4][is.na(res[, 3:4])] <- 0  # transform `NA` into 0 to be nice
res[order(res[, "Race"]), ]  # order output
#       Sex              Race Maj.count Maj.share
# 1  Female             Black         2      0.04
# 10   Male             Black         0      0.00
# 2  Female           Hawiian         1      0.02
# 3  Female          Hispanic         1      0.02
# 11   Male          Hispanic         0      0.00
# 4  Female Two or More Races         2      0.04
# 12   Male Two or More Races         0      0.00
# 5  Female             White        44      0.88
# 13   Male             White         0      0.00
# 6  Female             Asian         0      0.00
# 14   Male             Asian         0      0.00
# 7  Female        Am. Indian         0      0.00
# 15   Male        Am. Indian         0      0.00
# 8  Female          Hawaiian         0      0.00
# 16   Male          Hawaiian         0      0.00
# 9  Female      Not Reported         0      0.00
# 17   Male      Not Reported         0      0.00

relevant.races <- c("Asian","Black", "Am. Indian", "Hawaiian" , "Hispanic", "White", 
                    "Two or More Races", "Not Reported")

Enroll_MajorA <- structure(c("Major A", "Major A", "Major A", "Major A", "Major A", 
"Major A", "Major A", "Major A", "Major A", "Major A", "Major A", 
"Major A", "Major A", "Major A", "Major A", "Major A", "Major A", 
"Major A", "Major A", "Major A", "Major A", "Major A", "Major A", 
"Major A", "Major A", "Major A", "Major A", "Major A", "Major A", 
"Major A", "Major A", "Major A", "Major A", "Major A", "Major A", 
"Major A", "Major A", "Major A", "Major A", "Major A", "Major A", 
"Major A", "Major A", "Major A", "Major A", "Major A", "Major A", 
"Major A", "Major A", "Major A", "Female", "Female", "Female", 
"Female", "Female", "Female", "Female", "Female", "Female", "Female", 
"Female", "Female", "Female", "Female", "Female", "Female", "Female", 
"Female", "Female", "Female", "Female", "Female", "Female", "Female", 
"Female", "Female", "Female", "Female", "Female", "Female", "Female", 
"Female", "Female", "Female", "Female", "Female", "Female", "Female", 
"Female", "Female", "Female", "Female", "Female", "Female", "Female", 
"Female", "Female", "Female", "Female", "Female", "White", "White", 
"White", "Hawiian", "White", "White", "White", "White", "White", 
"White", "White", "White", "White", "Two or More Races", "White", 
"White", "White", "White", "White", "White", "White", "Hispanic", 
"White", "White", "White", "White", "White", "White", "Two or More Races", 
"White", "White", "White", "White", "White", "White", "White", 
"White", "Black", "White", "White", "Black", "White", "White", 
"White", "White", "White", "White", "White", "White", "White"
), .Dim = c(50L, 3L), .Dimnames = list(NULL, c("Maj", "Sex", 
"Race")))

Data: the relevant.races vector and the Enroll_MajorA matrix shown above are the inputs used in this answer.
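
For comparison, here is a minimal base-R sketch of the same idea using table() instead of aggregate plus merge. It is not part of the original answer. The toy data frame and the names sex_levels, race_levels and toy are assumptions made up for illustration. The point it shows is that table() on factors with explicitly declared levels keeps empty combinations as zero counts, so no merge step is needed.

```r
# Hypothetical toy data; the level vectors are assumptions for illustration.
sex_levels  <- c("Female", "Male")
race_levels <- c("Asian", "Black", "White")

toy <- data.frame(
  Sex  = factor(c("Female", "Female", "Female", "Male"), levels = sex_levels),
  Race = factor(c("White", "White", "Black", "White"), levels = race_levels)
)

# table() keeps every combination of the declared levels, including empty ones.
counts <- table(Sex = toy$Sex, Race = toy$Race)

# as.data.frame() on a table gives one row per cell; name the count column.
res <- as.data.frame(counts, responseName = "count")
res$share <- res$count / sum(res$count)

# Order by race, then sex, for display.
res[order(res$Race, res$Sex), ]
```

If the level names in the data and in the level vectors differ in spelling (as with "Hawiian" versus "Hawaiian" in the output above), the mismatched values become NA when converted to a factor, so it is worth checking the spellings first.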

Answer 1 (score: 0)

One way using tidyverse. Setting .drop = FALSE in group_by will include the missing factor levels.

library(tidyverse)

Enroll_MajorA %>%
   group_by(Race, Sex, .drop = FALSE) %>%
   summarise(count = n()) %>%
   ungroup() %>%
   mutate(perc = count/sum(count)) %>%
   gather(key, value, -Sex, -Race) %>%
   unite(Race, Race, key) %>%
   spread(Race, value)

Data

As @Cath commented, we need to explicitly include all of the levels in the data.

Maj <- rep("Major A", times=50)
set.seed(24601) 
Race <- factor(sample(c("Asian","Black", "Am Indian","Hawiian" ,"Hispanic","White","Two or More Races","Not Reported"),
           prob=c(.01,.1,.01,.01,.02,.80,.05,.01),size=50, replace = T), 
           levels=c("Asian","Black", "Am Indian","Hawiian" ,"Hispanic","White","Two or More Races","Not Reported"))
Sex <- factor(sample(c("Female","Male"), prob=c(.98,.02),size=50,replace=T), levels = c("Female","Male"))

Enroll_MajorA <- data.frame(Maj,Sex,Race)
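
Since the question is about an R Markdown report, the following is a rough sketch of how a wide table could be built and rendered. It is not the answerer's code. It assumes dplyr (0.8.0 or later), tidyr (1.0.0 or later) and knitr are installed, and the toy data and the names sex_levels, race_levels, toy and wide are hypothetical. It uses count() with .drop = FALSE and pivot_wider() in place of the gather/unite/spread chain.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with explicitly declared factor levels.
sex_levels  <- c("Female", "Male")
race_levels <- c("Asian", "Black", "White")

toy <- data.frame(
  Sex  = factor(c("Female", "Female", "Female", "Male"), levels = sex_levels),
  Race = factor(c("White", "White", "Black", "White"), levels = race_levels)
)

wide <- toy %>%
  # .drop = FALSE keeps Race/Sex combinations that have zero rows.
  count(Race, Sex, name = "count", .drop = FALSE) %>%
  # Percentage of all rows in the data set.
  mutate(perc = 100 * count / sum(count)) %>%
  # One row per race; columns count_Female, count_Male, perc_Female, perc_Male.
  pivot_wider(names_from = Sex, values_from = c(count, perc))

# Render as a table inside an R Markdown chunk.
knitr::kable(wide, digits = 1)
```

The result of knitr::kable() can be passed on to kableExtra functions for styling if that package is used in the report.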