How do I get the percentage of a variable for each subgroup?

Time: 2019-12-16 15:26:40

Tags: r

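I am using the esoph dataset that ships with R. The dataset shows whether age, alcohol consumption, and tobacco consumption are correlated with the likelihood of esophageal cancer.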

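I have a small plot showing the mean number of esophageal cancer cases (the "ncases" variable) for each tobacco consumption group (the "tobgp" variable, 4 groups, from 0-9g/day to 30+):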

X = subset(esoph, select = c("tobgp", "ncases"))
heights=tapply(X$ncases, X$tobgp, mean)
barplot(heights, main = "Mean number of cases by tobacco consumption", 
        names.arg = c("0-9", "10-19", "20-29", "30+"),
        xlab="Daily tobacco consumption (grams)", ylab = "Number of cases")

I would like to know what percentage of ncases falls in each group. I tried:

tobgp9_data <- esoph[which(esoph[,"tobgp"] == "0-9g/day"),]
tobgp9_noZero <- tobgp9_data[which(tobgp9_data[, "ncases"] > 0 ),]
sum (tobgp9_noZero$ncases)

tobgp19_data <- esoph[which(esoph[,"tobgp"] == "10-19"),]
tobgp19_noZero <- tobgp19_data[which(tobgp19_data[, "ncases"] > 0 ),]
sum (tobgp19_noZero$ncases)

tobgp29_data <- esoph[which(esoph[,"tobgp"] == "20-29"),]
tobgp29_noZero <- tobgp29_data[which(tobgp29_data[, "ncases"] > 0 ),]
sum (tobgp29_noZero$ncases)

tobgp30_data <- esoph[which(esoph[,"tobgp"] == "30+"),]
tobgp30_noZero <- tobgp30_data[which(tobgp30_data[, "ncases"] > 0 ),]
sum (tobgp30_noZero$ncases)

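However, this only gives me the sum of ncases for each tobgp subgroup, and that sum is taken across all the other variables, such as "agegp" (age group) and "alcgp" (daily alcohol consumption).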

2 answers:

Answer 0: (score: 0)

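Your code appears to compute the sum of ncases by group correctly. To turn each sum into a percentage of all cases, you only need to divide it by the total number of cases across all groups (200 in this example).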
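The division described above can be sketched in base R as follows. This is only an illustrative sketch, not part of the original answer: the names `case_sums` and `case_pct` are made up here, and it assumes the `esoph` data frame from R's built-in datasets package.

```r
# Total number of cases per tobacco group, summed over all
# age groups and alcohol groups
case_sums <- tapply(esoph$ncases, esoph$tobgp, sum)

# Each group's share of all cases, expressed as a percentage
case_pct <- 100 * case_sums / sum(case_sums)

round(case_pct, 1)
```

Because each sum is divided by the grand total, the four percentages add up to 100 regardless of how the cases are spread over agegp and alcgp.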

I find it much easier to do grouped calculations with dplyr.

library(dplyr)

X %>% 
  mutate(ncases_total = sum(ncases)) %>% 
  group_by(tobgp) %>% 
  summarise(ncases_sum = sum(ncases),
            ncases_total = first(ncases_total),
            ncases_pct = 100 * (ncases_sum / ncases_total)) %>% 
  ungroup() 
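If you want to cross-check the dplyr result without extra packages, one possible approach is `xtabs` together with `prop.table`. This is a hedged sketch, not something the answer itself proposes; `case_tab` and `pct_tab` are illustrative names.

```r
# xtabs() with a variable on the left-hand side sums that variable
# within each level of the right-hand-side factor
case_tab <- xtabs(ncases ~ tobgp, data = esoph)

# prop.table() rescales the counts to proportions that sum to 1
pct_tab <- 100 * prop.table(case_tab)

round(pct_tab, 1)
```

The values in `pct_tab` should agree with the `ncases_pct` column produced by the dplyr pipeline.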

Answer 1: (score: 0)

Consider using aggregate in combination with ave:

agg_df <- aggregate(ncases ~ tobgp, esoph, sum)
agg_df$pct <- with(agg_df, ave(ncases, tobgp, FUN=sum) / sum(ncases))

agg_df
#      tobgp ncases   pct
# 1 0-9g/day     78 0.390
# 2    10-19     58 0.290
# 3    20-29     33 0.165
# 4      30+     31 0.155

Or do it all in one line using within:

agg_df <- within(aggregate(ncases ~ tobgp, esoph, sum),
                  pct <- ave(ncases, tobgp, FUN=sum) / sum(ncases))

agg_df
#      tobgp ncases   pct
# 1 0-9g/day     78 0.390
# 2    10-19     58 0.290
# 3    20-29     33 0.165
# 4      30+     31 0.155

The same can be done for other groupings, and the approach extends to multiple grouping variables as well:

within(aggregate(ncases ~ agegp, esoph, sum),
       pct <- ave(ncases, agegp, FUN=sum) / sum(ncases))

#   agegp ncases   pct
# 1 25-34      1 0.005
# 2 35-44      9 0.045
# 3 45-54     46 0.230
# 4 55-64     76 0.380
# 5 65-74     55 0.275
# 6   75+     13 0.065

within(aggregate(ncases ~ alcgp, esoph, sum),
       pct <- ave(ncases, alcgp, FUN=sum) / sum(ncases))

#       alcgp ncases   pct
# 1 0-39g/day     29 0.145
# 2     40-79     75 0.375
# 3    80-119     51 0.255
# 4      120+     45 0.225
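For more than one grouping variable, the same aggregate and ave pattern might look like the sketch below. It is an assumption about how one could extend the approach, not code from the original answer; `agg2`, `pct_overall`, and `pct_within_age` are hypothetical names.

```r
# Sum cases for every combination of age group and tobacco group
agg2 <- aggregate(ncases ~ agegp + tobgp, data = esoph, FUN = sum)

# Percentage of all cases that falls in each age-by-tobacco cell
agg2$pct_overall <- 100 * agg2$ncases / sum(agg2$ncases)

# Percentage within each age group: ave() repeats the per-agegp
# total on every row belonging to that age group
agg2$pct_within_age <- 100 * agg2$ncases /
  ave(agg2$ncases, agg2$agegp, FUN = sum)

head(agg2)
```

Which denominator is appropriate depends on the question being asked: `pct_overall` sums to 100 over the whole table, while `pct_within_age` sums to 100 inside each age group.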