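I'm working with the esoph dataset that ships with R (in the datasets package). The dataset is used to examine whether age, alcohol consumption and tobacco consumption are associated with the likelihood of esophageal cancer.
I have a small plot showing the mean number of esophageal cancer cases (the "ncases" variable) for each tobacco consumption group (the "tobgp" variable, 4 groups ranging from 0-9 g/day to 30+):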
X = subset(esoph, select = c("tobgp", "ncases"))
heights=tapply(X$ncases, X$tobgp, mean)
barplot(heights, main = "Mean number of cases by tobacco consumption",
names.arg = c("0-9", "10-19", "20-29", "30+"),
xlab="Daily tobacco consumption (grams)", ylab = "Number of cases")
I'd like to know what percentage of ncases falls in each group. I tried:
tobgp9_data <- esoph[which(esoph[,"tobgp"] == "0-9g/day"),]
tobgp9_noZero <- tobgp9_data[which(tobgp9_data[, "ncases"] > 0 ),]
sum (tobgp9_noZero$ncases)
tobgp19_data <- esoph[which(esoph[,"tobgp"] == "10-19"),]
tobgp19_noZero <- tobgp19_data[which(tobgp19_data[, "ncases"] > 0 ),]
sum (tobgp19_noZero$ncases)
tobgp29_data <- esoph[which(esoph[,"tobgp"] == "20-29"),]
tobgp29_noZero <- tobgp29_data[which(tobgp29_data[, "ncases"] > 0 ),]
sum (tobgp29_noZero$ncases)
tobgp30_data <- esoph[which(esoph[,"tobgp"] == "30+"),]
tobgp30_noZero <- tobgp30_data[which(tobgp30_data[, "ncases"] > 0 ),]
sum (tobgp30_noZero$ncases)
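However, this gives me the sum of ncases for each tobgp subgroup rather than a percentage, and the rows being summed are still split by all the other variables, such as "agegp" (age group) and "alcgp" (daily alcohol consumption).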
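Answer 0: (score: 0)
Your code appears to compute the sum of ncases by group correctly. To turn those sums into percentages of all cases, just divide each one by the total number of cases across all groups (200 in this case).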
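As a minimal base R sketch of that division (the names `sums` and `pct` are only illustrative, not from the question), you could sum ncases per tobgp level and divide by the grand total:

```r
# Sum of cases within each tobacco group (esoph is built into R)
sums <- tapply(esoph$ncases, esoph$tobgp, sum)

# Share of all cases in each group, expressed as a percentage
pct <- 100 * sums / sum(esoph$ncases)
round(pct, 1)
```

With group sums of 78, 58, 33 and 31 (200 in total), this works out to 39, 29, 16.5 and 15.5 percent.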
I find by-group calculations like this much easier with dplyr.
library(dplyr)
X %>%
mutate(ncases_total = sum(ncases)) %>%
group_by(tobgp) %>%
summarise(ncases_sum = sum(ncases),
ncases_total = first(ncases_total),
ncases_pct = 100 * (ncases_sum / ncases_total)) %>%
ungroup()
Answer 1: (score: 0)
Consider using aggregate and ave together:
agg_df <- aggregate(ncases ~ tobgp, esoph, sum)
agg_df$pct <- with(agg_df, ave(ncases, tobgp, FUN=sum) / sum(ncases))
agg_df
# tobgp ncases pct
# 1 0-9g/day 78 0.390
# 2 10-19 58 0.290
# 3 20-29 33 0.165
# 4 30+ 31 0.155
Or do it all in a single statement using within:
agg_df <- within(aggregate(ncases ~ tobgp, esoph, sum),
pct <- ave(ncases, tobgp, FUN=sum) / sum(ncases))
agg_df
# tobgp ncases pct
# 1 0-9g/day 78 0.390
# 2 10-19 58 0.290
# 3 20-29 33 0.165
# 4 30+ 31 0.155
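As a compact alternative that is not part of the answer above, base R's xtabs and prop.table can produce the same shares. This is only a sketch; `tab` and `pct_tab` are illustrative names:

```r
# xtabs sums ncases within each tobgp level
tab <- xtabs(ncases ~ tobgp, data = esoph)

# prop.table rescales the sums so they add up to 1; convert to percentages
pct_tab <- round(100 * prop.table(tab), 1)
pct_tab
```

The result is a table rather than a data frame, which is convenient for printing or passing to barplot but less so for further joins.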
The same can be done for other grouping variables:
within(aggregate(ncases ~ agegp, esoph, sum),
pct <- ave(ncases, agegp, FUN=sum) / sum(ncases))
# agegp ncases pct
# 1 25-34 1 0.005
# 2 35-44 9 0.045
# 3 45-54 46 0.230
# 4 55-64 76 0.380
# 5 65-74 55 0.275
# 6 75+ 13 0.065
within(aggregate(ncases ~ alcgp, esoph, sum),
pct <- ave(ncases, alcgp, FUN=sum) / sum(ncases))
# alcgp ncases pct
# 1 0-39g/day 29 0.145
# 2 40-79 75 0.375
# 3 80-119 51 0.255
# 4 120+ 45 0.225
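The same pattern should extend to several grouping variables at once. The sketch below is an assumption about how that might look, not code from the original answer; `agg2`, `pct_overall` and `pct_in_agegp` are made-up names. Here ave does real work, because each age group now spans several rows:

```r
# Sum cases for every agegp x tobgp combination present in the data
agg2 <- aggregate(ncases ~ agegp + tobgp, data = esoph, FUN = sum)

# Share of all cases that fall in each combination
agg2$pct_overall <- agg2$ncases / sum(agg2$ncases)

# Share within each age group: divide by that age group's own total
agg2$pct_in_agegp <- with(agg2, ncases / ave(ncases, agegp, FUN = sum))

head(agg2)
```

Which denominator to use depends on the question being asked: `pct_overall` sums to 1 over the whole table, while `pct_in_agegp` sums to 1 inside each age group.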