I have data that looks like this:
df <- data.frame(
  cancer = c(1, 0, 0, 0, 0, 1, 0, 0, 0, 0),
  CVD = c(0, 1, 1, 0, 1, 0, 0, 0, 0, 0),
  diab = c(0, 0, 0, 1, 0, 1, 0, 0, 1, 0),
  stroke = c(0, 1, 1, 0, 1, 0, 0, 0, 1, 0),
  asthma = c(0, 0, 0, 0, 1, 1, 0, 0, 0, 0),
  LTC_count = c(1, 2, 2, 1, 4, 3, 0, 0, 2, 0))
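My real data is much larger, about one million rows. Each row is a person, and each variable indicates whether that person has the condition (1 = yes).
What I want is a data frame containing the proportion of people with and without each condition.
This is what I currently do to produce the desired output:
1) Compute the proportion of people with each condition separately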
library(dplyr)

# One block per condition: count the 0/1 groups, then convert to percentages
Prop_cancer <- df %>%
  group_by(cancer) %>%
  summarise(count = n()) %>%
  mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
  mutate(condition = "cancer") %>%
  rename(Y_N = cancer)
Prop_CVD <- df %>%
  group_by(CVD) %>%
  summarise(count = n()) %>%
  mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
  mutate(condition = "CVD") %>%
  rename(Y_N = CVD)
Prop_diab <- df %>%
  group_by(diab) %>%
  summarise(count = n()) %>%
  mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
  mutate(condition = "diab") %>%
  rename(Y_N = diab)
Prop_stroke <- df %>%
  group_by(stroke) %>%
  summarise(count = n()) %>%
  mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
  mutate(condition = "stroke") %>%
  rename(Y_N = stroke)
Prop_asthma <- df %>%
  group_by(asthma) %>%
  summarise(count = n()) %>%
  mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
  mutate(condition = "asthma") %>%
  rename(Y_N = asthma)
Then bind all of these together:
Prop_allcond <- bind_rows(Prop_cancer, Prop_CVD, Prop_stroke, Prop_diab, Prop_asthma)
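I have many conditions and a lot of data. Is there a simpler/faster way to do this?
I considered creating a new variable "condition" in the original data frame with an ifelse statement, but that does not allow a person to have more than one condition, and the conditions would take priority in the order I specify them.
Any advice on how to simplify this so that the code is not so long would be much appreciated.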
Answer 0: (score: 2)
Percentage of people with each condition:
colSums(df) / nrow(df) * 100
#   cancer       CVD      diab    stroke    asthma LTC_count
#       20        30        30        40        20       150
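Note that the LTC_count value above is not a meaningful percentage, because that column is a count rather than a 0/1 indicator. Here is a minimal base R sketch that drops it and also derives the "no" percentages. The names conds, pct_yes and pct_no are mine, and it assumes every condition column contains only 0 and 1 with no missing values.

```r
# Example data from the question
df <- data.frame(
  cancer = c(1, 0, 0, 0, 0, 1, 0, 0, 0, 0),
  CVD = c(0, 1, 1, 0, 1, 0, 0, 0, 0, 0),
  diab = c(0, 0, 0, 1, 0, 1, 0, 0, 1, 0),
  stroke = c(0, 1, 1, 0, 1, 0, 0, 0, 1, 0),
  asthma = c(0, 0, 0, 0, 1, 1, 0, 0, 0, 0),
  LTC_count = c(1, 2, 2, 1, 4, 3, 0, 0, 2, 0))

# Keep only the 0/1 condition columns (assumption: everything except LTC_count)
conds <- setdiff(names(df), "LTC_count")

# The mean of a 0/1 column is the proportion of 1s
pct_yes <- colMeans(df[conds]) * 100
pct_no <- 100 - pct_yes

# One row per answer, one column per condition
pct_table <- rbind(Y = pct_yes, N = pct_no)
pct_table
```

If the real data contains NA values, colMeans(..., na.rm = TRUE) would change the denominator to the non-missing rows, so decide first which denominator you want.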
Answer 1: (score: 2)
With dplyr this can be done in a single line, without gathering or any other reshaping:
df %>% summarize_at(vars(-LTC_count),funs(sum(.)/n()))
cancer CVD diab stroke asthma
1 0.2 0.3 0.3 0.4 0.2
If we want both the yes and no frequencies:
bind_rows("Y"=summarize_at(df,vars(-LTC_count),funs(sum(.)/n()*100)),
"N"=summarize_at(df,vars(-LTC_count),funs(sum(!.)/n()*100)),.id="id")
id cancer CVD diab stroke asthma
1 Y 20 30 30 40 20
2 N 80 70 70 60 80
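funs() was deprecated in dplyr 0.8.0, so on a recent dplyr the calls above may warn or fail. A rough equivalent using across() might look like the sketch below. It assumes dplyr 1.0.0 or later, and the result name pct is mine.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  cancer = c(1, 0, 0, 0, 0, 1, 0, 0, 0, 0),
  CVD = c(0, 1, 1, 0, 1, 0, 0, 0, 0, 0),
  diab = c(0, 0, 0, 1, 0, 1, 0, 0, 1, 0),
  stroke = c(0, 1, 1, 0, 1, 0, 0, 0, 1, 0),
  asthma = c(0, 0, 0, 0, 1, 1, 0, 0, 0, 0),
  LTC_count = c(1, 2, 2, 1, 4, 3, 0, 0, 2, 0))

# across(-LTC_count, ...) applies the lambda to every column except LTC_count;
# mean() of a logical vector is the proportion of TRUE values
pct <- bind_rows(
  Y = summarise(df, across(-LTC_count, ~ mean(.x == 1) * 100)),
  N = summarise(df, across(-LTC_count, ~ mean(.x == 0) * 100)),
  .id = "id"
)
pct
```

For the example data this should reproduce the Y/N table shown above.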
In response to your request for a long-format data set, I can do the following, but frankly you would be better off using @Ronak's solution:
df1<-bind_rows("Y"=summarize_at(df,vars(-LTC_count),funs(count=sum(.), freq=sum(.)/n()*100)),
"N"=summarize_at(df,vars(-LTC_count),funs(count=sum(!.), freq=sum(!.)/n()*100)),.id="Y_N")
df1<-bind_cols(select(gather(df1,"condition","count",ends_with("_count")),-ends_with("freq")),
select(gather(df1,"condition","freq",ends_with("_freq")),freq))[,c(2,3,4,1)]
df1$condition<-gsub("_count","",df1$condition)
condition count freq Y_N
1 cancer 2 20 Y
2 cancer 8 80 N
3 CVD 3 30 Y
4 CVD 7 70 N
5 diab 3 30 Y
6 diab 7 70 N
7 stroke 4 40 Y
8 stroke 6 60 N
9 asthma 2 20 Y
10 asthma 8 80 N
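For what it is worth, the same long table can probably be built in base R without any reshaping, which may matter at a million rows since it only needs one pass of colSums(). This is a sketch rather than a benchmarked solution. The names conds, yes and long are mine, and it assumes the condition columns hold only 0 and 1.

```r
# Example data from the question
df <- data.frame(
  cancer = c(1, 0, 0, 0, 0, 1, 0, 0, 0, 0),
  CVD = c(0, 1, 1, 0, 1, 0, 0, 0, 0, 0),
  diab = c(0, 0, 0, 1, 0, 1, 0, 0, 1, 0),
  stroke = c(0, 1, 1, 0, 1, 0, 0, 0, 1, 0),
  asthma = c(0, 0, 0, 0, 1, 1, 0, 0, 0, 0),
  LTC_count = c(1, 2, 2, 1, 4, 3, 0, 0, 2, 0))

conds <- setdiff(names(df), "LTC_count")  # 0/1 condition columns only
yes <- colSums(df[conds])                 # number of people with each condition

# rbind() gives a 2 x k matrix (row 1 = yes, row 2 = no); as.vector() reads it
# column by column, so counts come out as yes, no, yes, no, ...
long <- data.frame(
  condition = rep(conds, each = 2),
  count = as.vector(rbind(yes, nrow(df) - yes)),
  Y_N = rep(c("Y", "N"), times = length(conds))
)
long$freq <- long$count / nrow(df) * 100
long <- long[, c("condition", "count", "freq", "Y_N")]
long
```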
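Answer 2: (score: 1)
With the tidyverse, we can use gather to collapse the data frame into long format as key/value pairs, then group_by key and value and compute the percentage within each key.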
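library(tidyverse)
df %>%
  select(-LTC_count) %>%            # LTC_count is not a 0/1 condition column
  gather() %>%                      # long format: one row per person and condition
  group_by(key, value) %>%
  summarise(freq = n()) %>%         # count of 0s and 1s per condition
  ungroup() %>%
  group_by(key) %>%
  mutate(freq = freq/sum(freq) * 100)   # convert counts to percentages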
# key value freq
# <chr> <dbl> <dbl>
# 1 CVD 0 70
# 2 CVD 1 30
# 3 asthma 0 80
# 4 asthma 1 20
# 5 cancer 0 80
# 6 cancer 1 20
# 7 diab 0 70
# 8 diab 1 30
# 9 stroke 0 60
#10 stroke 1 40
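Note: I have ignored the LTC_count column, since it does not appear to take part in the calculation.
Alternatively, as @Jake Kaupp suggested, we can use count to save a few steps: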
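df %>%
  select(-LTC_count) %>%       # LTC_count is not a 0/1 condition column
  gather() %>%
  count(key, value) %>%        # count() replaces group_by() + summarise(n())
  group_by(key) %>%
  mutate(n = n/sum(n) * 100)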
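gather() has since been superseded by pivot_longer() in tidyr. The sketch below uses it to reproduce the columns of the Prop_allcond table from the question (Y_N, count, freq, condition). It assumes tidyr 1.0.0 or later and a recent dplyr, and the result name prop_long is mine.

```r
library(dplyr)
library(tidyr)

# Example data from the question
df <- data.frame(
  cancer = c(1, 0, 0, 0, 0, 1, 0, 0, 0, 0),
  CVD = c(0, 1, 1, 0, 1, 0, 0, 0, 0, 0),
  diab = c(0, 0, 0, 1, 0, 1, 0, 0, 1, 0),
  stroke = c(0, 1, 1, 0, 1, 0, 0, 0, 1, 0),
  asthma = c(0, 0, 0, 0, 1, 1, 0, 0, 0, 0),
  LTC_count = c(1, 2, 2, 1, 4, 3, 0, 0, 2, 0))

prop_long <- df %>%
  select(-LTC_count) %>%                        # keep only the 0/1 columns
  pivot_longer(everything(),
               names_to = "condition",
               values_to = "Y_N") %>%           # one row per person and condition
  count(condition, Y_N, name = "count") %>%     # number of 0s and 1s per condition
  group_by(condition) %>%
  mutate(freq = round(count / sum(count) * 100, digits = 1)) %>%
  ungroup()
prop_long
```

On a million rows and many conditions the long intermediate table gets large (rows times conditions), so a column-wise summary such as colSums() may be lighter on memory if that becomes a problem.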