Simplifying code to get the proportions of multiple diseases in a population

Time: 2018-11-21 00:56:52

Tags: r group-by dplyr bind

I have data that looks like this:

df <- data.frame (
cancer = c(1, 0, 0, 0, 0, 1, 0, 0, 0, 0),
CVD =    c(0, 1, 1, 0, 1, 0, 0, 0, 0, 0),
diab =   c(0, 0, 0, 1, 0, 1, 0, 0, 1, 0),
stroke = c(0, 1, 1, 0, 1, 0, 0, 0, 1, 0),
asthma = c(0, 0, 0, 0, 1, 1, 0, 0, 0, 0),
LTC_count = c(1, 2, 2, 1, 4, 3, 0, 0, 2, 0))

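My real data is much larger, approximately one million rows. Each row is a person, and each variable indicates whether that person has the corresponding disease (1 = yes).

What I want is a data frame containing the proportion of the population with and without each condition.

This is what I currently do to produce the desired output:

1) Build the population proportions for each condition separately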

library(dplyr)

Prop_cancer <- df %>%
group_by(cancer) %>%
summarise(count = n()) %>%
mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
mutate(condition = "cancer") %>%
rename(Y_N = cancer) 

Prop_CVD <- df %>%
group_by(CVD) %>%
summarise(count = n()) %>%
mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
mutate(condition = "CVD") %>%
rename(Y_N = CVD)

Prop_diab <- df %>%
group_by(diab) %>%
summarise(count = n()) %>%
mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
mutate(condition = "diab") %>%
rename(Y_N = diab)

Prop_stroke <- df %>%
group_by(stroke) %>%
summarise(count = n()) %>%
mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
mutate(condition = "stroke") %>%
rename(Y_N = stroke)

Prop_asthma <- df %>%
group_by(asthma) %>%
summarise(count = n()) %>%
mutate(freq = round((count / sum(count))*100, digits = 1)) %>%
mutate(condition = "asthma") %>%
rename(Y_N = asthma)

2) Bind all of these together

Prop_allcond <- bind_rows(Prop_cancer, Prop_CVD, Prop_stroke, Prop_diab, Prop_asthma)

I have many conditions and a lot of data. Is there a simpler/faster way to do this?

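I considered creating a new variable "condition" in the original data frame with ifelse statements, but that does not allow a person to have more than one condition, and the conditions would take priority in the order in which I specify them.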
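A minimal sketch of that ifelse approach on the sample data illustrates the limitation. The copy `df_demo` and the "none" label are assumptions made only for this illustration.

```r
df <- data.frame(
  cancer = c(1, 0, 0, 0, 0, 1, 0, 0, 0, 0),
  CVD    = c(0, 1, 1, 0, 1, 0, 0, 0, 0, 0),
  diab   = c(0, 0, 0, 1, 0, 1, 0, 0, 1, 0),
  stroke = c(0, 1, 1, 0, 1, 0, 0, 0, 1, 0),
  asthma = c(0, 0, 0, 0, 1, 1, 0, 0, 0, 0),
  LTC_count = c(1, 2, 2, 1, 4, 3, 0, 0, 2, 0))

# Work on a copy so the original data frame stays unchanged
df_demo <- df

# Nested ifelse: the first test that is TRUE wins, so each person
# ends up with exactly one label, in the order the tests are written
df_demo$condition <- ifelse(df_demo$cancer == 1, "cancer",
                     ifelse(df_demo$CVD    == 1, "CVD",
                     ifelse(df_demo$diab   == 1, "diab",
                     ifelse(df_demo$stroke == 1, "stroke",
                     ifelse(df_demo$asthma == 1, "asthma", "none")))))

# Person 6 has cancer, diab and asthma, but is labelled only "cancer"
df_demo$condition[6]

# Counts by label undercount every condition that is tested later
table(df_demo$condition)
```

Three people in the sample have diab, yet only two rows carry the "diab" label, and stroke and asthma never appear at all.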

Any advice on how to simplify this, so that the code is not so long, would be much appreciated.

3 Answers:

Answer 0: (score: 2)

Percentage of the population with each disease:

colSums(df) / nrow(df) * 100
#cancer       CVD      diab    stroke    asthma LTC_count 
#20        30        30        40        20       150 
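Note that the LTC_count value above is not a percentage, since that column is a count rather than a 0/1 flag. A minimal base R sketch that drops it and also reports the "no" percentages might look like this; the names `cond_cols`, `pct_yes`, `pct_no` and `pct_table` are illustrative assumptions.

```r
df <- data.frame(
  cancer = c(1, 0, 0, 0, 0, 1, 0, 0, 0, 0),
  CVD    = c(0, 1, 1, 0, 1, 0, 0, 0, 0, 0),
  diab   = c(0, 0, 0, 1, 0, 1, 0, 0, 1, 0),
  stroke = c(0, 1, 1, 0, 1, 0, 0, 0, 1, 0),
  asthma = c(0, 0, 0, 0, 1, 1, 0, 0, 0, 0),
  LTC_count = c(1, 2, 2, 1, 4, 3, 0, 0, 2, 0))

# Keep only the 0/1 condition columns
cond_cols <- setdiff(names(df), "LTC_count")

# The mean of a logical column is the share of TRUE values
pct_yes <- colMeans(df[cond_cols] == 1) * 100
pct_no  <- 100 - pct_yes

# One row for "yes", one row for "no", one column per condition
pct_table <- rbind(Y = pct_yes, N = pct_no)
pct_table
```

Because this works column-wise on a matrix, it should stay cheap for around a million rows, although that has not been measured here.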

Answer 1: (score: 2)

With dplyr this can be done in a single line, without gathering or any other reshaping:

df %>% summarize_at(vars(-LTC_count),funs(sum(.)/n()))
  cancer CVD diab stroke asthma
1    0.2 0.3  0.3    0.4    0.2

If we want both the yes and the no frequencies:

bind_rows("Y"=summarize_at(df,vars(-LTC_count),funs(sum(.)/n()*100)), 
  "N"=summarize_at(df,vars(-LTC_count),funs(sum(!.)/n()*100)),.id="id")

  id cancer CVD diab stroke asthma
1  Y     20  30   30     40     20
2  N     80  70   70     60     80
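Since funs() has been deprecated in dplyr, the same yes/no table can probably be written with across() instead. This is a sketch assuming dplyr 1.0.0 or later; the name `yes_no` is an illustrative assumption.

```r
library(dplyr)

df <- data.frame(
  cancer = c(1, 0, 0, 0, 0, 1, 0, 0, 0, 0),
  CVD    = c(0, 1, 1, 0, 1, 0, 0, 0, 0, 0),
  diab   = c(0, 0, 0, 1, 0, 1, 0, 0, 1, 0),
  stroke = c(0, 1, 1, 0, 1, 0, 0, 0, 1, 0),
  asthma = c(0, 0, 0, 0, 1, 1, 0, 0, 0, 0),
  LTC_count = c(1, 2, 2, 1, 4, 3, 0, 0, 2, 0))

# across() applies the same summary to every column except LTC_count;
# the names Y and N become the values of the id column
yes_no <- bind_rows(
  Y = summarise(df, across(-LTC_count, ~ mean(.x == 1) * 100)),
  N = summarise(df, across(-LTC_count, ~ mean(.x == 0) * 100)),
  .id = "id"
)
yes_no
```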

In response to your request for a long-format data set, I could do the following, but frankly you are probably better off using @Ronak's solution:

df1<-bind_rows("Y"=summarize_at(df,vars(-LTC_count),funs(count=sum(.), freq=sum(.)/n()*100)), 
                 "N"=summarize_at(df,vars(-LTC_count),funs(count=sum(!.), freq=sum(!.)/n()*100)),.id="Y_N")

df1<-bind_cols(select(gather(df1,"condition","count",ends_with("_count")),-ends_with("freq")),
          select(gather(df1,"condition","freq",ends_with("_freq")),freq))[,c(2,3,4,1)]

df1$condition<-gsub("_count","",df1$condition)

   condition count freq Y_N
1     cancer     2   20   Y
2     cancer     8   80   N
3        CVD     3   30   Y
4        CVD     7   70   N
5       diab     3   30   Y
6       diab     7   70   N
7     stroke     4   40   Y
8     stroke     6   60   N
9     asthma     2   20   Y
10    asthma     8   80   N

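Answer 2: (score: 1)

With the tidyverse, we can use gather to collapse the data frame into long format as key-value pairs, then group_by key and value and compute the percentage within each key.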

library(tidyverse)

df %>%
  gather(key, value, -LTC_count) %>%   # keep LTC_count out of the long format
  group_by(key, value) %>%
  summarise(freq = n()) %>%
  ungroup() %>%
  group_by(key) %>%
  mutate(freq = freq/sum(freq) * 100)


#   key    value  freq
#   <chr>  <dbl> <dbl>
# 1 CVD        0    70
# 2 CVD        1    30
# 3 asthma     0    80
# 4 asthma     1    20
# 5 cancer     0    80
# 6 cancer     1    20
# 7 diab       0    70
# 8 diab       1    30
# 9 stroke     0    60
#10 stroke     1    40

Note: I have ignored the LTC_count column, since it does not appear to take part in the calculation.


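Alternatively, as @Jake Kaupp suggested, we can use count to save a few steps:

df %>%
  gather(key, value, -LTC_count) %>%   # keep LTC_count out of the long format
  count(key, value) %>%
  group_by(key) %>%
  mutate(n = n/sum(n) * 100)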
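In current tidyr, gather() is superseded by pivot_longer(). The following sketch uses it to reproduce the column layout asked for in the question (Y_N, count, freq, condition). It assumes tidyr 1.0.0 or later, and the name `prop_long` is an illustrative assumption.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  cancer = c(1, 0, 0, 0, 0, 1, 0, 0, 0, 0),
  CVD    = c(0, 1, 1, 0, 1, 0, 0, 0, 0, 0),
  diab   = c(0, 0, 0, 1, 0, 1, 0, 0, 1, 0),
  stroke = c(0, 1, 1, 0, 1, 0, 0, 0, 1, 0),
  asthma = c(0, 0, 0, 0, 1, 1, 0, 0, 0, 0),
  LTC_count = c(1, 2, 2, 1, 4, 3, 0, 0, 2, 0))

prop_long <- df %>%
  select(-LTC_count) %>%
  # One row per person and condition
  pivot_longer(everything(), names_to = "condition", values_to = "Y_N") %>%
  # Number of people per condition and 0/1 value
  count(condition, Y_N, name = "count") %>%
  # Percentages are computed within each condition
  group_by(condition) %>%
  mutate(freq = round(count / sum(count) * 100, digits = 1)) %>%
  ungroup()

prop_long
```

One caveat: a condition that nobody has (or everybody has) yields a single row here, because count() only reports value combinations that occur in the data.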