从原始数据和汇总数据生成相同的图表

时间:2014-06-23 02:51:50

标签: r ggplot2 dplyr stacked stackedbarseries

来自原始数据和汇总数据的相同图

对于以下数据结构

dsN<-data.frame(
  id=rep(1:100, each=4),
  yearF=factor(rep(2001:2004, 100)),
  attendF=sample(1:8, 400, T, c(.2,.2,.15,.10,.10, .20, .15, .02))
)
dsN[sample(which(dsN$yearF==2001), 5), "attendF"]<-NA
dsN[sample(which(dsN$yearF==2002), 10), "attendF"]<-NA
dsN[sample(which(dsN$yearF==2003), 15), "attendF"]<-NA
dsN[sample(which(dsN$yearF==2004), 20), "attendF"]<-NA

attcol8<-c("Never"="#4575b4",
           "Once or Twice"="#74add1",
           "Less than once/month"="#abd9e9",
           "About once/month"="#e0f3f8",
           "About twice/month"="#fee090",
           "About once/week"="#fdae61",
           "Several times/week"="#f46d43",
           "Everyday"="#d73027")
dsN$attendF<-factor(dsN$attendF, levels=1:8, labels=names(attcol8))
head(dsN,13)

   id yearF              attendF
1   1  2001      About once/week
2   1  2002     About once/month
3   1  2003      About once/week
4   1  2004                 <NA>
5   2  2001 Less than once/month
6   2  2002      About once/week
7   2  2003      About once/week
8   2  2004   Several times/week
9   3  2001        Once or Twice
10  3  2002      About once/week
11  3  2003                 <NA>
12  3  2004        Once or Twice
13  4  2001   Several times/week

我们可以获得一系列堆积条形图

require(ggplot2)
# p<- ggplot( subset(dsN,!is.na(attendF)), aes(x=yearF, fill=attendF)) # leaving NA out of
p<- ggplot( dsN, aes(x=yearF, fill=attendF))  #  keeping NA in calculations
p<- p+ geom_bar(position="fill")
p<- p+ scale_fill_manual(values = attcol8,
                         name="Response category" )
p<- p+ scale_y_continuous("Prevalence: proportion of total",
                          limits=c(0, 1),
                          breaks=c(.1,.2,.3,.4,.5,.6,.7,.8,.9,1))
p<- p+ scale_x_discrete("Waves of measurement",
                        limits=as.character(c(2000:2005)))
p<- p+ labs(title=paste0("In the past year, how often have you attended a worship service?"))
p

enter image description here

上图是根据原始数据生成的。但是,有时候 方便从汇总数据生成图表,特别是如果一个 需要控制统计功能。下面是dsN的转换 到只包含实际映射到的值的ds 上图:

require(dplyr)
ds<- dsN %.%
  dplyr::filter(!is.na(attendF)) %.%
  dplyr::group_by(yearF,attendF) %.%
  dplyr::summarize(count = sum(attendF)) %.%
  dplyr::mutate(total = sum(count),
              percent= count/total)
head(ds,10)

    Source: local data frame [10 x 5]
    Groups: yearF

       yearF              attendF count total percent
    1   2001                Never    18   373 0.04826
    2   2001        Once or Twice    36   373 0.09651
    3   2001 Less than once/month    30   373 0.08043
    4   2001     About once/month    32   373 0.08579
    5   2001    About twice/month    40   373 0.10724
    6   2001      About once/week    90   373 0.24129
    7   2001   Several times/week   119   373 0.31903
    8   2001             Everyday     8   373 0.02145
    9   2002                Never    11   355 0.03099
    10  2002        Once or Twice    44   355 0.12394

# verify
summarize(filter(ds, yearF==2001), should.be.one=sum(percent))
```

    Source: local data frame [1 x 2]

      yearF should.be.one
    1  2001             1

问题:

如何使用此摘要数据集从上方重新创建图形 ds

2 个答案:

答案 0 :(得分:2)

嗯,部分问题是您的摘要不正确。如果要在总计中正确计算NA值,则需要将NA值保留在那里。也许试试

ds<- dsN %.%
  dplyr::group_by(yearF,attendF) %.%
  dplyr::summarize(count = length(attendF)) %.%
  dplyr::mutate(total = sum(count, na.rm=T),
              percent= count/total)

然后,要使用汇总数据,您只需稍微更改前两行

p<- ggplot( ds, aes(x=yearF, y=percent, fill=attendF))  #  keeping NA in calculations
p<- p+ geom_bar(position="stack", stat="identity")

请注意,我们添加了一个特定的y值,我们告诉geom_bar使用stat="identity",这意味着使用我们提供的实际y值作为高度。他们会产生相同的图像

enter image description here

答案 1 :(得分:0)

评论1回答

正如@MrFlick指出的那样,错误发生在summarize()中的计算公式中。但是,是否在总计算中留下缺失值是一项有意义的研究决策。 如果我们想要NA计算总数:

ds<- dsN %.%
#   dplyr::filter(!is.na(attendF)) %.%   # comment out to count NA in the total
  dplyr::group_by(yearF,attendF) %.%
  dplyr::summarize(count = length( attendF)) %.%
  dplyr::mutate(total = sum(count),
              percent= count/total)
head(ds,10)

     Source: local data frame [10 x 5]
    Groups: yearF

       yearF              attendF count total percent
    1   2001                Never    23   100    0.23
    2   2001        Once or Twice     9   100    0.09
    3   2001 Less than once/month    16   100    0.16
    4   2001     About once/month    11   100    0.11
    5   2001    About twice/month     3   100    0.03
    6   2001      About once/week    21   100    0.21
    7   2001   Several times/week     9   100    0.09
    8   2001             Everyday     3   100    0.03
    9   2001                   NA     5   100    0.05
    10  2002                Never    17   100    0.17

缺失值用于计算要显示的总响应 研究中的自然减员。

p<- ggplot( ds, aes(x=yearF, y=percent, fill=attendF))  #  keeping NA in calculations
p<- p+ geom_bar(position="stack", stat="identity")
p<- p+ scale_fill_manual(values = attcol8,
                         name="Response category" )
p<- p+ scale_y_continuous("Prevalence: proportion of total",
                          limits=c(0, 1),
                          breaks=c(.1,.2,.3,.4,.5,.6,.7,.8,.9,1))
p<- p+ scale_x_discrete("Waves of measurement",
                        limits=as.character(c(2000:2005)))
p<- p+ labs(title=paste0("In the past year, how often have you attended a worship service?"))
p

enter image description here

然而,假设消耗与之无明显关联 结果衡量标准,看看它是多么相关是有意义的 反应背书的流行程度会随着时间的推移而变化,或者可能会停留 处于均衡状态。为此,我们需要从中删除缺失值 计算答复总数:

ds<- dsN %.%
  dplyr::filter(!is.na(attendF)) %.%   # comment out to count NA in the total
  dplyr::group_by(yearF,attendF) %.%
  dplyr::summarize(count = length( attendF)) %.%
  dplyr::mutate(total = sum(count),
              percent= count/total)
head(ds,10)

    Source: local data frame [10 x 5]
    Groups: yearF

       yearF              attendF count total percent
    1   2001                Never    23    95 0.24211
    2   2001        Once or Twice     9    95 0.09474
    3   2001 Less than once/month    16    95 0.16842
    4   2001     About once/month    11    95 0.11579
    5   2001    About twice/month     3    95 0.03158
    6   2001      About once/week    21    95 0.22105
    7   2001   Several times/week     9    95 0.09474
    8   2001             Everyday     3    95 0.03158
    9   2002                Never    17    90 0.18889
    10  2002        Once or Twice    23    90 0.25556

图表相应地反映了这一点:

p<- ggplot( ds, aes(x=yearF, y=percent, fill=attendF))  #  keeping NA in calculations
p<- p+ geom_bar(position="stack", stat="identity")
p<- p+ scale_fill_manual(values = attcol8,
                         name="Response category" )
p<- p+ scale_y_continuous("Prevalence: proportion of total",
                          limits=c(0, 1),
                          breaks=c(.1,.2,.3,.4,.5,.6,.7,.8,.9,1))
p<- p+ scale_x_discrete("Waves of measurement",
                        limits=as.character(c(2000:2005)))
p<- p+ labs(title=paste0("In the past year, how often have you attended a worship service?"))
p

enter image description here

谢谢,@ MrFlick!