Question

来自原始数据和汇总数据的相同图

对于以下数据结构

dsN<-data.frame(
  id=rep(1:100, each=4),
  yearF=factor(rep(2001:2004, 100)),
  attendF=sample(1:8, 400, T, c(.2,.2,.15,.10,.10, .20, .15, .02))
)
dsN[sample(which(dsN$yearF==2001), 5), "attendF"]<-NA
dsN[sample(which(dsN$yearF==2002), 10), "attendF"]<-NA
dsN[sample(which(dsN$yearF==2003), 15), "attendF"]<-NA
dsN[sample(which(dsN$yearF==2004), 20), "attendF"]<-NA

attcol8<-c("Never"="#4575b4",
           "Once or Twice"="#74add1",
           "Less than once/month"="#abd9e9",
           "About once/month"="#e0f3f8",
           "About twice/month"="#fee090",
           "About once/week"="#fdae61",
           "Several times/week"="#f46d43",
           "Everyday"="#d73027")
dsN$attendF<-factor(dsN$attendF, levels=1:8, labels=names(attcol8))
head(dsN,13)

   id yearF              attendF
1   1  2001      About once/week
2   1  2002     About once/month
3   1  2003      About once/week
4   1  2004                 <NA>
5   2  2001 Less than once/month
6   2  2002      About once/week
7   2  2003      About once/week
8   2  2004   Several times/week
9   3  2001        Once or Twice
10  3  2002      About once/week
11  3  2003                 <NA>
12  3  2004        Once or Twice
13  4  2001   Several times/week

我们可以获得一系列堆积条形图

require(ggplot2)
# p<- ggplot( subset(dsN,!is.na(attendF)), aes(x=yearF, fill=attendF)) # leaving NA out of
p<- ggplot( dsN, aes(x=yearF, fill=attendF))  #  keeping NA in calculations
p<- p+ geom_bar(position="fill")
p<- p+ scale_fill_manual(values = attcol8,
                         name="Response category" )
p<- p+ scale_y_continuous("Prevalence: proportion of total",
                          limits=c(0, 1),
                          breaks=c(.1,.2,.3,.4,.5,.6,.7,.8,.9,1))
p<- p+ scale_x_discrete("Waves of measurement",
                        limits=as.character(c(2000:2005)))
p<- p+ labs(title=paste0("In the past year, how often have you attended a worship service?"))
p

enter image description here

上图是根据原始数据生成的。但是，有时候方便从汇总数据生成图表，特别是如果一个需要控制统计功能。下面是dsN的转换到只包含实际映射到的值的ds 上图：

require(dplyr)
ds<- dsN %.%
  dplyr::filter(!is.na(attendF)) %.%
  dplyr::group_by(yearF,attendF) %.%
  dplyr::summarize(count = sum(attendF)) %.%
  dplyr::mutate(total = sum(count),
              percent= count/total)
head(ds,10)

    Source: local data frame [10 x 5]
    Groups: yearF

       yearF              attendF count total percent
    1   2001                Never    18   373 0.04826
    2   2001        Once or Twice    36   373 0.09651
    3   2001 Less than once/month    30   373 0.08043
    4   2001     About once/month    32   373 0.08579
    5   2001    About twice/month    40   373 0.10724
    6   2001      About once/week    90   373 0.24129
    7   2001   Several times/week   119   373 0.31903
    8   2001             Everyday     8   373 0.02145
    9   2002                Never    11   355 0.03099
    10  2002        Once or Twice    44   355 0.12394

# verify
summarize(filter(ds, yearF==2001), should.be.one=sum(percent))
```

    Source: local data frame [1 x 2]

      yearF should.be.one
    1  2001             1

问题：

如何使用此摘要数据集从上方重新创建图形 ds？

Answer 1

嗯，部分问题是您的摘要不正确。如果要在总计中正确计算NA值，则需要将NA值保留在那里。也许试试

ds<- dsN %.%
  dplyr::group_by(yearF,attendF) %.%
  dplyr::summarize(count = length(attendF)) %.%
  dplyr::mutate(total = sum(count, na.rm=T),
              percent= count/total)

然后，要使用汇总数据，您只需稍微更改前两行

p<- ggplot( ds, aes(x=yearF, y=percent, fill=attendF))  #  keeping NA in calculations
p<- p+ geom_bar(position="stack", stat="identity")

请注意，我们添加了一个特定的y值，我们告诉geom_bar使用stat="identity"，这意味着使用我们提供的实际y值作为高度。他们会产生相同的图像

enter image description here

Answer 2

评论1回答

正如@MrFlick指出的那样，错误发生在summarize（）中的计算公式中。但是，是否在总计算中留下缺失值是一项有意义的研究决策。如果我们想要NA计算总数：

ds<- dsN %.%
#   dplyr::filter(!is.na(attendF)) %.%   # comment out to count NA in the total
  dplyr::group_by(yearF,attendF) %.%
  dplyr::summarize(count = length( attendF)) %.%
  dplyr::mutate(total = sum(count),
              percent= count/total)
head(ds,10)

     Source: local data frame [10 x 5]
    Groups: yearF

       yearF              attendF count total percent
    1   2001                Never    23   100    0.23
    2   2001        Once or Twice     9   100    0.09
    3   2001 Less than once/month    16   100    0.16
    4   2001     About once/month    11   100    0.11
    5   2001    About twice/month     3   100    0.03
    6   2001      About once/week    21   100    0.21
    7   2001   Several times/week     9   100    0.09
    8   2001             Everyday     3   100    0.03
    9   2001                   NA     5   100    0.05
    10  2002                Never    17   100    0.17

缺失值用于计算要显示的总响应研究中的自然减员。

p<- ggplot( ds, aes(x=yearF, y=percent, fill=attendF))  #  keeping NA in calculations
p<- p+ geom_bar(position="stack", stat="identity")
p<- p+ scale_fill_manual(values = attcol8,
                         name="Response category" )
p<- p+ scale_y_continuous("Prevalence: proportion of total",
                          limits=c(0, 1),
                          breaks=c(.1,.2,.3,.4,.5,.6,.7,.8,.9,1))
p<- p+ scale_x_discrete("Waves of measurement",
                        limits=as.character(c(2000:2005)))
p<- p+ labs(title=paste0("In the past year, how often have you attended a worship service?"))
p

enter image description here

然而，假设消耗与之无明显关联结果衡量标准，看看它是多么相关是有意义的反应背书的流行程度会随着时间的推移而变化，或者可能会停留处于均衡状态。为此，我们需要从中删除缺失值计算答复总数：

ds<- dsN %.%
  dplyr::filter(!is.na(attendF)) %.%   # comment out to count NA in the total
  dplyr::group_by(yearF,attendF) %.%
  dplyr::summarize(count = length( attendF)) %.%
  dplyr::mutate(total = sum(count),
              percent= count/total)
head(ds,10)

    Source: local data frame [10 x 5]
    Groups: yearF

       yearF              attendF count total percent
    1   2001                Never    23    95 0.24211
    2   2001        Once or Twice     9    95 0.09474
    3   2001 Less than once/month    16    95 0.16842
    4   2001     About once/month    11    95 0.11579
    5   2001    About twice/month     3    95 0.03158
    6   2001      About once/week    21    95 0.22105
    7   2001   Several times/week     9    95 0.09474
    8   2001             Everyday     3    95 0.03158
    9   2002                Never    17    90 0.18889
    10  2002        Once or Twice    23    90 0.25556

图表相应地反映了这一点：

p<- ggplot( ds, aes(x=yearF, y=percent, fill=attendF))  #  keeping NA in calculations
p<- p+ geom_bar(position="stack", stat="identity")
p<- p+ scale_fill_manual(values = attcol8,
                         name="Response category" )
p<- p+ scale_y_continuous("Prevalence: proportion of total",
                          limits=c(0, 1),
                          breaks=c(.1,.2,.3,.4,.5,.6,.7,.8,.9,1))
p<- p+ scale_x_discrete("Waves of measurement",
                        limits=as.character(c(2000:2005)))
p<- p+ labs(title=paste0("In the past year, how often have you attended a worship service?"))
p

enter image description here

谢谢，@ MrFlick！

从原始数据和汇总数据生成相同的图表

来自原始数据和汇总数据的相同图

问题：

2 个答案:

评论1回答