Given the following data structure:
dsN<-data.frame(
id=rep(1:100, each=4),
yearF=factor(rep(2001:2004, 100)),
attendF=sample(1:8, 400, T, c(.2,.2,.15,.10,.10, .20, .15, .02))
)
dsN[sample(which(dsN$yearF==2001), 5), "attendF"]<-NA
dsN[sample(which(dsN$yearF==2002), 10), "attendF"]<-NA
dsN[sample(which(dsN$yearF==2003), 15), "attendF"]<-NA
dsN[sample(which(dsN$yearF==2004), 20), "attendF"]<-NA
attcol8<-c("Never"="#4575b4",
"Once or Twice"="#74add1",
"Less than once/month"="#abd9e9",
"About once/month"="#e0f3f8",
"About twice/month"="#fee090",
"About once/week"="#fdae61",
"Several times/week"="#f46d43",
"Everyday"="#d73027")
dsN$attendF<-factor(dsN$attendF, levels=1:8, labels=names(attcol8))
head(dsN,13)
id yearF attendF
1 1 2001 About once/week
2 1 2002 About once/month
3 1 2003 About once/week
4 1 2004 <NA>
5 2 2001 Less than once/month
6 2 2002 About once/week
7 2 2003 About once/week
8 2 2004 Several times/week
9 3 2001 Once or Twice
10 3 2002 About once/week
11 3 2003 <NA>
12 3 2004 Once or Twice
13 4 2001 Several times/week
we can obtain a series of stacked bar charts:
require(ggplot2)
# p<- ggplot( subset(dsN,!is.na(attendF)), aes(x=yearF, fill=attendF)) # leaving NA out of
p<- ggplot( dsN, aes(x=yearF, fill=attendF)) # keeping NA in calculations
p<- p+ geom_bar(position="fill")
p<- p+ scale_fill_manual(values = attcol8,
name="Response category" )
p<- p+ scale_y_continuous("Prevalence: proportion of total",
limits=c(0, 1),
breaks=c(.1,.2,.3,.4,.5,.6,.7,.8,.9,1))
p<- p+ scale_x_discrete("Waves of measurement",
limits=as.character(c(2000:2005)))
p<- p+ labs(title=paste0("In the past year, how often have you attended a worship service?"))
p
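The graph above is produced from the raw data. However, it is sometimes convenient to produce graphs from summarized data, especially if one needs control over the statistical functions. Below is the transformation of dsN into ds, which contains only the values that are actually mapped onto the graph above: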
require(dplyr)
ds<- dsN %>%
dplyr::filter(!is.na(attendF)) %>%
dplyr::group_by(yearF,attendF) %>%
dplyr::summarize(count = sum(attendF)) %>%
dplyr::mutate(total = sum(count),
percent= count/total)
head(ds,10)
Source: local data frame [10 x 5]
Groups: yearF
yearF attendF count total percent
1 2001 Never 18 373 0.04826
2 2001 Once or Twice 36 373 0.09651
3 2001 Less than once/month 30 373 0.08043
4 2001 About once/month 32 373 0.08579
5 2001 About twice/month 40 373 0.10724
6 2001 About once/week 90 373 0.24129
7 2001 Several times/week 119 373 0.31903
8 2001 Everyday 8 373 0.02145
9 2002 Never 11 355 0.03099
10 2002 Once or Twice 44 355 0.12394
# verify
summarize(filter(ds, yearF==2001), should.be.one=sum(percent))
Source: local data frame [1 x 2]
yearF should.be.one
1 2001 1
How can I recreate the graph above using this summary dataset ds?
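Answer 0 (score: 2)
Well, part of the problem is that your summary is incorrect: sum(attendF) does not count rows, whereas length(attendF) does. Also, if you want the NA values to be counted correctly in the total, you need to keep them in the data. Perhaps try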
ds<- dsN %>%
dplyr::group_by(yearF,attendF) %>%
dplyr::summarize(count = length(attendF)) %>%
dplyr::mutate(total = sum(count, na.rm=T),
percent= count/total)
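To illustrate why the original count was off, here is a minimal sketch on a hypothetical toy factor (the names f, codes_sum, n_rows and per_level are made up for this example). The counts shown in the question (18, 36, 30, ...) appear consistent with adding up the factor's underlying integer codes rather than counting observations; the sketch makes that difference explicit with as.integer().

```r
# Toy factor built the same way as attendF: integer codes 1..3 with labels
f <- factor(c(1, 1, 2, 3, 3, 3), levels = 1:3,
            labels = c("low", "mid", "high"))

# Adding up the underlying integer codes: 1 + 1 + 2 + 3 + 3 + 3
codes_sum <- sum(as.integer(f))

# Counting observations, which is what a frequency needs
n_rows <- length(f)

# Counts per level (NA values are excluded by default)
per_level <- table(f)
```

Only n_rows and per_level describe how many responses there are; codes_sum grows with the position of the level, which is why higher categories looked inflated.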
Then, to use the summarized data, you only need to change the first two lines slightly:
p<- ggplot( ds, aes(x=yearF, y=percent, fill=attendF)) # keeping NA in calculations
p<- p+ geom_bar(position="stack", stat="identity")
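Note that we added an explicit y value and told geom_bar to use stat="identity", which means the actual y values we supply are used as the bar heights. Both approaches produce the same image.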
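As a rough sketch of the same idea in a self-contained form, the snippet below builds a small hypothetical data set (toy, with made-up categories and an arbitrary seed), summarizes it, and creates both versions of the plot. It assumes a reasonably recent dplyr and ggplot2, where count() and geom_col() are available; geom_col() is used here as a shorthand for geom_bar(stat="identity").

```r
library(dplyr)
library(ggplot2)

set.seed(1)  # arbitrary seed, only so the toy data is reproducible
toy <- data.frame(
  yearF   = factor(rep(2001:2002, each = 50)),
  attendF = factor(sample(c("Never", "Sometimes", "Often"), 100, replace = TRUE),
                   levels = c("Never", "Sometimes", "Often"))
)

# Summarized version: one row per year/category with its share of the year
toy_sum <- toy %>%
  count(yearF, attendF, name = "count") %>%
  group_by(yearF) %>%
  mutate(percent = count / sum(count)) %>%
  ungroup()

# Plot from raw data: ggplot counts rows and rescales each bar to 1
p_raw <- ggplot(toy, aes(x = yearF, fill = attendF)) +
  geom_bar(position = "fill")

# Plot from summarized data: bar heights are taken directly from percent
p_sum <- ggplot(toy_sum, aes(x = yearF, y = percent, fill = attendF)) +
  geom_col(position = "stack")
```

Printing p_raw and p_sum should show the same bars, since the shares in toy_sum add up to 1 within each year.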
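Answer 1 (score: 0)
As @MrFlick pointed out, the error was in the formula used inside summarize(). However, whether to leave missing values in the calculation of the total is a substantive research decision.
If we want NA to count toward the total: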
ds<- dsN %>%
# dplyr::filter(!is.na(attendF)) %>% # comment out to count NA in the total
dplyr::group_by(yearF,attendF) %>%
dplyr::summarize(count = length( attendF)) %>%
dplyr::mutate(total = sum(count),
percent= count/total)
head(ds,10)
Source: local data frame [10 x 5]
Groups: yearF
yearF attendF count total percent
1 2001 Never 23 100 0.23
2 2001 Once or Twice 9 100 0.09
3 2001 Less than once/month 16 100 0.16
4 2001 About once/month 11 100 0.11
5 2001 About twice/month 3 100 0.03
6 2001 About once/week 21 100 0.21
7 2001 Several times/week 9 100 0.09
8 2001 Everyday 3 100 0.03
9 2001 NA 5 100 0.05
10 2002 Never 17 100 0.17
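Here the missing values are counted in the total number of responses, so that the graph shows the natural attrition in the study.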
p<- ggplot( ds, aes(x=yearF, y=percent, fill=attendF)) # keeping NA in calculations
p<- p+ geom_bar(position="stack", stat="identity")
p<- p+ scale_fill_manual(values = attcol8,
name="Response category" )
p<- p+ scale_y_continuous("Prevalence: proportion of total",
limits=c(0, 1),
breaks=c(.1,.2,.3,.4,.5,.6,.7,.8,.9,1))
p<- p+ scale_x_discrete("Waves of measurement",
limits=as.character(c(2000:2005)))
p<- p+ labs(title=paste0("In the past year, how often have you attended a worship service?"))
p
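However, assuming that attrition is not significantly associated with the outcome measure, it makes sense to look at how the prevalence of response endorsement changes over time, or whether it stays in equilibrium. To do so, we need to remove the missing values from the calculation of the total number of responses: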
ds<- dsN %>%
dplyr::filter(!is.na(attendF)) %>% # comment out to count NA in the total
dplyr::group_by(yearF,attendF) %>%
dplyr::summarize(count = length( attendF)) %>%
dplyr::mutate(total = sum(count),
percent= count/total)
head(ds,10)
Source: local data frame [10 x 5]
Groups: yearF
yearF attendF count total percent
1 2001 Never 23 95 0.24211
2 2001 Once or Twice 9 95 0.09474
3 2001 Less than once/month 16 95 0.16842
4 2001 About once/month 11 95 0.11579
5 2001 About twice/month 3 95 0.03158
6 2001 About once/week 21 95 0.22105
7 2001 Several times/week 9 95 0.09474
8 2001 Everyday 3 95 0.03158
9 2002 Never 17 90 0.18889
10 2002 Once or Twice 23 90 0.25556
The graph reflects this accordingly:
p<- ggplot( ds, aes(x=yearF, y=percent, fill=attendF)) # NA excluded from calculations
p<- p+ geom_bar(position="stack", stat="identity")
p<- p+ scale_fill_manual(values = attcol8,
name="Response category" )
p<- p+ scale_y_continuous("Prevalence: proportion of total",
limits=c(0, 1),
breaks=c(.1,.2,.3,.4,.5,.6,.7,.8,.9,1))
p<- p+ scale_x_discrete("Waves of measurement",
limits=as.character(c(2000:2005)))
p<- p+ labs(title=paste0("In the past year, how often have you attended a worship service?"))
p
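The two denominators can also be computed side by side, which makes the effect of the research decision easy to inspect. The following is only a sketch on hypothetical toy data (one wave, ten respondents, two of them missing); the column names percent_all and percent_valid are made up for this example and are not part of the original code.

```r
library(dplyr)

# Hypothetical toy data: 3 "Never", 5 "Often", 2 missing responses
toy <- data.frame(
  yearF   = factor(rep(2001, 10)),
  attendF = factor(c("Never", "Never", "Never",
                     "Often", "Often", "Often", "Often", "Often",
                     NA, NA),
                   levels = c("Never", "Often"))
)

both <- toy %>%
  count(yearF, attendF, name = "count") %>%   # NA is kept as its own row
  group_by(yearF) %>%
  mutate(
    # denominator includes missing responses (shows attrition)
    percent_all = count / sum(count),
    # denominator uses valid responses only; undefined for the NA row
    percent_valid = ifelse(is.na(attendF), NA_real_,
                           count / sum(count[!is.na(attendF)]))
  ) %>%
  ungroup()
```

With these numbers, "Never" is 3 of 10 responses when NA is counted and 3 of 8 when it is not, so the same category reads as 0.3 or 0.375 depending on the choice.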
Thanks, @MrFlick!