ggplot2 geom_histogram: plotting density bars for samples drawn from a mixture of two weighted distributions

Time: 2018-09-06 06:31:51

Tags: r ggplot2 mixture-model

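First, I have a mixture of two distributions (each one contributes a share of the mixture), and I know which distribution each sample came from. I want to plot a histogram of the samples on a density scale, together with the density of the mixture distribution.

Let's look at the code first (seg 1):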


library(ggplot2)

# two components
set.seed(1)    # for reproducible example
b1 <- rnorm(900000, mean=8, sd=2) # samples
b2 <- rnorm(100000, mean=17, sd=2)

# densities corresponding to samples
d = dnorm(c(b1, b2), mean = 8, sd = 2)*.9 + dnorm(c(b1, b2), mean = 17, sd = 2)*.1 

# ground truth
b <- data.frame(ss=c(b1,b2), dd=d, gg=factor(c(rep(1, length(b1)), rep(2, length(b2))))) 

# sample from mixed distribution
c <- b[sample(nrow(b), 500000),] 

ggplot(data = c, aes(x = ss)) +
  geom_histogram(aes(y = stat(density)), binwidth = .5, alpha = .3, position="identity") +
  geom_line(data = c, aes(x = ss, y = dd), color = "red", inherit.aes = FALSE)

This result looks good.

But I want to fill the bars with a color per sample group, so I changed the code (seg 2):

ggplot(data=c, aes(x=ss)) +
  geom_histogram(aes(y=stat(density), fill=gg, color=gg), 
                 binwidth=.5, alpha=.3, position="identity") +
  geom_line(data=c, aes(x=ss, y=dd), color="red", inherit.aes=FALSE)

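The result is wrong. R computes the density of the two groups separately, so each group is normalized on its own and the two parts end up with the same height.

Then I found some other approaches, such as this one (seg 3):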
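The per-group normalization can be checked numerically without plotting. The following is a minimal sketch in base R, assuming smaller sample sizes than in the question and illustrative names (`x1`, `x2`, `h1`, `h2`, `area1`, `area2`) that are not part of the original code. It bins each group on its own, which is roughly what happens when `fill=gg` splits the data into groups, and shows that each group's bars integrate to 1 regardless of the 0.9 / 0.1 weights.

```r
set.seed(1)
x1 <- rnorm(9000, mean = 8,  sd = 2)   # 90% component (illustrative size)
x2 <- rnorm(1000, mean = 17, sd = 2)   # 10% component (illustrative size)

binwidth <- 0.5
# Integer end points with a 0.5 step guarantee the breaks cover every sample
breaks <- seq(floor(min(x1, x2)), ceiling(max(x1, x2)), by = binwidth)

# Density computed within each group: count / (group size * bin width)
h1 <- hist(x1, breaks = breaks, plot = FALSE)
h2 <- hist(x2, breaks = breaks, plot = FALSE)

# Each group integrates to 1 on its own, so the mixture weights are lost
area1 <- sum(h1$density * binwidth)
area2 <- sum(h2$density * binwidth)
print(c(area1 = area1, area2 = area2))
```

Since both components have the same standard deviation, equal areas also mean roughly equal peak heights, which matches what seg 2 shows.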

breaks = seq(min(c$ss), max(c$ss), .5) # form cut points
bins1 = cut(with(c, ss[gg==1]), breaks) # form intervals by cutting
bins2 = cut(with(c, ss[gg==2]), breaks)
cnt1 = sapply(split(with(c, ss[gg==1]), bins1), length) # assign points to its interval
cnt2 = sapply(split(with(c, ss[gg==2]), bins2), length)
h = data.frame(
  x = head(breaks, -1)+.25,
  dens1 = cnt1/sum(cnt1,cnt2), # height of density bar
  dens2 = cnt2/sum(cnt1,cnt2)
  # weight = sapply(split(samples.mixgamma$samples, bins), sum)
)
ggplot(h) +
  geom_bar(aes(x, dens1), fill="red", alpha = .3, stat="identity") +
  geom_bar(aes(x, dens2), fill="blue", alpha = .3, stat="identity") +
  geom_line(data=c, aes(x=ss, y=dd), color="red", inherit.aes=FALSE)

Or setting y=stat(count)/sum(stat(count)) like this (seg 4):

ggplot(data=c, aes(x=ss)) +
  geom_histogram(aes(y=stat(count)/sum(stat(count)), fill=gg, color=gg), 
                 binwidth=.5, alpha=.3, position="identity") +
  geom_line(data=c, aes(x=ss, y=dd), color="red", inherit.aes=FALSE)


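So, how can I fill the two groups with different colors as in seg 2, keep the correct proportions relative to the mixture as in seg 1, and avoid the errors of seg 3 and seg 4?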



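The solution: the probability density should be computed as y=stat(count)/.5/sum(stat(count)), where .5 is the bin width. I had only normalized the counts, without dividing the mass by the volume. Answers like this one (seg 3) therefore need the same correction.

0 answers: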
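For reference, here is a minimal sketch of that fix, assuming ggplot2 3.3.0 or later, where `after_stat()` is the newer spelling of `stat()`. The smaller sample sizes and the names `df`, `binwidth`, `p`, `bars`, `total_area` and `group_area` are illustrative and not from the original code. Each bar height is the group's count in the bin divided by the bin width and by the total count over both groups, so the bars of a group integrate to that group's share of the sample.

```r
library(ggplot2)

set.seed(1)
binwidth <- 0.5
df <- data.frame(
  ss = c(rnorm(9000, mean = 8, sd = 2), rnorm(1000, mean = 17, sd = 2)),
  gg = factor(rep(c(1, 2), times = c(9000, 1000)))
)
# Mixture density evaluated at each sample
df$dd <- 0.9 * dnorm(df$ss, mean = 8, sd = 2) + 0.1 * dnorm(df$ss, mean = 17, sd = 2)

p <- ggplot(df, aes(x = ss)) +
  geom_histogram(
    # count / bin width / total count over both groups
    aes(y = after_stat(count / binwidth / sum(count)), fill = gg, color = gg),
    binwidth = binwidth, alpha = .3, position = "identity"
  ) +
  geom_line(aes(y = dd), color = "red")

# Inspect the computed bars instead of relying on the picture
bars <- layer_data(p, 1)
total_area <- sum(bars$y * binwidth)                          # both groups together
group_area <- tapply(bars$y * binwidth, bars$group, sum)      # per group
print(total_area)
print(group_area)
```

Printing `p` draws the plot. The per-group areas should come out as the group shares of the sample (0.9 and 0.1 here), which is why the colored bars now line up with the red mixture curve instead of each rising to the same height.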
