Preventing a wrong density plot when coloring a histogram by group

Time: 2019-01-24 17:15:26

Tags: r ggplot2 histogram density-plot

Based on some dummy data, I created a histogram with a density plot:

library(ggplot2)
set.seed(1234)
wdata = data.frame(
  sex = factor(rep(c("F", "M"), each=200)),
  weight = c(rnorm(200, 55), rnorm(200, 58))
)
a <- ggplot(wdata, aes(x = weight))

a + geom_histogram(aes(y = ..density..,
                       # color = sex
                       ), 
                   colour="black",
                   fill="white",
                   position = "identity") +
  geom_density(alpha = 0.2,
               # aes(color = sex)
               ) +
  scale_color_manual(values = c("#868686FF", "#EFC000FF"))

Basic Result

The histogram of weight should be colored according to sex, so I used aes(y = ..density.., color = sex) in geom_histogram():

a + geom_histogram(aes(y = ..density..,
                       color = sex
                       ), 
                   colour="black",
                   fill="white",
                   position = "identity") +
  geom_density(alpha = 0.2,
               # aes(color = sex)
               ) +
  scale_color_manual(values = c("#868686FF", "#EFC000FF"))

Scaled individual histograms (not desired)

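As desired, the density plot stays the same (a single curve across both groups), but the histogram is scaled up (the groups now seem to be treated individually):

How can I prevent this? I need individually colored histogram bars, but one joint density plot across all colored groups.

P.S. Using aes(color = sex) in geom_density() brings everything back to the original scale, but I do not want individual density curves (as shown below):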

a + geom_histogram(aes(y = ..density..,
                       color = sex
                       ), 
                   colour="black",
                   fill="white",
                   position = "identity") +
  geom_density(alpha = 0.2,
               aes(color = sex)
               ) +
  scale_color_manual(values = c("#868686FF", "#EFC000FF"))

Individual densities (not desired)

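Edit:

As has already been suggested, dividing by the number of groups in geom_histogram()'s aesthetics, i.e. y = ..density../2, approximates the solution. However, this only works for symmetric distributions, as in the first output below: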

a + geom_histogram(aes(y = ..density../2,
                       color = sex
                       ), 
                   colour="black",
                   fill="white",
                   position = "identity") +
  geom_density(alpha = 0.2) +
  scale_color_manual(values = c("#868686FF", "#EFC000FF"))

which yields

Solution

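Less symmetric distributions, however, can cause trouble with this approach. See below, where y = ..density../5 was used for 5 groups. First the original, then the manipulated version (using position = "stack"):

Original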

Divided by 5

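Because the distribution is heavier on the left, dividing by 5 underestimates the bars on the left and overestimates those on the right.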
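
One possible reading of this mismatch, sketched with base R only: each group's ..density.. is normalised by that group's own size, so dividing by the number of groups gives every group the same share of the total area. That matches the pooled histogram only when the groups are equally large. The unbalanced data below (300 vs. 100 observations, groups named A and B) are hypothetical and chosen purely for illustration.

```r
set.seed(1234)
binwidth <- 0.25

# Hypothetical unbalanced groups: 300 observations in A, 100 in B
grp <- rep(c("A", "B"), times = c(300, 100))
x   <- c(rnorm(300, 55), rnorm(100, 58))

# Common bins for both groups
breaks <- seq(floor(min(x)), ceiling(max(x)), by = binwidth)
counts <- table(grp, cut(x, breaks, include.lowest = TRUE))

n_g <- rowSums(counts)                          # group sizes: 300 and 100

per_group <- counts / (n_g * binwidth)          # per-group density (each row integrates to 1)
naive     <- per_group / nrow(counts)           # "divide by the number of groups"
pooled    <- counts / (sum(counts) * binwidth)  # count / (N * binwidth)

# Area contributed by each group under the two scalings
rowSums(naive)  * binwidth   # equal shares, regardless of group size
rowSums(pooled) * binwidth   # shares proportional to group size
```

With equal group sizes the two scalings coincide, which would explain why dividing by 2 looked right for the 200/200 example above.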

Edit 2: Solution

As suggested by Andrew, the following (complete) code solves the problem:

library(ggplot2)
set.seed(1234)
wdata = data.frame(
  sex = factor(rep(c("F", "M"), each = 200)),
  weight = c(rnorm(200, 55), rnorm(200, 58))
)

binwidth <- 0.25
a <- ggplot(wdata,
            aes(x = weight,
                # Pass binwidth to aes() so it will be found in
                # geom_histogram()'s aes() later
                binwidth = binwidth))

# Basic plot w/o colouring according to 'sex'
a + geom_histogram(aes(y = ..density..),
                   binwidth = binwidth,
                   colour = "black",
                   fill = "white",
                   position = "stack") +
  geom_density(alpha = 0.2) +
  scale_color_manual(values = c("#868686FF", "#EFC000FF")) +
  # Use fixed scale for sake of comparability
  scale_x_continuous(limits = c(52, 61)) +
  scale_y_continuous(limits = c(0, 0.25))


# Plot w/ colouring according to 'sex'
a + geom_histogram(aes(x = weight,
                       # binwidth will only be found if passed to
                       # ggplot()'s aes() (as above)
                       y = ..count.. / (sum(..count..) * binwidth),
                       color = sex),
                   binwidth = binwidth,
                   fill="white",
                   position = "stack") +
  geom_density(alpha = 0.2) +
  scale_color_manual(values = c("#868686FF", "#EFC000FF")) +
  # Use fixed scale for sake of comparability
  scale_x_continuous(limits = c(52, 61)) +
  scale_y_continuous(limits = c(0, 0.25)) +
  guides(color = "none")

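Note: binwidth = binwidth has to be passed to ggplot()'s aes(); otherwise geom_histogram()'s aes() will not find the pre-specified binwidth. In addition, position = "stack" is specified so that both versions of the histogram are comparable. Plots for the dummy data and for a more complex distribution are shown below: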

Correct, ungrouped, simple data

Correct, grouped, simple data

Correct, ungrouped, more complex distribution

Correct, grouped, more complex distribution

Solved. Thanks for your help!

1 Answer:

Answer 0 (score: 1)

I don't think you can do this with y=..density.., but you can recreate the same thing like this...

binwidth <- 0.25 #easiest to set this manually so that you know what it is

a + geom_histogram(aes(y = ..count.. / (sum(..count..) * binwidth),
                       color = sex), 
                   binwidth = binwidth,
                   fill="white",
                   position = "identity") +
    geom_density(alpha = 0.2) +
    scale_color_manual(values = c("#868686FF", "#EFC000FF"))

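
For readers on newer ggplot2 releases: the ..count.. notation was deprecated in ggplot2 3.4.0 in favour of after_stat(), which is available from 3.3.0. The following is a minimal sketch of the same idea in that notation, not a replacement for the code above. It assumes ggplot2 3.3.0 or later, where a global binwidth should be visible inside after_stat(), and it uses layer_data() to check that the coloured bars together cover an area of 1, i.e. the same scale as the joint density curve. The object names p and bars are only illustrative.

```r
library(ggplot2)

set.seed(1234)
wdata <- data.frame(
  sex    = factor(rep(c("F", "M"), each = 200)),
  weight = c(rnorm(200, 55), rnorm(200, 58))
)

binwidth <- 0.25

p <- ggplot(wdata, aes(x = weight)) +
  geom_histogram(
    # sum(count) runs over the bins of all groups in the layer,
    # so every bar is scaled by the pooled sample size
    aes(y = after_stat(count / (sum(count) * binwidth)), color = sex),
    binwidth = binwidth,
    fill = "white",
    position = "identity"
  ) +
  geom_density(alpha = 0.2) +
  scale_color_manual(values = c("#868686FF", "#EFC000FF"))

# Inspect the computed bars of the histogram layer (layer 1)
bars <- layer_data(p, 1)
total_area <- sum(bars$y * binwidth)
total_area   # should equal 1 up to floating-point error
```

Note that sum(count) is taken over the whole layer, so with facets the total would presumably span all panels; that case is not covered here.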