R ggplot2: overlaid histograms, each normalized by its own group count

Time: 2018-02-22 08:37:24

Tags: r ggplot2 histogram normalize

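I want to create a histogram that compares three groups. However, I want each histogram normalized by the total count within its own group, not by the total count across all groups. Here is my code.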

library(ggplot2)
library(reshape2)
# Creates dataset
set.seed(9)
df<- data.frame(values = c(runif(400,20,50),runif(300,40,80),runif(600,0,30)),labels = c(rep("med",400),rep("high",300),rep("low",600)))

levs <- c("low", "med", "high")
df$labels <- factor(df$labels, levels = levs)

ggplot(df, aes(x=values, fill=labels)) + 
    geom_histogram(aes(y=..density..), 
                   breaks= seq(0, 80, by = 2),
                   alpha=0.2, 
                   position="identity")

This produces a histogram that appears to be normalized by density.

However, I decided to cross-check this density plot against a manual calculation. For that, I used the following code:

# Separates the low medium and high groups
df1 <- df[df$labels == "low",]
df2 <- df[df$labels == "med",]
df3 <- df[df$labels == "high",]

# creates histogram for each group that is normalized by the total number of counts
hist_temp <- hist(df1$values, breaks=seq(0,80, by=2))
tdf <- data.frame(hist_temp$breaks[2:length(hist_temp$breaks)],hist_temp$counts)
colnames(tdf) <- c("bins","counts")
tdf$norm <- tdf$counts/(sum(tdf$counts))
low1 <- tdf

hist_temp <- hist(df2$values, breaks=seq(0,80, by=2))
tdf <- data.frame(hist_temp$breaks[2:length(hist_temp$breaks)],hist_temp$counts)
colnames(tdf) <- c("bins","counts")
tdf$norm <- tdf$counts/(sum(tdf$counts))
med1 <- tdf

hist_temp <- hist(df3$values, breaks=seq(0,80, by=2))
tdf <- data.frame(hist_temp$breaks[2:length(hist_temp$breaks)],hist_temp$counts)
colnames(tdf) <- c("bins","counts")
tdf$norm <- tdf$counts/(sum(tdf$counts))
high1 <- tdf

# Combines normalized histograms for each data frame and melts them into a single vector for plotting
Tdata <- data.frame(low1$bins,low1$norm,med1$norm,high1$norm)
colnames(Tdata) <- c("bin","low", "med", "high")
Tdata <- melt(Tdata,id = "bin")

levs <- c("low", "med", "high")
Tdata$variable <- factor(Tdata$variable, levels = levs)

# Plot the data
ggplot(Tdata, aes(group=variable, colour= variable)) + 
    geom_line(aes(x = bin, y = value))

which produces the following line plot.

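As you can see, the two plots are very different from each other, and I cannot figure out why. The two y-axes should be the same, but they are not. So, assuming I have not made some silly math error, I believe I want the histogram to look like the line plot, and I cannot find a way to do that. Any help is appreciated, and thank you in advance.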
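One likely source of the mismatch is that a density is a count divided by both the group size and the bin width, whereas the manual calculation divides by the group size only. With bins of width 2, the density values would therefore be half of the per-group proportions. The following is a minimal base R sketch of that relationship for the "low" group alone; the variable names `x`, `h`, `bin_width`, and `proportion` are illustrative and not part of the original code.

```r
# Sketch: relate hist()'s density to the per-group proportion.
set.seed(9)
x <- runif(600, 0, 30)                      # same distribution as the "low" group

# plot = FALSE returns the binning without drawing anything
h <- hist(x, breaks = seq(0, 80, by = 2), plot = FALSE)

bin_width  <- diff(h$breaks)                # every bin is 2 units wide here
proportion <- h$counts / sum(h$counts)      # the manual normalization from the question

# density = counts / (n * bin_width), so density * bin_width recovers the proportion
print(all.equal(h$density * bin_width, proportion))
```

If this holds, multiplying the density by the bin width should put the histogram on the same y-axis as the line plot.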

Edited to add more examples that do not work:

I also tried the ..count../sum(..count..) approach with this code:

# Histogram where each histogram is divided by the total count of all groups    
    ggplot(df, aes(x=values, fill=labels, group=labels)) + 
        geom_histogram(aes(y=(..count../sum(..count..))), 
                       breaks= seq(0, 80, by = 2),
                       alpha=0.2, 
                       position="identity")

This is the result.

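This simply normalizes by the total count across all histograms, which also does not reflect what I see in the line plot. I have also tried replacing ..count.. with ..ncount.. (in the numerator, in the denominator, and in both), and that does not reproduce the results shown in the line plot either.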
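The behaviour described here is consistent with `sum(..count..)` being evaluated over every bar in the layer, so all groups share a single denominator. A possible single-layer alternative is sketched below. It assumes ggplot2 3.3.0 or later (for `after_stat()`) and relies on the `density` and `width` variables computed by `stat_bin`; their product is the proportion within each group. The y-axis label and the `bars` helper variable are illustrative additions.

```r
# Sketch: per-group proportions in one layer via density * bin width.
library(ggplot2)

set.seed(9)
levs <- c("low", "med", "high")
df <- data.frame(
  values = c(runif(400, 20, 50), runif(300, 40, 80), runif(600, 0, 30)),
  labels = factor(rep(c("med", "high", "low"), times = c(400, 300, 600)),
                  levels = levs)
)

p <- ggplot(df, aes(x = values, fill = labels)) +
  geom_histogram(aes(y = after_stat(density * width)),  # density is computed per group
                 breaks = seq(0, 80, by = 2),
                 alpha = 0.2,
                 position = "identity") +
  ylab("proportion within group")

# Check: the bar heights of each group should sum to 1
bars <- layer_data(p)
print(tapply(bars$y, bars$group, sum))
```

Because the groups stay in one layer, the fill scale is trained on the full factor, which should also keep the legend in factor-level order.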

In addition, I tried position = "stack" instead of "identity", using the following code:

    ggplot(df, aes(x=values, fill=labels, group=labels)) + 
        geom_histogram(aes(y=..density..), 
                       breaks= seq(0, 80, by = 2),
                       alpha=0.2, 
                       position="stack")

and got this result.

This does not reflect the values shown in the line plot either.

Progress! Using the method outlined in this post by Joran, I can now produce a histogram that matches the line plot. Here is the code:

# Plot where each histogram is normalized by its own counts.  
ggplot(df,aes(x=values, fill=labels, group=labels)) + 
    geom_histogram(data=subset(df, labels == 'high'),
                   aes(y=(..count../sum(..count..))), 
                   breaks= seq(0, 80, by = 2),
                   alpha = 0.2) + 
    geom_histogram(data=subset(df, labels == 'med'),
                   aes(y=(..count../sum(..count..))), 
                   breaks= seq(0, 80, by = 2),
                   alpha = 0.2) +
    geom_histogram(data=subset(df, labels == 'low'),
                   aes(y=(..count../sum(..count..))), 
                   breaks= seq(0, 80, by = 2),
                   alpha = 0.2) +
    scale_fill_manual(values = c("blue","red","green"))

This produces the following chart.

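However, I still cannot reorder the data so that the legend reads "low", then "med", then "high" instead of alphabetical order. I have already set the factor levels (see the first block of code). Any ideas?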
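One way to pin the legend order regardless of how the scale is trained across layers is to pass `breaks` explicitly to `scale_fill_manual()` and to name the colour vector so each colour is tied to a level. The sketch below keeps the one-layer-per-group approach but builds the layers with `lapply()`. The colour assignment in `fill_colours` is a guess at what was intended, `after_stat()` assumes ggplot2 3.3.0 or later, and `position = "identity"` is stated explicitly here.

```r
# Sketch: one histogram layer per group, with an explicit legend order.
library(ggplot2)

set.seed(9)
levs <- c("low", "med", "high")
df <- data.frame(
  values = c(runif(400, 20, 50), runif(300, 40, 80), runif(600, 0, 30)),
  labels = factor(rep(c("med", "high", "low"), times = c(400, 300, 600)),
                  levels = levs)
)

# Hypothetical colour assignment; naming the vector ties each colour to a level
fill_colours <- c(low = "green", med = "red", high = "blue")

# Each layer sees one group only, so sum(count) is that group's total
layers <- lapply(levs, function(l) {
  geom_histogram(data = df[df$labels == l, ],
                 aes(y = after_stat(count / sum(count))),
                 breaks = seq(0, 80, by = 2),
                 alpha = 0.2,
                 position = "identity")
})

p <- ggplot(df, aes(x = values, fill = labels)) +
  layers +                                    # a list of layers can be added with +
  scale_fill_manual(values = fill_colours, breaks = levs)

print(p)
```

Setting `breaks = levs` controls which legend entries appear and in what order, independently of the factor's level order.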

1 answer:

Answer 0: (score: 0)

To use the counts of each category, perhaps try position="stack":

ggplot(df, aes(x=values, fill=labels)) + 
  geom_histogram(aes(y=..density..), 
                 breaks= seq(0, 80, by = 2),
                 alpha=0.4, 
                 position="stack") +
  geom_density(alpha=.2, position="stack")

It gave me this distribution, but it still seems different from your second plot.