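I have found many questions about this, but somehow none of them helped me. I don't understand how to change the binwidth of a density histogram in ggplot2 so that the probabilities sum to 1. It seems to work only when the binwidth is exactly 1. Here is an example: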
library(ggplot2)
library(gridExtra)  # provides grid.arrange()

set.seed(1)
df = data.frame("data" = runif(1000, min=0, max=100))

a = ggplot(data = df, aes(x = data)) +
  geom_histogram(aes(y = ..density..), colour = "black", fill = "white",
                 breaks = seq(0, 100, by = 50))

b = ggplot(data = df, aes(x = data)) +
  geom_histogram(aes(y = ..density..),
                 breaks = seq(0, 100, by = 30),
                 col = "black",
                 fill = "white")

c = ggplot(data = df, aes(x = data)) +
  geom_histogram(aes(y = ..density..),
                 breaks = seq(0, 100, by = 10),
                 col = "black",
                 fill = "white")

d = ggplot(data = df, aes(x = data)) +
  geom_histogram(aes(y = ..density..),
                 breaks = seq(0, 100, by = 1),
                 col = "black",
                 fill = "white")

grid.arrange(a, b, c, d, ncol = 2)
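If you look at the probability axis, you can see that the first three plots must be wrong. These are not proper histograms, because the bins do not sum to 1. The y-axis does not even change noticeably between histograms a, b, c and d. I also tried replacing the `breaks` argument with `binwidth`, but that was even worse. I would also like to know how to compute the probability of an individual histogram bin, to show that they sum to 1.
Thanks for your help.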
Answer 0 (score: 1)
Simulate some data:
library(ggplot2)
library(dplyr)
set.seed(1)
df = data.frame("data" = runif(1000, min=0, max=100))
The first plot you get is:
# y axis has the density estimate values
ggplot(data = df, aes(x = data)) +
  geom_histogram(aes(y = ..density..), colour = "black", fill = "white",
                 breaks = seq(0, 100, by = 50))
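This plot shows density values on the y-axis. A density is not a probability: each bar's height is its count divided by (total count × bin width), so it is the bar's area (height × width) that gives the bin's probability, and the areas, not the heights, sum to 1. That is why the heights only look like probabilities when the bin width is exactly 1. The values are on the same scale as a density curve, as you can see in this version with the density curve overlaid: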
# y axis has the density estimate values and the density plot
ggplot(data = df, aes(x = data)) +
  geom_histogram(aes(y = ..density..), colour = "black", fill = "white",
                 breaks = seq(0, 100, by = 50)) +
  geom_density(aes(data), col = "red")
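One way to read this: the height of the red line at a given point is a probability density, not a probability. Because the variable is continuous, there are so many possible values that the probability of any single one is essentially zero; probabilities only come from the area under the curve over an interval.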
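To check the "area, not height" reading numerically, here is a minimal sketch in base R. It assumes the same simulated data as above and uses `hist(..., plot = FALSE)` only to get the bin densities; the bin widths 50, 10 and 1 are chosen for illustration.

```r
set.seed(1)
df <- data.frame(data = runif(1000, min = 0, max = 100))

for (width in c(50, 10, 1)) {
  # compute the histogram without drawing it
  h <- hist(df$data, breaks = seq(0, 100, by = width), plot = FALSE)

  heights_sum <- sum(h$density)                   # equals 1 / width
  area_sum    <- sum(h$density * diff(h$breaks))  # equals 1 for every width

  cat(sprintf("width = %g: sum of heights = %g, sum of areas = %g\n",
              width, heights_sum, area_sum))
}
```

The heights sum to 1 only for a bin width of 1, while the areas sum to 1 for every width, which matches what the question observed.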
You can get what you want with this:
# y axis has the probabilities of each bar (bar counts / all counts)
ggplot(data = df, aes(x = data)) +
  geom_histogram(aes(y = ..count../sum(..count..)), colour = "black", fill = "white",
                 breaks = seq(0, 100, by = 50))
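If you want to confirm that the bar heights in a plot like this really are probabilities, one option is to pull the computed bar data back out of the plot. This is only a sketch: it assumes ggplot2 3.3.0 or later for `after_stat()` (the newer spelling of the `..count..` notation) and uses `layer_data()`; the bin width of 10 and the names `p` and `bars` are just for illustration.

```r
library(ggplot2)

set.seed(1)
df <- data.frame(data = runif(1000, min = 0, max = 100))

# same idea as above: bar height = bar count / total count
p <- ggplot(df, aes(x = data)) +
  geom_histogram(aes(y = after_stat(count / sum(count))),
                 breaks = seq(0, 100, by = 10),
                 colour = "black", fill = "white")

# extract the values ggplot2 computed for the first (and only) layer
bars <- layer_data(p, 1)

bars[, c("xmin", "xmax", "y")]  # one row per bar, y is the bar's probability
sum(bars$y)                     # should be 1
```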
Another way to do the above, which keeps the data (for later use, or just to check that the probabilities sum to 1), is this:
# assign the breaks
breaks = cut(df$data, seq(0, 100, by = 50))

# count observations in each bar and probability of each bar
df %>%
  mutate(Breaks = breaks) %>%
  count(Breaks) %>%
  mutate(Prc = n/sum(n))

# # A tibble: 2 x 3
#   Breaks       n   Prc
#   <fctr>   <int> <dbl>
# 1 (0,50]     520  0.52
# 2 (50,100]   480  0.48

# plot the above
df %>%
  mutate(Breaks = breaks) %>%
  count(Breaks) %>%
  mutate(Prc = n/sum(n)) %>%
  ggplot(aes(Breaks, Prc)) + geom_col()
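The same table can be wrapped in a small helper to check several bin widths at once. This is a rough sketch, not part of the original answer: `bin_probabilities()` is a made-up name, the `density` column is added only to tie the result back to the density histogram, and `include.lowest = TRUE` is my assumption so that a value exactly at 0 would still land in the first bin.

```r
library(dplyr)

set.seed(1)
df <- data.frame(data = runif(1000, min = 0, max = 100))

# hypothetical helper: per-bin count, probability and density for a given width
bin_probabilities <- function(x, width) {
  edges <- seq(0, 100, by = width)
  tibble(bin = cut(x, edges, include.lowest = TRUE)) %>%
    count(bin) %>%
    mutate(prob = n / sum(n),        # probability of the bin
           density = prob / width)   # bar height in a density histogram
}

for (w in c(50, 10, 1)) {
  res <- bin_probabilities(df$data, w)
  cat("width", w, "-> total probability", sum(res$prob), "\n")
}
```

Note that a width that does not divide 100 evenly, such as the 30 used in plot b of the question, gives breaks that stop at 90, so `cut()` returns `NA` for every value above 90.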