Question

具体的例子是，假设x是0到10之间的一些连续变量，而红线是＆＃34;货物＆＃34;的分布。而蓝色是坏事＆＃34;，我想看看将这个变量纳入检查“善”的价值是否有价值。但我想首先量化蓝色＆gt;区域内的东西数量。红色

因为这是一个分布图，尺度看起来相同，但实际上我的样本中有98倍的好处使事情变得复杂，因为它实际上并不只是测量曲线下的面积，而是测量不良样本，它的分布沿着大于红色的线。

我一直在努力学习R，但我甚至不确定如何接近这个，任何帮助表示赞赏。 enter image description here

EDIT 样本数据： http://pastebin.com/7L3Xc2KU＆lt; - 基本上是几百万行。

使用

创建图表

graph <- qplot(sample_x, bad_is_1, data=sample_data, geom="density", color=bid_is_1)

Answer 1

我能想到的唯一方法是使用简单的梯形计算曲线之间的面积。首先，我们手动计算密度

d0 <- density(sample$sample_x[sample$bad_is_1==0])
d1 <- density(sample$sample_x[sample$bad_is_1==1])

现在我们创建将在我们观察到的密度点之间进行插值的函数

f0 <- approxfun(d0$x, d0$y)
f1 <- approxfun(d1$x, d1$y)

接下来，我们找到密度重叠的x范围

ovrng <- c(max(min(d0$x), min(d1$x)), min(max(d0$x), max(d1$x)))

并将其划分为500个部分

i <- seq(min(ovrng), max(ovrng), length.out=500)

现在我们计算密度曲线之间的距离

h <- f0(i)-f1(i)

并使用梯形区域的公式，我们将d1> d0

区域的面积相加

area<-sum( (h[-1]+h[-length(h)]) /2 *diff(i) *(h[-1]>=0+0))
# [1] 0.1957627

我们可以使用

绘制区域

plot(d0, main="d0=black, d1=green")
lines(d1, col="green")
jj<-which(h>0 & seq_along(h) %% 5==0); j<-i[jj]; 
segments(j, f1(j), j, f1(j)+h[jj])

enter image description here

Answer 2

这是一种遮蔽两个密度图之间区域的方法，并计算该区域的大小。

# Create some fake data
set.seed(10)
dat = data.frame(x=c(rnorm(1000, 0, 5), rnorm(2000, 0, 1)), 
                 group=c(rep("Bad", 1000), rep("Good", 2000)))

# Plot densities
# Use y=..count.. to get counts on the vertical axis
p1 = ggplot(dat) +
       geom_density(aes(x=x, y=..count.., colour=group), lwd=1)

一些额外的计算，以遮蔽两个密度图之间的区域（改编自this SO question）：

pp1 = ggplot_build(p1)

# Create a new data frame with densities for the two groups ("Bad" and "Good")
dat2 = data.frame(x = pp1$data[[1]]$x[pp1$data[[1]]$group==1],
                 ymin=pp1$data[[1]]$y[pp1$data[[1]]$group==1],
                 ymax=pp1$data[[1]]$y[pp1$data[[1]]$group==2])

# We want ymax and ymin to differ only when the density of "Good" 
# is greater than the density of "Bad"
dat2$ymax[dat2$ymax < dat2$ymin] = dat2$ymin[dat2$ymax < dat2$ymin]

# Shade the area between "Good" and "Bad"
p1a = p1 +  
    geom_ribbon(data=dat2, aes(x=x, ymin=ymin, ymax=ymax), fill='yellow', alpha=0.5)

以下是两个图：

enter image description here

要获取Good和Bad特定范围内的区域（值的数量），请在每个组上使用density功能（或者您可以继续使用拉取的数据来自ggplot，如上所述，但这样可以更直接地控制密度分布的生成方式）：

## Calculate densities for Bad and Good. 
# Use same number of points and same x-range for each group, so that the density 
# values will line up. Use a higher value for n to get a finer x-grid for the density
# values. Use a power of 2 for n, because the density function rounds up to the nearest 
# power of 2 anyway.
bad = density(dat$x[dat$group=="Bad"], 
             n=1024, from=min(dat$x), to=max(dat$x))
good = density(dat$x[dat$group=="Good"], 
             n=1024, from=min(dat$x), to=max(dat$x))

## Normalize so that densities sum to number of rows in each group

# Number of rows in each group
counts = tapply(dat$x, dat$group, length)

bad$y = counts[1]/sum(bad$y) * bad$y
good$y = counts[2]/sum(good$y) * good$y

## Results

# Number of "Good" in region where "Good" exceeds "Bad"
sum(good$y[good$y > bad$y])
[1] 1931.495  # Out of 2000 total in the data frame

# Number of "Bad" in region where "Good" exceeds "Bad"
sum(bad$y[good$y > bad$y])
[1] 317.7315  # Out of 1000 total in the data frame

如何测量R / ggplot2中2条分布曲线之间的面积

2 个答案: