Question

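My data has a bimodal distribution (histogram below), and I want to find a way to split it. I can obviously do this by eye, but I have hundreds of similar datasets, so I would like to do it automatically.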

[Figure: histogram of my data]

> dput(dat[1:100])
structure(c(6.68586094706836, 0, 6.3578422665081, 6.3578422665081, 
6.61338421837956, 0, 0, 6.39859493453521, 6.4377516497364, 0, 
0, 0, 6.24027584517077, 6.46302945692067, 6.37842618365159, 6.30809844150953, 
0, 6.44413125670044, 0, 0, 6.24027584517077, 6.58617165485467, 
0, 0, 6.28599809450886, 6.45676965557216, 0, 0, 6.43133108193348, 
6.45047042214418, 0, 6.49375383985169, 0, 6.34388043412633, 6.56385552653213, 
6.94022246911964, 6.2709884318583, 6.78105762593618, 0, 6.32256523992728, 
6.43133108193348, 6.36475075685191, 0, 6.5410299991899, 0, 0, 
0, 0, 6.75343791859778, 6.34388043412633, 0, 0, 0, 6.26339826259162, 
0, 6.37842618365159, 0, 6.45047042214418, 6.34388043412633, 0, 
0, 6.84694313958538, 6.83410873881384, 6.62406522779989, 0, 6.4377516497364, 
6.43133108193348, 0, 6.51767127291227, 6.46925031679577, 0, 6.67582322163485, 
6.39859493453521, 6.90875477931522, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6.31535800152233, 0, 0, 0, 0, 
0))

I was advised to use a mixture model, and I tried the following code that I found.

library(mixtools)

simulate <- function(lambda=0.3, mu=c(0, 4), sd=c(1, 1), n.obs=10^5) {
    x1 <- rnorm(n.obs, mu[1], sd[1])
    x2 <- rnorm(n.obs, mu[2], sd[2])
    return(ifelse(runif(n.obs) < lambda, x1, x2))
}
x <- simulate()
model <- normalmixEM(x=x, k=2)
index.lower <- which.min(model$mu)  # Index of component with lower mean

find.cutoff <- function(proba=0.5, i=index.lower) {
    ## Cutoff such that Pr[drawn from bad component] == proba
    f <- function(x) {
        proba - (model$lambda[i]*dnorm(x, model$mu[i], model$sigma[i]) /
                     (model$lambda[1]*dnorm(x, model$mu[1], model$sigma[1]) +
                      model$lambda[2]*dnorm(x, model$mu[2], model$sigma[2])))
    }
    return(uniroot(f=f, lower=-10, upper=10)$root)  # Careful with division by zero if changing lower and upper
}

cutoffs <- c(find.cutoff(proba=0.5), find.cutoff(proba=0.75))  # Around c(1.8, 1.5)

hist(x)
abline(v=cutoffs, col=c("red", "blue"), lty=2)

However, I get this error:

One of the variances is going to zero;  trying new starting values.

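I guess this is because there is no variation in the samples at 0; the only variation is among the samples around 6. (Note that the lower peak will not always be at zero, but it usually will be.)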
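This guess can be checked with a minimal sketch. The vector `x` below is simulated and only stands in for the real data, and the threshold of 3 and the names `low`, `low_var`, `sds` and `dens` are illustrative assumptions. It shows that the lower cluster has zero sample variance, and that a normal density centred on those points grows without bound as its standard deviation shrinks, which is consistent with the message about a variance going to zero.

```r
# Simulated stand-in for the real data (an assumption): exact zeros plus values near 6.5
set.seed(1)
x <- c(rep(0, 60), rnorm(40, mean = 6.5, sd = 0.2))

low <- x[x < 3]        # 3 is an arbitrary illustrative threshold between the peaks
low_var <- var(low)    # every value in the lower cluster is 0
print(low_var)

# The density of a normal centred on the zeros, evaluated at 0,
# keeps increasing as the standard deviation is made smaller.
sds <- c(1, 1e-2, 1e-4, 1e-8)
dens <- dnorm(0, mean = 0, sd = sds)
print(dens)
```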

Is there a way around this, or is there another method I should be using? Thanks.

Answer 1

Perhaps you could take the output of hist and run something like pastecs::turnpoints on it.

library(pastecs)

foo <- hist(your_data, your_arguments)
mins <- turnpoints(foo$counts)$pits  # hist() returns $counts; it has no $values element

Then "back up" and identify the foo$breaks or foo$mids corresponding to that minimum.
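The "back up" step could look roughly like the sketch below. It is written in base R only and does not call pastecs::turnpoints; it finds the pit with a simple run-length comparison instead, so treat it as an illustration of the idea rather than the method above. The helper `find_valley`, its `breaks` argument, and the simulated `x` are all hypothetical names and data introduced for this example.

```r
# Hypothetical helper: cutoff at the deepest valley of the histogram counts.
find_valley <- function(x, breaks = 30) {
  h <- hist(x, breaks = breaks, plot = FALSE)

  # Collapse plateaus of equal counts (e.g. a run of empty bins) into single runs
  r <- rle(h$counts)
  ends <- cumsum(r$lengths)
  starts <- ends - r$lengths + 1
  v <- r$values
  n <- length(v)
  if (n < 3) return(NA_real_)

  # An interior run is a pit if it is lower than both neighbouring runs
  is_pit <- c(FALSE, v[2:(n - 1)] < v[1:(n - 2)] & v[2:(n - 1)] < v[3:n], FALSE)
  if (!any(is_pit)) return(NA_real_)

  # Keep the pit with the lowest count and map it back to the bin midpoints
  pit <- which(is_pit)[which.min(v[is_pit])]
  mean(h$mids[c(starts[pit], ends[pit])])
}

# Simulated stand-in for the real data (an assumption): zeros plus values near 6.5
set.seed(1)
x <- c(rep(0, 60), rnorm(40, mean = 6.5, sd = 0.2))

cutoff <- find_valley(x)
groups <- split(x, x > cutoff)  # "FALSE" = lower mode, "TRUE" = upper mode
print(cutoff)
print(lengths(groups))
```

Taking the middle of the plateau is a design choice for data like this, where many empty bins separate the two peaks. The result depends on the bin width, so `breaks` may need tuning per dataset, and the function returns NA when no interior valley exists.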
