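My data has a bimodal distribution (histogram below), and I would like to find a way to split it. I can obviously do this by eye, but I have hundreds of similar datasets, so I want to automate it.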
> dput(dat[1:100])
structure(c(6.68586094706836, 0, 6.3578422665081, 6.3578422665081,
6.61338421837956, 0, 0, 6.39859493453521, 6.4377516497364, 0,
0, 0, 6.24027584517077, 6.46302945692067, 6.37842618365159, 6.30809844150953,
0, 6.44413125670044, 0, 0, 6.24027584517077, 6.58617165485467,
0, 0, 6.28599809450886, 6.45676965557216, 0, 0, 6.43133108193348,
6.45047042214418, 0, 6.49375383985169, 0, 6.34388043412633, 6.56385552653213,
6.94022246911964, 6.2709884318583, 6.78105762593618, 0, 6.32256523992728,
6.43133108193348, 6.36475075685191, 0, 6.5410299991899, 0, 0,
0, 0, 6.75343791859778, 6.34388043412633, 0, 0, 0, 6.26339826259162,
0, 6.37842618365159, 0, 6.45047042214418, 6.34388043412633, 0,
0, 6.84694313958538, 6.83410873881384, 6.62406522779989, 0, 6.4377516497364,
6.43133108193348, 0, 6.51767127291227, 6.46925031679577, 0, 6.67582322163485,
6.39859493453521, 6.90875477931522, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6.31535800152233, 0, 0, 0, 0,
0))
I was advised to use a mixture model, and I am using the following code that I found.
library(mixtools)
simulate <- function(lambda=0.3, mu=c(0, 4), sd=c(1, 1), n.obs=10^5) {
    x1 <- rnorm(n.obs, mu[1], sd[1])
    x2 <- rnorm(n.obs, mu[2], sd[2])
    return(ifelse(runif(n.obs) < lambda, x1, x2))
}
x <- simulate()
model <- normalmixEM(x=x, k=2)
index.lower <- which.min(model$mu)  # Index of component with lower mean
find.cutoff <- function(proba=0.5, i=index.lower) {
    ## Cutoff such that Pr[drawn from bad component] == proba
    f <- function(x) {
        proba - (model$lambda[i]*dnorm(x, model$mu[i], model$sigma[i]) /
                     (model$lambda[1]*dnorm(x, model$mu[1], model$sigma[1]) +
                      model$lambda[2]*dnorm(x, model$mu[2], model$sigma[2])))
    }
    return(uniroot(f=f, lower=-10, upper=10)$root)  # Careful with division by zero if changing lower and upper
}
cutoffs <- c(find.cutoff(proba=0.5), find.cutoff(proba=0.75)) # Around c(1.8, 1.5)
hist(x)
abline(v=cutoffs, col=c("red", "blue"), lty=2)
However, I got this error:
One of the variances is going to zero; trying new starting values.
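I think this is because there is no variation among the samples at 0; the only variation in the sample is around 6. (Note that the lower peak will not always be at zero, but it usually will be.)
Is there a way around this, or is there another method I should use? Thanks,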
Answer 0 (score: 0)
Perhaps you could take the output of hist and use something like pastecs::turnpoints on it.
foo <- hist(your_data, your_arguments)
# hist() stores the bin heights in $counts (it has no $values element)
mins <- pastecs::turnpoints(foo$counts)$pits
Then "work backwards" and identify the foo$breaks or foo$mids value that corresponds to that minimum.
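As a rough illustration of the histogram idea, here is a minimal base R sketch that avoids the pastecs dependency. It is not the code from the answer above: the helper name find_valley_cutoff, the default of 30 bins, and the synthetic vector x (a stand-in for the real data) are all assumptions made for this example. It takes the two tallest local peaks of the histogram and returns the midpoint of the emptiest bin between them.

```r
# Hypothetical helper (not part of the original answer): estimate a cutoff at
# the valley between the two tallest histogram peaks, using base R only.
find_valley_cutoff <- function(x, breaks = 30) {
  h <- hist(x, breaks = breaks, plot = FALSE)
  counts <- h$counts
  n <- length(counts)

  # A bin is a peak if it is taller than its left neighbour and at least as
  # tall as its right neighbour (both ends are padded with -1).
  left  <- c(-1, counts[-n])
  right <- c(counts[-1], -1)
  peaks <- which(counts > left & counts >= right)
  if (length(peaks) < 2) return(NA_real_)

  # Keep the two tallest peaks, ordered from left to right.
  top2 <- sort(peaks[order(counts[peaks], decreasing = TRUE)[1:2]])

  # Bins strictly between the two peaks; the valley is the emptiest of them.
  between <- (top2[1] + 1):(top2[2] - 1)
  lowest  <- between[counts[between] == min(counts[between])]

  # If several bins tie (e.g. a run of empty bins), take the middle one.
  h$mids[lowest[ceiling(length(lowest) / 2)]]
}

# Synthetic stand-in for the real data: a spike at 0 and a cluster near 6.6.
x <- c(rep(0, 50), seq(6.25, 6.95, length.out = 50))
cutoff <- find_valley_cutoff(x)
table(x > cutoff)  # number of observations on each side of the cutoff
```

Because this works on bin counts rather than fitted variances, a spike of identical values at 0 does not cause trouble. With noisy data the two tallest peaks can both fall inside the same mode, so it may be worth trying fewer bins and checking a few datasets by eye before running it on all of them.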