Determining the threshold of a bimodal distribution via KMeans clustering

Time: 2017-02-10 01:22:56

Tags: python scikit-learn cluster-analysis

I'd like to find a threshold for a bimodal distribution. For example, a bimodal distribution could look like the following:

import numpy as np
import matplotlib.pyplot as plt
np.random.seed(45)
n = 1000; b = n//10; i = np.random.randint(0,2,n)
x = i*np.random.normal(-2.0,0.8,n) + (1-i)*np.random.normal(2.0,0.8,n)
_ = plt.hist(x,bins=b)

[Figure: bimodal histogram of x]

Trying to find the cluster centers did not work, because I wasn't sure how the matrix h should be formatted:

from sklearn.cluster import KMeans
h = np.histogram(x,bins=b)
h = np.vstack((0.5*(h[1][:-1]+h[1][1:]),h[0])).T  # because h[0] and h[1] have different sizes.
kmeans = KMeans(n_clusters=2).fit(h)

I would expect to find cluster centers around -2 and 2. The threshold would then be the midpoint of the two cluster centers.

1 answer:

Answer 0 (score: 1)

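Your question is a bit confusing to me, so please let me know if I've interpreted it incorrectly. I think you are basically trying to do 1D k-means and are introducing the frequency as a second dimension just to get KMeans to work, when you would really be happy with [-2, 2] as the output for the centers instead of [(-2, y1), (2, y2)].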
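If you do want to start from the histogram rather than the raw samples, one option is to keep the bin centers as the only feature and pass the bin counts through the `sample_weight` argument of `KMeans.fit`, so the frequency acts as a weight instead of a second coordinate. This is only a sketch: the names `counts`, `edges`, `bin_centers` and `hist_centers`, and the choices `n_init=10` and `random_state=0`, are mine and not part of the question.

```python
import numpy as np
from sklearn.cluster import KMeans

# Same synthetic bimodal sample as in the question.
np.random.seed(45)
n = 1000
i = np.random.randint(0, 2, n)
x = i * np.random.normal(-2.0, 0.8, n) + (1 - i) * np.random.normal(2.0, 0.8, n)

# Histogram: counts has one entry per bin, edges has one more entry than counts.
counts, edges = np.histogram(x, bins=n // 10)
bin_centers = 0.5 * (edges[:-1] + edges[1:])

# One feature (the bin center); the counts enter as per-sample weights.
# n_init and random_state are assumptions made here for reproducibility.
kmeans_hist = KMeans(n_clusters=2, n_init=10, random_state=0).fit(
    bin_centers.reshape(-1, 1), sample_weight=counts
)

# Sort the centers because the order of the cluster labels is arbitrary.
hist_centers = np.sort(kmeans_hist.cluster_centers_.ravel())
print(hist_centers)
```

The centers should land close to the ones you get from the raw data, up to the binning error.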

To do 1D k-means, you simply reshape your data into n vectors of length 1 (similar question: Scikit-learn: How to run KMeans on a one-dimensional array?).

Code:

import numpy as np
import matplotlib.pyplot as plt
np.random.seed(45)
n = 1000
b = n//10
i = np.random.randint(0,2,n)
x = i*np.random.normal(-2.0,0.8,n) + (1-i)*np.random.normal(2.0,0.8,n)
_ = plt.hist(x,bins=b)

from sklearn.cluster import KMeans
h = np.histogram(x,bins=b)
h = np.vstack((0.5*(h[1][:-1]+h[1][1:]),h[0])).T  # because h[0] and h[1] have different sizes.

kmeans = KMeans(n_clusters=2).fit(x.reshape(n,1))
print(kmeans.cluster_centers_)

Output:

[[-1.9896414]
 [ 2.0176039]]
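
To get the threshold you asked for, you can take the midpoint of the two fitted centers. A minimal sketch, assuming the same synthetic data as above; the names `centers`, `threshold` and `is_upper_mode`, and the settings `n_init=10` and `random_state=0`, are my own additions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Same synthetic bimodal sample as in the question.
np.random.seed(45)
n = 1000
i = np.random.randint(0, 2, n)
x = i * np.random.normal(-2.0, 0.8, n) + (1 - i) * np.random.normal(2.0, 0.8, n)

# KMeans expects a 2D array of shape (n_samples, n_features),
# so the 1D sample is reshaped to (n, 1).
# n_init and random_state are assumptions made here for reproducibility.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(x.reshape(-1, 1))

# Flatten the (2, 1) array of centers and sort it, since label order is arbitrary.
centers = np.sort(kmeans.cluster_centers_.ravel())

# Threshold as defined in the question: the midpoint of the two centers.
threshold = centers.mean()

# Boolean mask: True for samples that fall on the upper-mode side of the threshold.
is_upper_mode = x > threshold
print(centers, threshold)
```

With only two clusters in one dimension, each point is assigned to its nearest center, so this midpoint is also the boundary between the two KMeans clusters.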