SciPy: clustering binary data and getting labels

Time: 2014-04-21 08:12:33

Tags: python scipy cluster-analysis data-mining k-means

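I am trying to run k-means clustering on a binary dataset. The matrix below is based on web page visits ('1' for visited, '0' for not visited). The first column is a label identifying each user.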

0,1,1,0,1,0,1,0,1,1,0
1,1,0,0,1,1,0,1,0,1,0
2,1,0,0,0,1,0,1,0,1,1
3,1,0,1,0,1,0,0,0,1,0
4,0,1,1,1,0,1,0,1,0,0
5,1,1,0,0,1,0,1,1,1,1
6,0,0,1,0,1,1,0,1,0,0
7,1,1,0,1,0,1,0,0,1,0
8,1,0,0,0,1,0,1,1,1,1
9,0,1,1,0,1,0,1,0,0,0

I am using SciPy's k-means and following this tutorial. In the end, I want to know which cluster each user belongs to. For example, with k = 3:

0 - cluster_1
1 - cluster_0
2 - cluster_1
3 - cluster_3
.. - .... 

Below is what I have tried; the binary data does not seem to be clustered correctly. Can this be improved to get my expected output?

import numpy as np
from pylab import plot,show
from numpy import vstack,array
from numpy.random import rand
from scipy.cluster.vq import kmeans,vq

# data generation
data = np.array([[1,0,0,1,1,0,1,0,1,0],
[1,0,0,0,1,0,1,0,1,1],
[1,0,1,0,1,0,0,0,1,0],
[0,1,1,1,0,1,0,1,0,0],
[1,1,0,0,1,0,1,1,1,1],
[0,0,1,0,1,1,0,1,0,0],
[1,1,0,1,0,1,0,0,1,0],
[1,0,0,0,1,0,1,1,1,1],
[0,1,1,0,1,0,1,0,0,0],
[1,1,0,1,0,1,0,1,1,0]])

centroids,_ = kmeans(data,2)
idx,_ = vq(data,centroids)
plot(data[idx==0,0],data[idx==0,1],'ob',
     data[idx==1,0],data[idx==1,1],'or')
plot(centroids[:,0],centroids[:,1],'sg',markersize=8)
show()

1 answer:

Answer 0 (score: 0)

Please read the documentation more closely instead of just copying and pasting code from the web.

idx,_ = vq(data,centroids)

Have you looked at what idx is?
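
To make the hint concrete: vq returns, for each row of data, the index of the nearest centroid, so idx is already the per-row cluster assignment. The following is a minimal sketch, assuming the rows of data are numbered in the order they appear in the question's code and that the matrix is cast to float. The cluster_N naming simply mirrors the expected output in the question. Because kmeans starts from randomly chosen centroids, the cluster numbers can differ from run to run.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

# Same matrix as in the question's code, cast to float (an assumption made
# here so that centroids and observations share a floating-point type).
data = np.array([[1,0,0,1,1,0,1,0,1,0],
                 [1,0,0,0,1,0,1,0,1,1],
                 [1,0,1,0,1,0,0,0,1,0],
                 [0,1,1,1,0,1,0,1,0,0],
                 [1,1,0,0,1,0,1,1,1,1],
                 [0,0,1,0,1,1,0,1,0,0],
                 [1,1,0,1,0,1,0,0,1,0],
                 [1,0,0,0,1,0,1,1,1,1],
                 [0,1,1,0,1,0,1,0,0,0],
                 [1,1,0,1,0,1,0,1,1,0]], dtype=float)

centroids, _ = kmeans(data, 2)   # fit the centroids
idx, _ = vq(data, centroids)     # idx[i] = index of the centroid nearest to row i

# Print one "row - cluster" line per observation.
for row, cluster in enumerate(idx):
    print(f"{row} - cluster_{cluster}")
```

Note that the plot in the question only draws the first two of the ten columns, so it may look poorly separated even when the assignments in idx are reasonable.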