Clustering a dataset using a distance-based approach

Time: 2015-05-13 10:03:23

Tags: python machine-learning cluster-analysis k-means

I want to compare the ROCK clustering algorithm with a distance-based algorithm. Suppose we have m training examples and n features.

ROCK:

From my understanding, this is what ROCK does:

1. It calculates a similarity matrix (m * m) using Jaccard coefficients.
2. A threshold value is then provided by the user.
3. Based on the threshold value, it links the data points; data points that have more neighbors in common are said to be in the same cluster. 
   For example, see the image below:

[image: similarity matrix]

   The above picture shows the similarity matrix; let threshold_value = 0.2. 
4. The algorithm then computes links between the points, which come out as below:
   for A - A   (only A-A exceeds the threshold value) 
       B - BCD (because B-B, B-C and B-D exceed the threshold value)
       C - BCD
       D - BCD
   Now, since B, C and D each have 3 neighbors in common, they are grouped into the same cluster.

Therefore we get two clusters: {A} and {B, C, D}.
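The link-building steps above can be sketched as follows. This is only a toy illustration of the neighbor/link idea, not the full ROCK algorithm (which uses a goodness measure over link counts); the `sim` values and `threshold` are made up to reproduce the A/BCD example:

```python
import numpy as np

# Toy similarity matrix (Jaccard coefficients) for points A, B, C, D.
labels = ["A", "B", "C", "D"]
sim = np.array([
    [1.0, 0.1, 0.1, 0.1],
    [0.1, 1.0, 0.3, 0.4],
    [0.1, 0.3, 1.0, 0.5],
    [0.1, 0.4, 0.5, 1.0],
])
threshold = 0.2

# Step 3: two points are neighbors if their similarity exceeds the
# threshold (a point is trivially its own neighbor, since sim = 1.0).
neighbours = {
    labels[i]: {labels[j] for j in range(len(labels)) if sim[i, j] > threshold}
    for i in range(len(labels))
}

# Step 4 (crude stand-in for ROCK's link counting): points whose
# neighbor sets coincide share the most links, so group them together.
clusters = {}
for point, nbrs in neighbours.items():
    clusters.setdefault(frozenset(nbrs), set()).add(point)

print(neighbours)
print(list(clusters.values()))
```

Running this reproduces the grouping described above: A ends up alone, while B, C and D fall into one cluster.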

Distance-based approach:

1. I take a different approach, but like ROCK I also create the similarity matrix.
2. I also compute the initial links, like:
   for A - A   (only A-A exceeds the threshold value) 
       B - BCD (because B-B, B-C and B-D exceed the threshold value)
       C - BCD
       D - BCD
3. Now, instead of finding neighbors, I perform some mojo and find the best centroids.
4. After finding the centroids, I run the k-means clustering algorithm over the similarity matrix (m * m).
5. Since I find the centroids beforehand, the time taken by the algorithm is reduced, because k-means does not have to be run multiple times with randomly chosen centroids.
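Steps 4-5 can be approximated with scikit-learn: passing an array as `init` together with `n_init=1` makes k-means run exactly once from the pre-computed centroids instead of retrying several random initialisations. The centroid-finding "mojo" is the asker's own step, so here the centroids are just two arbitrary rows of the data, purely for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for the (m, m) similarity matrix, rows used as feature vectors.
X = rng.random((100, 100))

# Pretend these were found beforehand by the centroid-selection step.
initial_centroids = X[[0, 50]]

# init=<ndarray> with n_init=1: a single k-means run from fixed centroids.
km = KMeans(n_clusters=2, init=initial_centroids, n_init=1).fit(X)
print(km.labels_.shape)  # (100,)
```

This is the speed-up being claimed in step 5: the repeated random restarts that k-means normally performs are skipped.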

Problem statement:

The problem I see is space complexity: the similarity matrix is an (m * m) matrix, and if m is large, say 1 million, then storing such a large matrix becomes difficult, and because of the matrix's size the Euclidean distance computations also take a lot of time.
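To make the space problem concrete, the footprint of a dense (m * m) matrix grows quadratically with m; at m = 1 million and 8 bytes per entry it is far beyond main memory:

```python
m = 1_000_000
bytes_per_entry = 8              # one 64-bit float per similarity value
total_bytes = m * m * bytes_per_entry

# 10^12 entries at 8 bytes each: about 7.28 TiB for a dense matrix.
print(total_bytes / 1024**4)
```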

In ROCK, I believe, there is absolutely no need to store the matrix, because the links can be built on the fly while the Jaccard coefficients are computed between the data points.
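A sketch of that on-the-fly idea, assuming each point is represented as a set of categorical feature values (the toy data and the 0.2 threshold are made up): each Jaccard coefficient is computed when needed and immediately turned into a link, so the full (m * m) matrix never exists in memory.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient between two sets of feature values."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Each point is the set of categorical features it has (toy data).
points = {
    "A": {1, 2},
    "B": {2, 3, 4},
    "C": {3, 4, 5},
    "D": {2, 3, 5},
}
threshold = 0.2

# Links accumulate pair by pair; no similarity matrix is stored.
links = {p: set() for p in points}
names = list(points)
for i, p in enumerate(names):
    for q in names[i + 1:]:
        if jaccard(points[p], points[q]) > threshold:
            links[p].add(q)
            links[q].add(p)

print(links)
```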

I ran my distance-based approach on the similarity matrix of the mushroom dataset (uci.org), and the output is very similar to ROCK's; for some other datasets it is even better.

Questions:

1. Is my understanding of ROCK correct?
2. Is it even worth considering to create such a large similarity matrix, store it on disk and use it later to calculate distances? 
3. I would really appreciate it if someone could provide the big-O complexity for the distance-based approach. 

Thanks :)

1 answer:

Answer 0: (score: 1)

To my knowledge, clustering becomes very memory-intensive as the size increases, and you will have to find a way to reduce the dimensionality of your data.

I am not familiar with ROCK, but I have worked on clustering problems before, where I had to cluster millions of documents.

Distance Calculation Metric : Levenshtein distance
Clustering Algorithm : DBSCAN
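That combination might look roughly like this with scikit-learn's `metric="precomputed"` option, which tells DBSCAN to treat the input as a distance matrix. The Levenshtein implementation here is a plain dynamic-programming version and the word list, `eps` and `min_samples` values are made-up illustrations; the pairwise matrix is only feasible for small n, which is exactly the scaling problem under discussion:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

words = ["cat", "cart", "care", "dog", "dot"]

# Pairwise distance matrix -- quadratic in the number of words.
dist = np.array([[levenshtein(a, b) for b in words] for a in words])

# metric="precomputed" makes DBSCAN read `dist` as distances directly.
labels = DBSCAN(eps=1.5, min_samples=2, metric="precomputed").fit_predict(dist)
print(dict(zip(words, labels)))
```

With these toy values, the edit-distance-1 words cluster together: {cat, cart, care} and {dog, dot}.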

Coming back to your questions:

Question 2: 
Is it even worth considering to create such a large similarity matrix, store it on disk and use it later to calculate distances?

I would never suggest building such a large matrix. For example, building a distance matrix over 1 million words would require around 4 TB of space. You would have to use some kind of blocking technique to group somewhat similar documents first, and then apply a clustering algorithm on top.

Question 3: 
I would really appreciate it if someone could provide the big-O complexity for the distance-based approach. 

In general, the time taken to compute the distance between two words is trivial, since words are not very long. Your complexity is the number of comparisons, so if you have n words, the time complexity is O(n * n).