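Bisecting K-means cluster indices in Apache Spark

Time: 2018-03-13 20:42:46

Tags: apache-spark

Apache Spark Bisecting K-means: dendrogram from a linkage matrix

I am trying to generate a dendrogram from the results of Bisecting K-means clustering in Spark. I have found a few variations of this question online, and there is a related JIRA request, but I could not find anyone with a working solution.

To try to get this working, I compiled Spark MLlib 2.2.0 with yu-iksw's toLinkageMatrix function and made some changes to the log output so that it reports more about the bisecting cluster selection process. I have uploaded this JAR together with a sample SBT build for testing, so anyone who wants to help but does not want to rebuild Spark MLlib from source can run their own tests. As the SBT build in my GitHub repo shows, the mllib and mllib-local JARs are in the /lib folder.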

To plot the linkage matrix output from my tests, I use a Jupyter notebook and manually pass the Spark linkage output to SciPy's dendrogram function. The Jupyter notebook is also in my GitHub repo.
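
For readers who want to reproduce the plotting step, here is a minimal sketch of handing a linkage matrix to SciPy. It uses the 3-cluster linkage matrix reported further down in this post. The variable names are illustrative, and `no_plot=True` is used so that the call only computes the layout without needing matplotlib.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, is_valid_linkage

# Linkage matrix of the 3-cluster Iris run (values copied from the post).
# Columns: node1, node2, distance, tree_size.
Z = np.array([
    [1, 2, 1.044390196577994, 2],
    [0, 3, 2.540618378626947, 3],
], dtype=float)  # SciPy expects a float matrix

# dendrogram() runs this same validity check before drawing anything.
print(is_valid_linkage(Z))

# no_plot=True returns the layout (leaf order, link coordinates) without plotting.
info = dendrogram(Z, no_plot=True)
print(info["leaves"])
```

In a notebook, dropping `no_plot=True` draws the tree on the current matplotlib axes.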

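In short, the test output on the Iris dataset appears to be valid when I use 3 to 4 clusters, but with 5 or more clusters the linkage matrix fails to produce valid cluster indices. I have tried several approaches to fix this, such as changing the toLinkageMatrix selection process and the Array functions it calls, without success.

I have a decent conceptual understanding of Bisecting K-means clustering, but I am having trouble tracing exactly where and why the linkage matrix fails in Spark. My full Spark code is in the spark-notebook HTML, which is also in my GitHub repo.

The full Spark 2.2.0 source code used for the build is also in my repo.

The main source files that were changed are mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala and mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala.

Below are the clustering outputs for 3 and 10 clusters.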

  

Note: I also rewrote SciPy's dendrogram validity check so that it shows exactly where and why the Z linkage fails when plotting.
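
A check of this kind can be sketched as follows. This is not the author's rewritten function, only an illustration of the rule it reports on: with n leaves, row i of a SciPy linkage matrix may only reference indices below n + i, because row i is the one that creates cluster n + i. The function name and the 4-leaf example matrix are made up for the illustration.

```python
import numpy as np

def rows_using_unformed_clusters(Z):
    """Return (row, index, limit) for every index that is >= n + row."""
    Z = np.asarray(Z, dtype=float)
    n = Z.shape[0] + 1                    # number of leaves
    bad = []
    for i, (a, b) in enumerate(Z[:, :2]):
        limit = n + i                     # row i may only use ids below this
        for idx in (a, b):
            if idx >= limit:
                bad.append((i, int(idx), limit))
    return bad

# Hypothetical 4-leaf example: row 1 refers to cluster 6,
# but cluster 6 is only created by row 2.
Z_example = [
    [0, 1, 0.3, 2],
    [6, 4, 0.5, 4],
    [2, 3, 1.0, 2],
]
print(rows_using_unformed_clusters(Z_example))
```

An empty list means every row only refers to leaves or to clusters created by earlier rows.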

Iris data, 3-cluster output


    COST
    84.20375254574043

    CLUSTER CENTERS
    Sepal_Length    Sepal_Width Petal_Length    Petal_Width
    5.01    3.37    1.56    0.29
    5.95    2.77    4.45    1.45
    6.85    3.07    5.74    2.07

    ADJACENCY MATRIX
    FromNodeID  toNodeID    distance
    0   1   2.540618378626947
    0   2   2.540618378626947
    2   3   1.044390196577994
    2   4   1.044390196577994

    LOG OUTPUTS
    Feature dimension: 4.
    Number of points: 150.
    Initial cost: 681.3705999999911.
    The minimum number of points of a divisible cluster is 1.
    Dividing 1 clusters on level 1.
    Dividing 1 clusters on level 2.
    The divisible clusters needed for this iteration were : d = 1, cost =681.3705999999911, size = 150
    The divisible clusters needed for this iteration were : d = 3, cost =123.79587628866193, size = 97

    LINKAGE MATRIX 
    node1   node2   distance    tree_size
    1   2   1.044390196577994   2
    0   3   2.540618378626947   3

    SCIPY DENDROGRAM

(image: SciPy dendrogram for the 3-cluster output)
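
To make the relationship between the adjacency matrix and the linkage matrix concrete, here is a sketch of one way to convert parent-to-child edges into SciPy's format. It is not the toLinkageMatrix implementation. It assumes every internal node has exactly two children and that leaves are numbered 0..n-1 in ascending node ID order, which happens to reproduce the 3-cluster linkage matrix above. The function name is hypothetical.

```python
import numpy as np

def edges_to_linkage(edges):
    """edges: (parent, child, distance) rows, two rows per internal node."""
    children, dist = {}, {}
    for parent, child, d in edges:
        children.setdefault(parent, []).append(child)
        dist[parent] = d
    all_children = {c for kids in children.values() for c in kids}
    root = next(p for p in children if p not in all_children)
    leaves = sorted(c for c in all_children if c not in children)
    ids = {leaf: i for i, leaf in enumerate(leaves)}   # assumed leaf numbering
    size = {leaf: 1 for leaf in leaves}
    rows = []

    def visit(node):
        if node in ids:                   # leaf, or subtree already emitted
            return
        left, right = children[node]
        visit(left)                       # post-order: children before parent
        visit(right)
        size[node] = size[left] + size[right]
        rows.append([ids[left], ids[right], dist[node], size[node]])
        ids[node] = len(leaves) + len(rows) - 1   # id created by this row

    visit(root)
    return np.array(rows, dtype=float)

# Adjacency matrix of the 3-cluster run (values copied from the post).
edges3 = [
    (0, 1, 2.540618378626947),
    (0, 2, 2.540618378626947),
    (2, 3, 1.044390196577994),
    (2, 4, 1.044390196577994),
]
Z3 = edges_to_linkage(edges3)
print(Z3)
```

Because each row is emitted only after both of its children, the rows never refer to a cluster that has not been created yet, whatever the distances are.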

Iris data, 10-cluster output


    COST
    27.981293071222368

    CLUSTER CENTERS (rounded)
    Sepal_Length    Sepal_Width Petal_Length    Petal_Width
    4.68    3.08    1.45     0.2
       5     2.4     3.2    1.03
    5.07    3.46    1.44    0.28
     5.4    3.89    1.51    0.27
     5.6    2.66    4.05    1.25
    6.01    2.71    4.95    1.79
     6.4    2.97    4.55    1.41
    6.49     2.9    5.37    1.8
    6.61    3.16    5.57    2.29
    7.48    3.13     6.3    2.05

    ADJACENCY MATRIX 
    fromNodeID  toNodeID    distance
    0   4   1.7986418455383477
    0   5   1.7986418455383477
    1   6   0.32116390552184665
    1   7   0.32116390552184665
    2   0   0.48992124227033834
    2   1   0.48992124227033834
    3   2   2.540618378626947
    3   10  2.540618378626947
    8   12  0.36029111643410283
    8   13  0.36029111643410283

    LOG OUTPUTS
    Feature dimension: 4.
    Number of points: 150.
    Initial cost: 681.3705999999911.
    The minimum number of points of a divisible cluster is 1.
    Dividing 1 clusters on level 1.
    Dividing 2 clusters on level 2.
    Dividing 4 clusters on level 3.
    Dividing 2 clusters on level 4.
    The divisible clusters needed for this iteration were : 
    d = 1, cost =681.3705999999911, size = 150
    The divisible clusters needed for this iteration were : 
    d = 3, cost =123.79587628866193, size = 97
    The divisible clusters needed for this iteration were : 
    d = 4, cost =13.72863636363627, size = 22
    The divisible clusters needed for this iteration were : 
    d = 13, cost =10.73588235294028, size = 34

    LINKAGE MATRIX
    node1   node2   distance    tree_size
    2   3   0.32116390552184665 2
    7   8   0.34473006617374347 2
    5   6   0.36029111643410283 2
    17  10  0.48992124227033834 4  
    4   12  0.5802085628837165  3
    11  9   0.839611851358609   3
    14  15  1.044390196577994   6
     0  1   1.7986418455383477  2
    13  16  2.540618378626947   10

    SCIPY DENDROGRAM FAILURE TEST FUNCTION OUTPUTS

    [2.         3.         0.32116391 2.        ]
    Checking.... if indice A >= # of clusters + iteration we are on
              2.0 >= 10+0
    = False
    Checking .... if indice B >= # of clusters + iteration we are on
              3.0 >= 10+0
    = False
    -------------------------------------------------

    [7.         8.         0.34473007 2.        ]
    Checking.... if indice A >= # of clusters + iteration we are on
              7.0 >= 10+1
    = False
    Checking .... if indice B >= # of clusters + iteration we are on
              8.0 >= 10+1
    = False
    -------------------------------------------------

    [5.         6.         0.36029112 2.        ]
    Checking.... if indice A >= # of clusters + iteration we are on
              5.0 >= 10+2
    = False
    Checking .... if indice B >= # of clusters + iteration we are on
              6.0 >= 10+2
    = False
    -------------------------------------------------

    [17.         10.          0.48992124  4.        ]
    Checking.... if indice A >= # of clusters + iteration we are on
              17.0 >= 10+3
    = True
    Checking .... if indice B >= # of clusters + iteration we are on
              10.0 >= 10+3
    = False
    -------------------------------------------------

    [ 4.         12.          0.58020856  3.        ]

    Checking.... if indice A >= # of clusters + iteration we are on
              4.0 >= 10+4
    = False
    Checking .... if indice B >= # of clusters + iteration we are on
              12.0 >= 10+4
    = False
    -------------------------------------------------

    [11.          9.          0.83961185  3.        ]
    Checking.... if indice A >= # of clusters + iteration we are on
              11.0 >= 10+5
    = False
    Checking .... if indice B >= # of clusters + iteration we are on
              9.0 >= 10+5
    = False
    -------------------------------------------------

    [14.        15.         1.0443902  6.       ]
    Checking.... if indice A >= # of clusters + iteration we are on
              14.0 >= 10+6
    = False
    Checking .... if indice B >= # of clusters + iteration we are on
              15.0 >= 10+6
    = False
    -------------------------------------------------

    [0.         1.         1.79864185 2.        ]
    Checking.... if indice A >= # of clusters + iteration we are on
              0.0 >= 10+7
    = False
    Checking .... if indice B >= # of clusters + iteration we are on
              1.0 >= 10+7
    = False
    -------------------------------------------------

    [13.         16.          2.54061838 10.        ]
    Checking.... if indice A >= # of clusters + iteration we are on
              13.0 >= 10+8
    = False
    Checking .... if indice B >= # of clusters + iteration we are on
              16.0 >= 10+8
    = False
    -------------------------------------------------
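
Reading the 10-cluster linkage matrix, the only failing row is `[17, 10, ...]`. If row i is taken to create cluster 10 + i, index 17 is the `[0, 1, ...]` merge listed four rows later, and the tree sizes add up under that reading. The rows appear to be sorted by distance, and here a parent merge (0.4899) has a smaller distance than its child merge (1.7986), so the parent is listed first. The sketch below is an assumption about a possible workaround, not a confirmed fix for toLinkageMatrix: it reorders the rows so that children precede parents and renumbers the cluster ids to match. The function name is hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import is_valid_linkage

# 10-cluster linkage matrix copied from the post
# (columns: node1, node2, distance, tree_size).
Z10 = np.array([
    [2, 3, 0.32116390552184665, 2],
    [7, 8, 0.34473006617374347, 2],
    [5, 6, 0.36029111643410283, 2],
    [17, 10, 0.48992124227033834, 4],
    [4, 12, 0.5802085628837165, 3],
    [11, 9, 0.839611851358609, 3],
    [14, 15, 1.044390196577994, 6],
    [0, 1, 1.7986418455383477, 2],
    [13, 16, 2.540618378626947, 10],
], dtype=float)

def reorder_linkage(Z):
    """Put child merges before parent merges and renumber cluster ids.

    Assumes row i of the input creates cluster n + i and the rows form a tree.
    """
    Z = np.asarray(Z, dtype=float)
    m = Z.shape[0]
    n = m + 1                             # number of leaves
    height = {}

    def row_height(i):
        # 1 for a merge of two leaves, otherwise 1 + tallest child merge
        if i not in height:
            kids = [int(c) - n for c in Z[i, :2] if c >= n]
            height[i] = 1 + max((row_height(k) for k in kids), default=0)
        return height[i]

    # A child always has a smaller height than its parent, so it sorts first.
    order = sorted(range(m), key=lambda i: (row_height(i), Z[i, 2]))
    new_id = {n + old: n + new for new, old in enumerate(order)}
    out = Z[order].copy()
    for col in (0, 1):
        out[:, col] = [new_id.get(int(c), c) for c in out[:, col]]
    return out

fixed = reorder_linkage(Z10)
print(is_valid_linkage(Z10), is_valid_linkage(fixed))
print(fixed[:, :2].astype(int))
```

The reordered matrix passes SciPy's validity check, but the distances are no longer non-decreasing from row to row, so a dendrogram drawn from it will show an inversion where the 0.4899 merge sits on top of the 1.7986 merge.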

0 answers:

No answers