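I am trying to generate a dendrogram from the results of Bisecting K-Means clustering in Spark. I have found a few variations of this question online, for example here, and there is a JIRA request here. However, I have not found anyone with a working solution.
To try to implement this, I compiled Spark MLlib 2.2.0 with yu-iksw's toLinkageMatrix function and changed the log output so that it gives more information about the bisecting cluster selection process. I have uploaded this JAR together with a sample SBT build for testing, so anyone who wants to help but does not want to rebuild Spark MLlib from source can run their own tests. As you can see in the sbt build in my github repo, the mllib and mllib-local jars are in the /lib folder.
To plot my test linkage matrix output, I use a jupyter-notebook and manually pass the Spark linkage output to scipy's dendrogram. The jupyter notebook is also in my github repo here.
In short, the test output on the Iris dataset appears to work when I use 3-4 clusters, but when I try 5 or more clusters the linkage matrix fails to produce valid cluster indices. I have tried a few different approaches to fix this, such as changing the toLinkageMatrix selection process and the Array functions it calls, but to no avail.
I have a decent conceptual understanding of Bisecting K-Means clustering, but I am having trouble tracing exactly where and why the linkage matrix fails in Spark. If you look at my spark-notebook HTML, you can also see my full Spark code in my github repo.
The full Spark 2.2.0 source code used for the build is also in my repo here.
The main source files that were changed are here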
mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala
and here
mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala
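Below are the clustering outputs using 3 and 10 clusters.
Note: I also rewrote the scipy dendrogram test function to show exactly where and why the Z linkage fails when plotting.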
COST
84.20375254574043
CLUSTER CENTERS
Sepal_Length Sepal_Width Petal_Length Petal_Width
5.01 3.37 1.56 0.29
5.95 2.77 4.45 1.45
6.85 3.07 5.74 2.07
ADJACENCY MATRIX
fromNodeID toNodeID distance
0 1 2.540618378626947
0 2 2.540618378626947
2 3 1.044390196577994
2 4 1.044390196577994
LOG OUTPUTS
Feature dimension: 4.
Number of points: 150.
Initial cost: 681.3705999999911.
The minimum number of points of a divisible cluster is 1.
Dividing 1 clusters on level 1.
Dividing 1 clusters on level 2.
The divisible clusters needed for this iteration were : d = 1, cost =681.3705999999911, size = 150
The divisible clusters needed for this iteration were : d = 3, cost =123.79587628866193, size = 97
LINKAGE MATRIX
node1 node2 distance tree_size
1 2 1.044390196577994 2
0 3 2.540618378626947 3
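For reference, the relationship between the adjacency matrix and the linkage matrix in the 3-cluster case can be sketched in plain Python. This is a minimal sketch and not the code of toLinkageMatrix: the function name `tree_edges_to_linkage`, the assumption that every internal node has exactly two children, and the rule that leaves are numbered in order of their node ID are all assumptions made for illustration. It walks the tree in post-order, so a merged cluster is always emitted before the row that refers to it.

```python
from collections import defaultdict


def tree_edges_to_linkage(edges):
    """Convert (parent, child, distance) edges of a binary tree into
    scipy-style linkage rows [id_a, id_b, distance, tree_size].

    Hypothetical helper for illustration; leaf numbering is an assumption.
    """
    children = defaultdict(list)
    split_distance = {}
    for parent, child, distance in edges:
        children[parent].append(child)
        split_distance[parent] = distance

    all_children = {child for _, child, _ in edges}
    root = next(p for p in children if p not in all_children)

    # Leaves are nodes that never appear as a parent; number them 0..n-1.
    leaves = sorted(node for node in all_children if node not in children)
    leaf_id = {node: i for i, node in enumerate(leaves)}
    n_leaves = len(leaves)

    rows = []

    def visit(node):
        """Return (linkage id, size) of the subtree rooted at node."""
        if node in leaf_id:
            return leaf_id[node], 1
        (a_id, a_size), (b_id, b_size) = [visit(c) for c in children[node]]
        rows.append([a_id, b_id, split_distance[node], a_size + b_size])
        # The row just appended defines cluster id n_leaves + its row index.
        return n_leaves + len(rows) - 1, a_size + b_size

    visit(root)
    return rows


# Adjacency matrix of the 3-cluster run shown above.
edges_3 = [
    (0, 1, 2.540618378626947),
    (0, 2, 2.540618378626947),
    (2, 3, 1.044390196577994),
    (2, 4, 1.044390196577994),
]
print(tree_edges_to_linkage(edges_3))
```

With these inputs the two rows match the 3-cluster linkage matrix above. Note that post-order output is not sorted by distance, so it only coincides with a distance-sorted matrix when every parent split has a larger distance than its child splits.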
SCIPY DENDROGRAM
COST
27.981293071222368
CLUSTER CENTERS (rounded)
Sepal_Length Sepal_Width Petal_Length Petal_Width
4.68 3.08 1.45 0.2
5 2.4 3.2 1.03
5.07 3.46 1.44 0.28
5.4 3.89 1.51 0.27
5.6 2.66 4.05 1.25
6.01 2.71 4.95 1.79
6.4 2.97 4.55 1.41
6.49 2.9 5.37 1.8
6.61 3.16 5.57 2.29
7.48 3.13 6.3 2.05
ADJACENCY MATRIX
fromNodeID toNodeID distance
0 4 1.7986418455383477
0 5 1.7986418455383477
1 6 0.32116390552184665
1 7 0.32116390552184665
2 0 0.48992124227033834
2 1 0.48992124227033834
3 2 2.540618378626947
3 10 2.540618378626947
8 12 0.36029111643410283
8 13 0.36029111643410283
LOG OUTPUTS
Feature dimension: 4.
Number of points: 150.
Initial cost: 681.3705999999911.
The minimum number of points of a divisible cluster is 1.
Dividing 1 clusters on level 1.
Dividing 2 clusters on level 2.
Dividing 4 clusters on level 3.
Dividing 2 clusters on level 4.
The divisible clusters needed for this iteration were :
d = 1, cost =681.3705999999911, size = 150
The divisible clusters needed for this iteration were :
d = 3, cost =123.79587628866193, size = 97
The divisible clusters needed for this iteration were :
d = 4, cost =13.72863636363627, size = 22
The divisible clusters needed for this iteration were :
d = 13, cost =10.73588235294028, size = 34
LINKAGE MATRIX
node1 node2 distance tree_size
2 3 0.32116390552184665 2
7 8 0.34473006617374347 2
5 6 0.36029111643410283 2
17 10 0.48992124227033834 4
4 12 0.5802085628837165 3
11 9 0.839611851358609 3
14 15 1.044390196577994 6
0 1 1.7986418455383477 2
13 16 2.540618378626947 10
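One thing that can be read off this matrix: with 10 leaves, row i defines cluster id 10 + i, so id 17 is defined by the row `0 1 1.7986...` (row index 7), yet it is already used by the row `17 10 0.4899...` (row index 3). The tree sizes are all consistent, so the rows appear to describe a valid tree listed in an order where a parent merge comes before one of its children. The following is a minimal sketch of one possible workaround, reordering the rows so that children come first and renumbering the ids. The function name `reorder_linkage` is hypothetical, and this is not a change to toLinkageMatrix itself.

```python
def reorder_linkage(rows, n_leaves):
    """Reorder linkage rows so each cluster id is defined before it is used.

    Assumes row i of the input defines cluster id n_leaves + i.
    Hypothetical helper for illustration only.
    """
    # Maps an id in the input numbering to its id in the output numbering.
    new_id = {leaf: leaf for leaf in range(n_leaves)}
    ordered = []
    pending = list(enumerate(rows))

    while pending:
        still_pending = []
        for i, (a, b, distance, size) in pending:
            if a in new_id and b in new_id:
                ordered.append([new_id[a], new_id[b], distance, size])
                new_id[n_leaves + i] = n_leaves + len(ordered) - 1
            else:
                # At least one child has not been emitted yet; retry later.
                still_pending.append((i, (a, b, distance, size)))
        if len(still_pending) == len(pending):
            raise ValueError("rows reference ids that are never defined")
        pending = still_pending

    return ordered


# Linkage matrix of the 10-cluster run shown above.
Z10 = [
    (2, 3, 0.32116390552184665, 2),
    (7, 8, 0.34473006617374347, 2),
    (5, 6, 0.36029111643410283, 2),
    (17, 10, 0.48992124227033834, 4),
    (4, 12, 0.5802085628837165, 3),
    (11, 9, 0.839611851358609, 3),
    (14, 15, 1.044390196577994, 6),
    (0, 1, 1.7986418455383477, 2),
    (13, 16, 2.540618378626947, 10),
]

for row in reorder_linkage(Z10, 10):
    print(row)
```

The reordered rows are no longer sorted by distance: the merge at 0.4899 now follows its child merge at 1.7986. Whether a dendrogram drawn from such a matrix is meaningful is not verified here.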
SCIPY DENDROGRAM FAILURE TEST FUNCTION OUTPUTS
[2. 3. 0.32116391 2. ]
Checking.... if indice A >= # of clusters + iteration we are on
2.0 >= 10+0
= False
Checking .... if indice B >= # of clusters + iteration we are on
3.0 >= 10+0
= False
-------------------------------------------------
[7. 8. 0.34473007 2. ]
Checking.... if indice A >= # of clusters + iteration we are on
7.0 >= 10+1
= False
Checking .... if indice B >= # of clusters + iteration we are on
8.0 >= 10+1
= False
-------------------------------------------------
[5. 6. 0.36029112 2. ]
Checking.... if indice A >= # of clusters + iteration we are on
5.0 >= 10+2
= False
Checking .... if indice B >= # of clusters + iteration we are on
6.0 >= 10+2
= False
-------------------------------------------------
[17. 10. 0.48992124 4. ]
Checking.... if indice A >= # of clusters + iteration we are on
17.0 >= 10+3
= True
Checking .... if indice B >= # of clusters + iteration we are on
10.0 >= 10+3
= False
-------------------------------------------------
[ 4. 12. 0.58020856 3. ]
Checking.... if indice A >= # of clusters + iteration we are on
4.0 >= 10+4
= False
Checking .... if indice B >= # of clusters + iteration we are on
12.0 >= 10+4
= False
-------------------------------------------------
[11. 9. 0.83961185 3. ]
Checking.... if indice A >= # of clusters + iteration we are on
11.0 >= 10+5
= False
Checking .... if indice B >= # of clusters + iteration we are on
9.0 >= 10+5
= False
-------------------------------------------------
[14. 15. 1.0443902 6. ]
Checking.... if indice A >= # of clusters + iteration we are on
14.0 >= 10+6
= False
Checking .... if indice B >= # of clusters + iteration we are on
15.0 >= 10+6
= False
-------------------------------------------------
[0. 1. 1.79864185 2. ]
Checking.... if indice A >= # of clusters + iteration we are on
0.0 >= 10+7
= False
Checking .... if indice B >= # of clusters + iteration we are on
1.0 >= 10+7
= False
-------------------------------------------------
[13. 16. 2.54061838 10. ]
Checking.... if indice A >= # of clusters + iteration we are on
13.0 >= 10+8
= False
Checking .... if indice B >= # of clusters + iteration we are on
16.0 >= 10+8
= False
-------------------------------------------------
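The check printed above can be sketched in plain Python as follows. This is a minimal sketch of the rule being tested (row i of a linkage matrix may only refer to ids smaller than the number of leaves plus i), not the actual test function from the notebook. The name `find_invalid_rows` is hypothetical.

```python
def find_invalid_rows(linkage_rows, n_leaves):
    """Return the indices of rows that use a cluster id before it exists."""
    invalid = []
    for i, (a, b, _distance, _size) in enumerate(linkage_rows):
        # Row i creates cluster id n_leaves + i, so both ids must be smaller.
        limit = n_leaves + i
        if a >= limit or b >= limit:
            invalid.append(i)
    return invalid


# Linkage matrix of the 3-cluster run.
Z3 = [
    (1, 2, 1.044390196577994, 2),
    (0, 3, 2.540618378626947, 3),
]

# Linkage matrix of the 10-cluster run.
Z10 = [
    (2, 3, 0.32116390552184665, 2),
    (7, 8, 0.34473006617374347, 2),
    (5, 6, 0.36029111643410283, 2),
    (17, 10, 0.48992124227033834, 4),
    (4, 12, 0.5802085628837165, 3),
    (11, 9, 0.839611851358609, 3),
    (14, 15, 1.044390196577994, 6),
    (0, 1, 1.7986418455383477, 2),
    (13, 16, 2.540618378626947, 10),
]

print(find_invalid_rows(Z3, 3))    # []
print(find_invalid_rows(Z10, 10))  # [3]
```

Only row index 3 fails, which matches the single `True` in the output above: id 17 is used there, but it is not created until row index 7.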