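After running k-means (mllib, Spark, Scala), I want to make sense of the cluster centers I get from data that was preprocessed with mllib's OneHotEncoder (and other transformers).
The centers look like this: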
Cluster Center 0 [0.3496378699559276,0.05482645034473324,111.6962521358467,1.770525792286160.0,0.0,08561916265130964,0.014382183950365071,0.0,0.0,0.0,0.47699722692567864,0.0,0.0,0.0,0.04988557988346689,0.0,0.0,0.0,0.8981811028926263,0.9695107580117296,0.0,0.0 ,1.7505886931570156,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0 ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0 ,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0 ,0.0,0.0,0.0,0.0,0.0,17.771620072281845,0.0,0.0,0.0,0.0]
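This is obviously not very human-friendly... Any ideas on how to revert the one-hot encoding and retrieve the original categorical features? What if I find the data point closest to the centroid (using the same distance metric that k-means uses, which I assume is Euclidean distance) and then revert the encoding of that particular data point?
Answer 0 (score: 1)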
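For the cluster centroids it is not possible (and strongly discouraged) to reverse the encoding. Imagine your original feature has the value "3" out of 6 possible values and is encoded as [0.0,0.0,1.0,0.0,0.0,0.0]. In this case it is easy to extract 3 as the correct feature value from the encoding.
But after applying k-means you may get a cluster centroid that looks like [0.0,0.13,0.0,0.77,0.1,0.0] for this feature. If you decode that back to your previous representation, say "4" out of 6 because value 4 has the largest component, you lose information and the model may be corrupted.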
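To make the information loss concrete, here is a minimal sketch in plain Scala (no Spark). It assumes the one-hot columns were not rescaled before clustering, in which case a converged centroid is just the component-wise mean of its members, so each component is the share of members in that category. The object name, the helper names, and the 13/77/10 cluster are made up for illustration.

```scala
object CentroidDecoding {
  // One-hot encode a 1-based category index among `size` categories.
  def oneHot(category: Int, size: Int): Vector[Double] =
    Vector.tabulate(size)(i => if (i == category - 1) 1.0 else 0.0)

  // Naive decoding: 1-based position of the largest component (first one on ties).
  def argmaxDecode(v: Vector[Double]): Int =
    v.indices.reduceLeft((best, i) => if (v(i) > v(best)) i else best) + 1

  // Component-wise mean of a non-empty set of equally sized vectors.
  def mean(points: Seq[Vector[Double]]): Vector[Double] =
    points.transpose.map(_.sum / points.size).toVector

  // Hypothetical cluster: 13 points of category 2, 77 of category 4, 10 of category 5.
  val members: Seq[Vector[Double]] =
    Seq.fill(13)(oneHot(2, 6)) ++ Seq.fill(77)(oneHot(4, 6)) ++ Seq.fill(10)(oneHot(5, 6))

  // Under the assumption above, this plays the role of the centroid slice.
  val centroid: Vector[Double] = mean(members)

  def main(args: Array[String]): Unit = {
    println(argmaxDecode(oneHot(3, 6))) // a real data point decodes cleanly
    println(centroid)                   // category shares within the cluster
    println(argmaxDecode(centroid))     // keeps only the majority category
  }
}
```

Decoding the centroid with argmax reports only the majority category and silently drops the 23% of members that belong to the other two. If you need something readable, reading the components as proportions is usually more honest than forcing a single label.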
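Edit: adding a possible way, taken from the comments, to get back the original values of individual data points.
If your data points have IDs, you can assign the data points to clusters and then perform a select/join on the ID to get back their state before encoding.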
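A rough sketch of that ID-based approach, again in plain Scala collections so it runs without a cluster. The `Record` type, the `color` column, the sample data, and the two centroids are all invented for illustration. In Spark the assignment step would presumably come from `KMeansModel.predict`, and the lookup would be a join on the ID column rather than a `Map`.

```scala
object RestoreByIdSketch {
  // Hypothetical original record; `id` is shared with the encoded vector.
  final case class Record(id: Long, color: String)

  def squaredDistance(a: Vector[Double], b: Vector[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the nearest centroid (Euclidean distance, first one on ties).
  def assign(point: Vector[Double], centroids: Vector[Vector[Double]]): Int =
    centroids.indices.reduceLeft { (best, i) =>
      if (squaredDistance(point, centroids(i)) < squaredDistance(point, centroids(best))) i
      else best
    }

  // Data before encoding.
  val originals: Seq[Record] =
    Seq(Record(1L, "red"), Record(2L, "red"), Record(3L, "blue"), Record(4L, "green"))

  // Same rows after one-hot encoding (red, blue, green), still carrying their id.
  val encoded: Seq[(Long, Vector[Double])] = Seq(
    1L -> Vector(1.0, 0.0, 0.0),
    2L -> Vector(1.0, 0.0, 0.0),
    3L -> Vector(0.0, 1.0, 0.0),
    4L -> Vector(0.0, 0.0, 1.0)
  )

  // Made-up centroids standing in for the trained model's cluster centers.
  val centroids: Vector[Vector[Double]] =
    Vector(Vector(1.0, 0.0, 0.0), Vector(0.0, 0.5, 0.5))

  // Step 1: assign every encoded point to a cluster, keyed by id.
  val clusterById: Map[Long, Int] =
    encoded.map { case (id, v) => id -> assign(v, centroids) }.toMap

  // Step 2: join back to the original records on id and count categories per cluster.
  val categoriesByCluster: Map[Int, Map[String, Int]] =
    originals.groupBy(r => clusterById(r.id)).map { case (cluster, rows) =>
      cluster -> rows.groupBy(_.color).map { case (color, group) => color -> group.size }
    }

  def main(args: Array[String]): Unit =
    categoriesByCluster.toSeq.sortBy(_._1).foreach(println)
}
```

This way each cluster is described by the original categorical values of its members, and the centroid itself never has to be decoded.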