Question

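Thanks to the insightful answers in this thread: Pairwise Wasserstein distance on 2 arrays, I was able to come up with a custom function that computes a distance metric between a set of 2D arrays (10 points each, with x, y coordinates). My next step is to find a way to feed this information to an agglomerative clustering algorithm, such as the fcluster() function of the scipy.cluster.hierarchy module.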

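More specifically, I would like to use the functions below to find, ideally, a set of n clusters for a 3-dimensional data array. I am not sure how to adapt the pairwise_wasserstein function so that it produces the distance matrix that fcluster needs to find cluster assignments agglomeratively.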

Thanks in advance for any ideas!

import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial import distance  # needed for distance.cdist below
from scipy.cluster.hierarchy import dendrogram, linkage, ward
from scipy.cluster.hierarchy import fcluster

data = np.array([[[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]],
                 [[5, 6], [7, 8], [5, 6], [7, 8], [5, 6], [7, 8], [5, 6], [7, 8], [5, 6], [7, 8]],
                 [[1, 15], [3, 2], [1, 2], [5, 4], [1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]],
                 [[5, 1], [7, 8], [5, 6], [7, 1], [5, 6], [7, 8], [5, 1], [7, 8], [5, 6], [7, 8]]])


def wasserstein_distance_function(f1, f2):
    min_cost = np.inf
    f1 = f1.reshape((10, 2))
    f2 = f2.reshape((10, 2))
    for l in np.linspace(0.8, 1.2, 3):
        for k in np.linspace(0.8, 1.2, 3):
            cost = distance.cdist(l * f1, k * f2, 'sqeuclidean')
            row_ind, col_ind = linear_sum_assignment(cost)
            curr_cost = cost[row_ind, col_ind].sum()
            if curr_cost < min_cost:
                min_cost = curr_cost
    return min_cost

def pairwise_wasserstein(points):
    """
    Helper function to perform the pairwise distance function of all points within 'points' parameter
    """
    for first_index in range(0,points.shape[0]):
      for second_index in range(first_index+1,points.shape[0]):
        print("First index: ", first_index, ", Second index: ", second_index, ", Distance: ",wasserstein_distance_function(points[first_index],points[second_index]))

def find_clusters_formation(data):
    """
    Method to find the clusters for the points array
    """
    dist_mat = pairwise_wasserstein(data)
    Z = ward(dist_mat)
    cluster = fcluster(Z, 3, criterion='maxclust')

Answer 1

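If you want to use a precomputed metric, you have to create a distance matrix, which is a square matrix with zeros on the diagonal. The diagonal is zero because the distance from a point to itself is zero. You then pass this matrix as the argument to the fit_predict function of the clustering algorithm.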

Step 1 - create the distance matrix by computing the distances between the data points:

distance_matrix = np.asarray([
    [wasserstein_distance_function(data[first_index], data[second_index]) 
         for first_index in range(len(data))] 
             for second_index in range(len(data))])

Displaying distance_matrix gives the following:

array([[  0.  , 100.8 ,  76.4 ,  96.32],
       [100.8 ,   0.  , 215.  ,  55.68],
       [ 76.4 , 215.  ,   0.  , 186.88],
       [ 96.32,  55.68, 186.88,   0.  ]])
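
Because both scale factors run over the same grid, the distance should be symmetric, so each pair only needs to be computed once. The following is a minimal sketch, not part of the original answer, that fills the upper triangle and mirrors it. `build_distance_matrix` is a hypothetical helper name; the data and the distance function are copied from the question.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial import distance

# Data and distance function copied from the question.
data = np.array([[[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]],
                 [[5, 6], [7, 8], [5, 6], [7, 8], [5, 6], [7, 8], [5, 6], [7, 8], [5, 6], [7, 8]],
                 [[1, 15], [3, 2], [1, 2], [5, 4], [1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]],
                 [[5, 1], [7, 8], [5, 6], [7, 1], [5, 6], [7, 8], [5, 1], [7, 8], [5, 6], [7, 8]]])


def wasserstein_distance_function(f1, f2):
    min_cost = np.inf
    f1 = f1.reshape((10, 2))
    f2 = f2.reshape((10, 2))
    for l in np.linspace(0.8, 1.2, 3):
        for k in np.linspace(0.8, 1.2, 3):
            cost = distance.cdist(l * f1, k * f2, 'sqeuclidean')
            row_ind, col_ind = linear_sum_assignment(cost)
            min_cost = min(min_cost, cost[row_ind, col_ind].sum())
    return min_cost


def build_distance_matrix(points):
    """Hypothetical helper: square matrix with a zero diagonal."""
    n = len(points)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # Compute each unordered pair once and mirror it.
            dist[i, j] = dist[j, i] = wasserstein_distance_function(points[i], points[j])
    return dist


distance_matrix = build_distance_matrix(data)
print(distance_matrix)
```

Mirroring the values also guarantees that the matrix is exactly symmetric, which some validation routines check strictly.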

Step 2 - instantiate the clustering algorithm with the parameters you need:

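from sklearn.cluster import AgglomerativeClustering

# Note: scikit-learn 1.2 deprecated `affinity` in favor of `metric`, and 1.4 removed it;
# on those versions pass metric="precomputed" instead.
clusterer = AgglomerativeClustering(n_clusters=3, affinity="precomputed", linkage="average", distance_threshold=None)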

Step 3 - extract the labels:

clusterer.fit_predict(distance_matrix)

This prints:

array([2, 0, 1, 0], dtype=int64)

Does this do what you want?

Answer 2

Update:

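I may have gotten this to work by flattening the x and y coordinates of all 10 players into a [1, 20] array of the form [x1, y1, x2, y2, ..., x10, y10] and then reshaping it inside wasserstein_distance_function, as shown above.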

I am not yet 100% sure that this is valid, but the initial results look promising (i.e., fairly balanced clusters).
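
One way to read this update, sketched here under assumptions rather than taken from the poster's actual code: flatten each sample to a row of 20 values and let scipy's pdist call the custom function on every pair of rows. pdist accepts a callable metric and returns the condensed distance vector, which linkage consumes directly. The data and the distance function are copied from the question; the variable names `flat`, `condensed`, and `square` are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial import distance

# Data and distance function copied from the question.
data = np.array([[[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]],
                 [[5, 6], [7, 8], [5, 6], [7, 8], [5, 6], [7, 8], [5, 6], [7, 8], [5, 6], [7, 8]],
                 [[1, 15], [3, 2], [1, 2], [5, 4], [1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]],
                 [[5, 1], [7, 8], [5, 6], [7, 1], [5, 6], [7, 8], [5, 1], [7, 8], [5, 6], [7, 8]]])


def wasserstein_distance_function(f1, f2):
    min_cost = np.inf
    # Each flattened row [x1, y1, ..., x10, y10] is turned back into 10 points.
    f1 = f1.reshape((10, 2))
    f2 = f2.reshape((10, 2))
    for l in np.linspace(0.8, 1.2, 3):
        for k in np.linspace(0.8, 1.2, 3):
            cost = distance.cdist(l * f1, k * f2, 'sqeuclidean')
            row_ind, col_ind = linear_sum_assignment(cost)
            min_cost = min(min_cost, cost[row_ind, col_ind].sum())
    return min_cost


# Flatten (4, 10, 2) to (4, 20): one row per sample.
flat = data.reshape(len(data), -1).astype(float)

# pdist calls the metric on every pair of rows (i < j).
condensed = distance.pdist(flat, metric=wasserstein_distance_function)

# Square form, in case a full matrix is needed (e.g. for a "precomputed" estimator).
square = distance.squareform(condensed)
print(square)
```

The condensed vector can be passed to scipy's linkage as is, while the square matrix suits estimators that take a precomputed distance matrix.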

Agglomerative clustering with a custom pairwise distance function
