Question

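I am trying to convert a dictionary into a distance matrix that I can then use as input to hierarchical clustering. My input is:

Key: a tuple of length 2 holding the two objects for which I have a distance

Value: the actual distance value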

for k, v in obj_distances.items():
    print(k, v)

The result is:

('obj1', 'obj2') 2.0 
('obj3', 'obj4') 1.58
('obj1','obj3') 1.95
('obj2', 'obj3') 1.80

My question is: how can I convert this into a distance matrix that can later be clustered in scipy?

Answer 1

Use pandas and unstack the DataFrame:

import pandas as pd

data = {('obj1', 'obj2'): 2.0 ,
('obj3', 'obj4'): 1.58,
('obj1','obj3'): 1.95,
('obj2', 'obj3'): 1.80,}

df = pd.DataFrame.from_dict(data, orient='index')
df.index = pd.MultiIndex.from_tuples(df.index.tolist())
dist_matrix = df.unstack().values

This yields:

In [15]: dist_matrix
Out[15]:

array([[2.  , 1.95,  nan],
       [ nan, 1.8 ,  nan],
       [ nan,  nan, 1.58]])
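
Note that the array above is not square: its rows are obj1 to obj3 and its columns are obj2 to obj4. If a square, symmetric matrix is needed, one possible sketch is to reindex the unstacked frame onto the full label set and mirror it across the diagonal. The names `labels`, `upper`, `full` and `dist` are only illustrative, and pairs with no known distance stay NaN.

```python
import numpy as np
import pandas as pd

data = {('obj1', 'obj2'): 2.0,
        ('obj3', 'obj4'): 1.58,
        ('obj1', 'obj3'): 1.95,
        ('obj2', 'obj3'): 1.80}

# A dict with tuple keys becomes a Series with a two-level MultiIndex.
s = pd.Series(data)

# Every object name that appears on either side of a pair.
labels = sorted(set(s.index.get_level_values(0)) | set(s.index.get_level_values(1)))

# Rows and columns both cover all labels, so the frame is square.
upper = s.unstack().reindex(index=labels, columns=labels)

# Fill each missing cell from its mirror image across the diagonal.
full = upper.combine_first(upper.T)

# Work on a copy so the diagonal can be set to 0 (distance to itself).
square = full.to_numpy(dtype=float, copy=True)
np.fill_diagonal(square, 0.0)
dist = pd.DataFrame(square, index=labels, columns=labels)
print(dist)
```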

Answer 2

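This is slower than the other answer posted, but it ensures that values both above and below the main diagonal are included, if that matters to you: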

import pandas as pd

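# obj_distances is the dictionary from the question.
unique_ids = sorted(set([x for y in obj_distances.keys() for x in y]))
df = pd.DataFrame(index=unique_ids, columns=unique_ids)

# Write each distance into both (a, b) and (b, a) so the matrix is symmetric.
for k, v in obj_distances.items():
    df.loc[k[0], k[1]] = v
    df.loc[k[1], k[0]] = v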

Result:

      obj1 obj2  obj3  obj4
obj1   NaN    2  1.95   NaN
obj2     2  NaN   1.8   NaN
obj3  1.95  1.8   NaN  1.58
obj4   NaN  NaN  1.58   NaN
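
To hand a square matrix like this to scipy, the diagonal has to be 0 and there can be no NaN left, which is not the case for the four distances in the question. The following is a rough sketch under the assumption that every pair has a distance; the complete dictionary here is illustrative, and `square` and `condensed` are names made up for the example. `scipy.spatial.distance.squareform` then turns the square matrix into the condensed 1-D form.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import squareform

# Illustrative, complete set of pairwise distances (every pair is present).
obj_distances = {
    ('obj1', 'obj2'): 2.0,
    ('obj1', 'obj3'): 1.95,
    ('obj1', 'obj4'): 2.5,
    ('obj2', 'obj3'): 1.8,
    ('obj2', 'obj4'): 2.1,
    ('obj3', 'obj4'): 1.58,
}

unique_ids = sorted({name for pair in obj_distances for name in pair})
df = pd.DataFrame(np.nan, index=unique_ids, columns=unique_ids)
for (a, b), d in obj_distances.items():
    df.loc[a, b] = d
    df.loc[b, a] = d

# Copy into a plain float array and set the self-distances to 0.
square = df.to_numpy(dtype=float, copy=True)
np.fill_diagonal(square, 0.0)

# Any NaN left over means some pair had no distance.
if np.isnan(square).any():
    raise ValueError("some pairs have no distance")

# Condensed form: 1-D array of length n * (n - 1) / 2.
condensed = squareform(square)
print(condensed)
```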

Answer 3

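You said you will use scipy for clustering, so I assume that means you will use the function scipy.cluster.hierarchy.linkage. linkage accepts the distance data in "condensed" form, so you don't have to create the full symmetric distance matrix. (See, for example, How does condensed distance matrix work? (pdist), for a discussion of the condensed format.)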
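
As a small illustration of the condensed format (a sketch with illustrative labels and values, not part of the solution itself): for n objects it is a flat sequence of n*(n-1)/2 distances, ordered like the pairs from `itertools.combinations` over the sorted labels. `squareform` expands it back to the full symmetric matrix.

```python
from itertools import combinations

import numpy as np
from scipy.spatial.distance import squareform

labels = ['obj1', 'obj2', 'obj3', 'obj4']

# One distance per unordered pair: 4 * 3 / 2 = 6 values.
condensed = np.array([2.0, 1.95, 2.5, 1.8, 2.1, 1.58])

# The i-th condensed value belongs to the i-th pair in this order:
# (obj1, obj2), (obj1, obj3), (obj1, obj4), (obj2, obj3), (obj2, obj4), (obj3, obj4)
pairs = list(combinations(labels, 2))
for pair, d in zip(pairs, condensed):
    print(pair, d)

# Expand to the full symmetric matrix with zeros on the diagonal.
square = squareform(condensed)
print(square)
```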

So all you have to do is get obj_distances.values() into a known order and pass that to linkage. That is what the following snippet does:

from scipy.cluster.hierarchy import linkage, dendrogram

obj_distances = {
    ('obj2', 'obj3'): 1.8,
    ('obj3', 'obj1'): 1.95,
    ('obj1', 'obj4'): 2.5,
    ('obj1', 'obj2'): 2.0,
    ('obj4', 'obj2'): 2.1,
    ('obj3', 'obj4'): 1.58,
}

# Put each key pair in a canonical order, so we know that if (a, b) is a key,
# then a < b.  If this is already true, then the next three lines can be
# replaced with
#     sorted_keys, distances = zip(*sorted(obj_distances.items()))
# Note: we assume there are no keys where the two objects are the same.
keys = [sorted(k) for k in obj_distances.keys()]
values = obj_distances.values()
sorted_keys, distances = zip(*sorted(zip(keys, values)))

# linkage accepts the "condensed" format of the distances.
Z = linkage(distances)

# Optional: create a sorted list of the objects.
labels = sorted(set([key[0] for key in sorted_keys] + [sorted_keys[-1][-1]]))

dendrogram(Z, labels=labels)

[Figure: the resulting dendrogram]
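
One caveat: the condensed form needs a distance for every pair of objects. The four distances in the question leave out (obj1, obj4) and (obj2, obj4), which is why the example above has six entries. A possible guard, sketched here with a hypothetical helper named `condensed_from_pairs`, is to build the condensed list from the expected pairs and raise if any are missing:

```python
from itertools import combinations

import numpy as np
from scipy.cluster.hierarchy import linkage


def condensed_from_pairs(pair_distances):
    """Return (labels, condensed distances); raise if any pair is missing."""
    # frozenset keys make ('a', 'b') and ('b', 'a') look up the same entry.
    lookup = {frozenset(pair): d for pair, d in pair_distances.items()}
    labels = sorted({name for pair in pair_distances for name in pair})
    wanted = list(combinations(labels, 2))
    missing = [p for p in wanted if frozenset(p) not in lookup]
    if missing:
        raise ValueError(f"no distance given for: {missing}")
    return labels, [lookup[frozenset(p)] for p in wanted]


obj_distances = {
    ('obj2', 'obj3'): 1.8,
    ('obj3', 'obj1'): 1.95,
    ('obj1', 'obj4'): 2.5,
    ('obj1', 'obj2'): 2.0,
    ('obj4', 'obj2'): 2.1,
    ('obj3', 'obj4'): 1.58,
}

labels, distances = condensed_from_pairs(obj_distances)
Z = linkage(np.asarray(distances, dtype=float))
print(labels)
print(Z)
```

With an incomplete dictionary such as the one in the question, the helper raises a ValueError listing the missing pairs instead of passing a too-short array to linkage.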
