Convert distance pairs to a distance matrix for use in hierarchical clustering

Time: 2018-08-03 14:04:14

Tags: python scipy scikit-learn hierarchical-clustering

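I am trying to convert a dictionary into a distance matrix that I can then use as input to hierarchical clustering. My input is a dictionary with:

  • Keys: tuples of length 2 naming the two objects whose distance I have
  • Values: the actual distance value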

    for k, v in obj_distances.items():
        print(k, v)

The result is:

('obj1', 'obj2') 2.0
('obj3', 'obj4') 1.58
('obj1', 'obj3') 1.95
('obj2', 'obj3') 1.8

My question is how to convert this into a distance matrix that I can later use for clustering in scipy.

3 Answers:

Answer 0: (score: 3)

Use pandas and unstack the DataFrame:

import pandas as pd

data = {('obj1', 'obj2'): 2.0,
        ('obj3', 'obj4'): 1.58,
        ('obj1', 'obj3'): 1.95,
        ('obj2', 'obj3'): 1.80}

df = pd.DataFrame.from_dict(data, orient='index')
df.index = pd.MultiIndex.from_tuples(df.index.tolist())
dist_matrix = df.unstack().values

which yields:

In [15]: dist_matrix
Out[15]:

array([[2.  , 1.95,  nan],
       [ nan, 1.8 ,  nan],
       [ nan,  nan, 1.58]])
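Note that the array above is not square: its rows correspond to the first element of each key (obj1 to obj3) and its columns to the second element (obj2 to obj4). If a square, symmetric matrix is what you need, one possible approach is to reindex over the union of all labels and mirror the known values across the diagonal. The following is only a sketch; the names `wide`, `labels`, `square` and `sym` are introduced here for illustration and are not part of the answer above.

```python
import numpy as np
import pandas as pd

data = {('obj1', 'obj2'): 2.0,
        ('obj3', 'obj4'): 1.58,
        ('obj1', 'obj3'): 1.95,
        ('obj2', 'obj3'): 1.80}

# Same unstacking idea as above, built from an explicit MultiIndex.
index = pd.MultiIndex.from_tuples(list(data.keys()))
wide = pd.Series(list(data.values()), index=index).unstack()

# Reindex rows and columns over the union of all labels to get a square frame.
labels = sorted(set(wide.index) | set(wide.columns))
square = wide.reindex(index=labels, columns=labels).to_numpy(dtype=float)

# Mirror known values across the diagonal; pairs absent from `data` stay NaN.
sym = np.where(np.isnan(square), square.T, square)
np.fill_diagonal(sym, 0.0)  # assumed: distance from an object to itself is 0

print(pd.DataFrame(sym, index=labels, columns=labels))
```

The pairs (obj1, obj4) and (obj2, obj4) remain NaN because the dictionary holds no distance for them; they would have to be supplied before the matrix can be used for clustering.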

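Answer 1: (score: 1)

This will be slower than the other answer posted, but it ensures that values both above and below the main diagonal are included, if that matters to you: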

import pandas as pd

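# Collect every object name that appears in any key pair.
unique_ids = sorted({x for pair in obj_distances for x in pair})
df = pd.DataFrame(index=unique_ids, columns=unique_ids)

# Fill both (a, b) and (b, a) so the matrix is symmetric.
for (a, b), v in obj_distances.items():
    df.loc[a, b] = v
    df.loc[b, a] = v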

Result:

      obj1 obj2  obj3  obj4
obj1   NaN    2  1.95   NaN
obj2     2  NaN   1.8   NaN
obj3  1.95  1.8   NaN  1.58
obj4   NaN  NaN  1.58   NaN
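A square DataFrame like this should not be passed to scipy's `linkage` directly, because `linkage` treats a 2-D input as a set of observation vectors rather than as distances. A possible next step, sketched below, is to set the diagonal to zero, check that no pair is missing, and convert to the condensed form with `scipy.spatial.distance.squareform`. The helper name `to_condensed` is hypothetical, and the six-pair dictionary is example data chosen so that every pair has a distance.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import squareform


def to_condensed(df):
    """Hypothetical helper: square symmetric DataFrame -> condensed vector."""
    m = df.to_numpy(dtype=float, copy=True)
    np.fill_diagonal(m, 0.0)  # assumed: self-distance is zero
    if np.isnan(m).any():
        # Report the missing pairs found above the diagonal.
        rows, cols = np.where(np.triu(np.isnan(m), k=1))
        missing = [(df.index[i], df.columns[j]) for i, j in zip(rows, cols)]
        raise ValueError(f"missing distances for pairs: {missing}")
    return squareform(m)


# Example data with all six pairs for four objects.
obj_distances = {
    ('obj1', 'obj2'): 2.0, ('obj1', 'obj3'): 1.95, ('obj1', 'obj4'): 2.5,
    ('obj2', 'obj3'): 1.8, ('obj2', 'obj4'): 2.1, ('obj3', 'obj4'): 1.58,
}
unique_ids = sorted({x for pair in obj_distances for x in pair})
df = pd.DataFrame(index=unique_ids, columns=unique_ids)
for (a, b), v in obj_distances.items():
    df.loc[a, b] = v
    df.loc[b, a] = v

condensed = to_condensed(df)
print(condensed)
```

With the four-pair dictionary from the question, the helper would raise instead, since the distances for (obj1, obj4) and (obj2, obj4) are not known.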

Answer 2: (score: 1)

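You say you will use scipy for clustering, so I assume that means you will use the function scipy.cluster.hierarchy.linkage. linkage accepts distance data in "condensed" form, so you don't have to create the full symmetric distance matrix. (See, for example, How does condensed distance matrix work? (pdist), for a discussion of the condensed format.)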
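To make the condensed format concrete: for n objects it is a flat vector of the n*(n-1)/2 upper-triangle distances, listed row by row, which is the same order that `itertools.combinations` produces for the sorted labels. The sketch below uses made-up example values for four objects and shows the pair that each entry corresponds to, along with the round trip to a square matrix via `scipy.spatial.distance.squareform`.

```python
from itertools import combinations

import numpy as np
from scipy.spatial.distance import squareform

labels = ['obj1', 'obj2', 'obj3', 'obj4']

# Example condensed vector: one distance per unordered pair, 4 * 3 / 2 = 6 values.
condensed = np.array([2.0, 1.95, 2.5, 1.8, 2.1, 1.58])

# The i-th entry belongs to the i-th pair in combinations order.
pairs = list(combinations(labels, 2))
for pair, d in zip(pairs, condensed):
    print(pair, d)

# squareform expands the condensed vector into the full symmetric matrix.
square = squareform(condensed)
print(square)
```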

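So all you have to do is get obj_distances.values() into a known order and pass it to linkage. Note that the condensed form needs a distance for every pair of objects, so the example dictionary below contains all six pairs for the four objects. The following snippet does just that: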

from scipy.cluster.hierarchy import linkage, dendrogram

obj_distances = {
    ('obj2', 'obj3'): 1.8,
    ('obj3', 'obj1'): 1.95,
    ('obj1', 'obj4'): 2.5,
    ('obj1', 'obj2'): 2.0,
    ('obj4', 'obj2'): 2.1,
    ('obj3', 'obj4'): 1.58,
}

# Put each key pair in a canonical order, so we know that if (a, b) is a key,
# then a < b.  If this is already true, then the next three lines can be
# replaced with
#     sorted_keys, distances = zip(*sorted(obj_distances.items()))
# Note: we assume there are no keys where the two objects are the same.
keys = [sorted(k) for k in obj_distances.keys()]
values = obj_distances.values()
sorted_keys, distances = zip(*sorted(zip(keys, values)))

# linkage accepts the "condensed" format of the distances.
Z = linkage(distances)

# Optional: create a sorted list of the objects.
labels = sorted(set([key[0] for key in sorted_keys] + [sorted_keys[-1][-1]]))

dendrogram(Z, labels=labels)

The resulting dendrogram:

[image: dendrogram]
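If flat cluster labels are wanted rather than a plot, `scipy.cluster.hierarchy.fcluster` can cut the linkage tree. The sketch below is an assumption about what a follow-up step might look like, not part of the answer above: it reuses the same six distances in condensed order, checks that the number of distances matches the number of objects, and asks for at most two clusters. The choice of two clusters and the name `assignment` are arbitrary illustrations.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

labels = ['obj1', 'obj2', 'obj3', 'obj4']

# Distances in condensed order: (1,2), (1,3), (1,4), (2,3), (2,4), (3,4).
distances = np.array([2.0, 1.95, 2.5, 1.8, 2.1, 1.58])

# A condensed vector for n objects must hold exactly n * (n - 1) / 2 values.
n = len(labels)
if len(distances) != n * (n - 1) // 2:
    raise ValueError("a distance is required for every pair of objects")

# 'single' is linkage's default method; it is spelled out here for clarity.
Z = linkage(distances, method='single')

# Cut the tree so that at most two flat clusters remain.
cluster_ids = fcluster(Z, t=2, criterion='maxclust')
assignment = dict(zip(labels, cluster_ids))
print(assignment)
```

With single linkage, obj3 and obj4 merge first (distance 1.58), obj2 joins them next (1.8), and obj1 joins last (1.95), so cutting into two clusters separates obj1 from the other three.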