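I am trying to convert a dictionary into a distance matrix, which I then want to use as input to hierarchical clustering. My input is a dictionary in which:
key: a tuple of two object names
value: the actual distance between those two objects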
for k, v in obj_distances.items():
    print(k, v)
which prints:
('obj1', 'obj2') 2.0
('obj3', 'obj4') 1.58
('obj1', 'obj3') 1.95
('obj2', 'obj3') 1.8
My question is: how can I convert this into a distance matrix that I can later use for clustering in scipy?
Answer 0 (score: 3)
Use pandas and unstack the DataFrame:
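import pandas as pd

data = {('obj1', 'obj2'): 2.0,
        ('obj3', 'obj4'): 1.58,
        ('obj1', 'obj3'): 1.95,
        ('obj2', 'obj3'): 1.80}
# One row per pair, with the tuple keys as the index
df = pd.DataFrame.from_dict(data, orient='index')
# Turn the tuple index into a two-level MultiIndex
df.index = pd.MultiIndex.from_tuples(df.index.tolist())
# Pivot the second level into columns and take the underlying array
dist_matrix = df.unstack().values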
This yields:
In [15]: dist_matrix
Out[15]:
array([[2.  , 1.95,  nan],
       [ nan, 1.8 ,  nan],
       [ nan,  nan, 1.58]])
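Note that the array above has rows obj1 to obj3 and columns obj2 to obj4, so it is not a square, symmetric matrix. If a square one is needed, the following is a minimal sketch of one way to get there. It assumes that missing pairs should stay NaN and that the diagonal should be 0; the names `labels`, `square` and `arr` are illustrative.

```python
import numpy as np
import pandas as pd

data = {('obj1', 'obj2'): 2.0,
        ('obj3', 'obj4'): 1.58,
        ('obj1', 'obj3'): 1.95,
        ('obj2', 'obj3'): 1.80}

# Series indexed by a two-level MultiIndex built from the tuple keys
s = pd.Series(list(data.values()),
              index=pd.MultiIndex.from_tuples(list(data.keys())))
wide = s.unstack()  # rows: first object, columns: second object

# Use the same sorted labels on both axes so the frame becomes square
labels = sorted(set(wide.index) | set(wide.columns))
square = wide.reindex(index=labels, columns=labels)

# Fill each missing cell from its mirror image across the diagonal
square = square.combine_first(square.T)

# Assumption: the distance from an object to itself is 0
arr = square.to_numpy(copy=True)
np.fill_diagonal(arr, 0.0)
```

Pairs that never appear in the dictionary (here obj1/obj4 and obj2/obj4) remain NaN and have to be dealt with before clustering.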
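Answer 1 (score: 1)
This is slower than the other answer posted, but it ensures that the values both above and below the main diagonal are included, if that matters to you: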
import pandas as pd

# All object names that appear in any key, in sorted order
unique_ids = sorted({x for pair in obj_distances for x in pair})
df = pd.DataFrame(index=unique_ids, columns=unique_ids)
for k, v in obj_distances.items():
    # Write each distance on both sides of the diagonal
    df.loc[k[0], k[1]] = v
    df.loc[k[1], k[0]] = v
Result:
      obj1 obj2  obj3  obj4
obj1   NaN    2  1.95   NaN
obj2     2  NaN   1.8   NaN
obj3  1.95  1.8   NaN  1.58
obj4   NaN  NaN  1.58   NaN
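If the goal is to hand such a frame to scipy, one possible route is `scipy.spatial.distance.squareform`, which converts a square, symmetric matrix with a zero diagonal into the condensed vector that `linkage` takes. The sketch below assumes every pair has a distance, so it uses six illustrative values (the question's data has no entry for obj1/obj4 or obj2/obj4), and it initialises the frame with 0.0 instead of NaN so the diagonal is already valid.

```python
import pandas as pd
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Illustrative, complete set of pairwise distances (an assumption)
obj_distances = {
    ('obj1', 'obj2'): 2.0,
    ('obj1', 'obj3'): 1.95,
    ('obj1', 'obj4'): 2.5,
    ('obj2', 'obj3'): 1.8,
    ('obj2', 'obj4'): 2.1,
    ('obj3', 'obj4'): 1.58,
}

unique_ids = sorted({x for pair in obj_distances for x in pair})
# Start from zeros so the diagonal is 0 and the dtype is float
df = pd.DataFrame(0.0, index=unique_ids, columns=unique_ids)
for (a, b), d in obj_distances.items():
    df.loc[a, b] = d
    df.loc[b, a] = d

# squareform checks symmetry and the zero diagonal by default
condensed = squareform(df.to_numpy())
Z = linkage(condensed, method='single')
```

`method='single'` is only spelled out for clarity; it is also the default linkage method.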
Answer 2 (score: 1)
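You say that you will use scipy for clustering, so I assume that means you will use the function scipy.cluster.hierarchy.linkage. linkage accepts the distance data in "condensed" form, so you don't have to create the full symmetric distance matrix. (See, for example, How does condensed distance matrix work? (pdist) for a discussion of the condensed format.)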
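To make the condensed layout concrete, here is a small sketch. The condensed vector lists the upper triangle of the square matrix row by row, which is the same order in which `itertools.combinations` enumerates pairs of the sorted labels. The label names, the placeholder distances 1.0 to 6.0 and the helper `condensed_index` are illustrative assumptions, not part of scipy.

```python
from itertools import combinations

import numpy as np
from scipy.spatial.distance import squareform

labels = ['obj1', 'obj2', 'obj3', 'obj4']  # illustrative names
n = len(labels)

# Pair order of the condensed vector: (0,1), (0,2), (0,3), (1,2), (1,3), (2,3)
pairs = list(combinations(labels, 2))

def condensed_index(i, j, n):
    """Position of the pair (i, j), with i != j, in a condensed vector for n objects."""
    if i > j:
        i, j = j, i
    return n * i - i * (i + 1) // 2 + (j - i - 1)

# Placeholder distances, one per pair, so each entry is easy to trace
condensed = np.arange(1.0, len(pairs) + 1)
square = squareform(condensed)  # expand to the full symmetric matrix
```

With four objects the condensed vector has 4 * 3 / 2 = 6 entries, and `square[i, j]` equals `condensed[condensed_index(i, j, n)]` for every off-diagonal cell.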
So all you have to do is get obj_distances.values() into a known order and pass that to linkage. That is what the following snippet does:
from scipy.cluster.hierarchy import linkage, dendrogram
obj_distances = {
    ('obj2', 'obj3'): 1.8,
    ('obj3', 'obj1'): 1.95,
    ('obj1', 'obj4'): 2.5,
    ('obj1', 'obj2'): 2.0,
    ('obj4', 'obj2'): 2.1,
    ('obj3', 'obj4'): 1.58,
}
# Put each key pair in a canonical order, so we know that if (a, b) is a key,
# then a < b. If this is already true, then the next three lines can be
# replaced with
# sorted_keys, distances = zip(*sorted(obj_distances.items()))
# Note: we assume there are no keys where the two objects are the same.
keys = [sorted(k) for k in obj_distances.keys()]
values = obj_distances.values()
sorted_keys, distances = zip(*sorted(zip(keys, values)))
# linkage accepts the "condensed" format of the distances.
Z = linkage(distances)
# Optional: create a sorted list of the objects.
labels = sorted(set([key[0] for key in sorted_keys] + [sorted_keys[-1][-1]]))
dendrogram(Z, labels=labels)
[Figure: the resulting dendrogram]
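The condensed format needs one distance for every unordered pair, that is n * (n - 1) / 2 values for n objects. The dictionary in the question has only four of the six pairs for its four objects, so it is worth checking for gaps before calling linkage. A minimal sketch of such a check follows; `missing_pairs` is a hypothetical helper name, not a scipy function.

```python
from itertools import combinations

def missing_pairs(obj_distances):
    """Return the unordered pairs of objects that have no distance entry."""
    # frozenset makes ('a', 'b') and ('b', 'a') compare equal
    present = {frozenset(k) for k in obj_distances}
    objects = sorted({name for k in obj_distances for name in k})
    return [p for p in combinations(objects, 2) if frozenset(p) not in present]

# The dictionary from the question
question_distances = {
    ('obj1', 'obj2'): 2.0,
    ('obj3', 'obj4'): 1.58,
    ('obj1', 'obj3'): 1.95,
    ('obj2', 'obj3'): 1.80,
}
gaps = missing_pairs(question_distances)
```

Here `gaps` contains the obj1/obj4 and obj2/obj4 pairs, which would need distances before the values can be passed to linkage in condensed form.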