I built a dendrogram with scipy.cluster.hierarchy.dendrogram from the following generated data:
import numpy as np
a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[100,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[50,])
c = np.random.multivariate_normal([8, 2], [[3, 1], [1, 4]], size=[80,])
X = np.concatenate((a, b, c),)
Compute the linkage matrix:
from scipy.cluster.hierarchy import dendrogram, linkage
Z = linkage(X, 'ward')
Then plot the dendrogram:
dendrogram(
    Z,
    truncate_mode='lastp',  # show only the last p merged clusters
    p=5,  # number of clusters kept in the truncated plot
    show_leaf_counts=False,  # if True, truncated leaves are labeled with their observation count in parentheses
    leaf_rotation=90.,
    leaf_font_size=12.,
    show_contracted=True,  # to get a distribution impression in truncated branches
)
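My data has 230 observations in total, and they are split into p = 5 clusters. For each cluster, I would like a list of the row indices of all observations it contains. I would also like to know the hierarchy above these 5 clusters.
Thanks!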
Answer 0 (score: 0)
I am new to clustering and dendrograms, so please point out any mistakes.
import pandas as pd

# put X in a dataframe
df = pd.DataFrame()
df['col1'] = X[:, 0]
df['col2'] = X[:, 1]
# name each observation 'A0', 'A1', ... after its row position in X
index = []
for i in range(len(X)):
    elem = 'A' + str(i)
    index.append(elem)
df['index'] = index
print(df.shape)
df.head()
import matplotlib.pyplot as plt

Z = linkage(X, 'ward')
dendrogram(
    Z,
    truncate_mode='lastp',  # show only the last p merged clusters
    p=5,  # number of clusters kept in the truncated plot
    show_leaf_counts=True,  # label truncated leaves with their observation count in parentheses
    leaf_rotation=90.,
    leaf_font_size=12.,
    show_contracted=True,  # to get a distribution impression in truncated branches
)
plt.show()
from scipy.cluster.hierarchy import fcluster

# retrieve elements in each cluster
label = fcluster(Z, 5, criterion='maxclust')  # one flat cluster label per row of X
df_clst = pd.DataFrame()
df_clst['index'] = df['index']
df_clst['label'] = label
# print them
for i in range(5):
    elements = df_clst[df_clst['label'] == i + 1]['index'].tolist()
    size = len(elements)
    print('\n Cluster {}: N = {} {}'.format(i + 1, size, elements))
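The question also asks for plain row indices and for the hierarchy above the 5 clusters, which the code above does not cover. Here is a minimal sketch of one way to get both. It is not the only approach. The fixed seed and the names rows_by_label, leaf_ids and rows_by_node are my own additions for illustration. It assumes Ward linkage with no tied merge distances, in which case the fcluster labels should describe the same 5 groups as the leaves of the truncated dendrogram.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage, to_tree

# Same synthetic data as in the question; the seed is an assumption added for reproducibility.
rng = np.random.default_rng(0)
cov = [[3, 1], [1, 4]]
a = rng.multivariate_normal([10, 0], cov, size=100)
b = rng.multivariate_normal([0, 20], cov, size=50)
c = rng.multivariate_normal([8, 2], cov, size=80)
X = np.concatenate((a, b, c))

n, p = len(X), 5
Z = linkage(X, 'ward')

# 1) Integer row indices of X for each flat cluster label.
labels = fcluster(Z, p, criterion='maxclust')
rows_by_label = {k: np.flatnonzero(labels == k).tolist() for k in range(1, p + 1)}

# 2) Node ids of the p leaves of the truncated dendrogram, left to right.
#    no_plot=True only computes the layout, so matplotlib is not needed here.
leaf_ids = dendrogram(Z, truncate_mode='lastp', p=p, no_plot=True)['leaves']

# to_tree(rd=True) also returns a list in which nodelist[i] is the node with id i;
# pre_order() lists the original observations (rows of X) below a node.
root, nodelist = to_tree(Z, rd=True)
rows_by_node = {i: sorted(nodelist[i].pre_order()) for i in leaf_ids}

# 3) Hierarchy above the p clusters: the last p - 1 rows of Z.
#    Row r of Z creates node n + r by merging the two nodes in columns 0 and 1.
for r in range(n - p, n - 1):
    left, right, dist, count = Z[r]
    print(f'node {n + r} = {int(left)} + {int(right)} '
          f'(distance {dist:.2f}, {int(count)} rows)')
```

The keys of rows_by_node are the node ids that appear in the printed merges, so the two outputs can be read together: every id in a merge line is either one of the 5 leaf clusters or a node created by an earlier printed line.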