Hierarchical clustering with pandas gives a ValueError

Time: 2017-05-03 02:54:03

Tags: python pandas text hierarchical-clustering

I am using Python to cluster text documents, which I have as a data frame. This is what I am doing:

from __future__ import division

import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, is_valid_linkage, linkage, ward
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# data_rd is the data frame holding the documents in its 'text' column
data_lst = data_rd['text'].values.tolist()
tfidf_vectorizer = TfidfVectorizer(max_features=200000, stop_words='english',
                                   use_idf=True, tokenizer=lambda x: x.split(' '),
                                   ngram_range=(1, 3))

tfidf_matrix = tfidf_vectorizer.fit_transform(data_lst)
print(tfidf_matrix.shape)
# (10193, 32757)

# Cosine distance between every pair of documents
dist = 1 - cosine_similarity(tfidf_matrix)

linkage_dist = ward(dist)
linkage_matrix = linkage(tfidf_matrix.todense(), 'ward')

dendrogram(linkage_matrix, truncate_mode="lastp", p=40,
           show_leaf_counts=True, leaf_rotation=60., leaf_font_size=8.,
           show_contracted=True)
is_valid_linkage(linkage_matrix)
# False
is_valid_linkage(linkage_dist)
# False

I get this error:

 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/usr/lib64/python2.6/site-packages/scipy/cluster/hierarchy.py", line 2227, in dendrogram
     is_valid_linkage(Z, throw=True, name='Z')
   File "/usr/lib64/python2.6/site-packages/scipy/cluster/hierarchy.py", line 1421, in is_valid_linkage
     % name_str)
 ValueError: Linkage 'Z' uses the same cluster more than once.
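
One way to narrow down an error like this is to check what `linkage` actually received. The sketch below is only an illustration under stated assumptions: a small random matrix stands in for the TF-IDF data, and the helper name `checked_linkage` is hypothetical, not part of SciPy. It relies on two documented SciPy behaviours: `linkage` treats a 2-D array as raw observations rather than as a square distance matrix, so a precomputed square matrix has to be converted to condensed form with `squareform` first, and `is_valid_linkage` accepts `throw=True` to report why a linkage is rejected. Whether non-finite or asymmetric distances are the cause of the error above is not established; this is just a cheap thing to rule out.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, is_valid_linkage
from scipy.spatial.distance import pdist, squareform


def checked_linkage(square_dist, method="average"):
    """Build a linkage matrix from a square distance matrix, with sanity checks.

    Hypothetical helper for illustration; not part of SciPy.
    """
    square_dist = np.asarray(square_dist, dtype=float)
    if not np.isfinite(square_dist).all():
        raise ValueError("distance matrix contains NaN or inf")
    # Rounding in `1 - cosine_similarity` can leave tiny negative values.
    square_dist = np.clip(square_dist, 0.0, None)
    # Force exact symmetry and a zero diagonal before condensing.
    square_dist = (square_dist + square_dist.T) / 2.0
    np.fill_diagonal(square_dist, 0.0)
    # Condensed form: the upper triangle flattened into a 1-D vector.
    condensed = squareform(square_dist, checks=False)
    Z = linkage(condensed, method=method)
    # Raises with an explanatory message if Z is malformed.
    is_valid_linkage(Z, throw=True, name="Z")
    return Z


# Toy data standing in for the TF-IDF matrix (assumption: 20 documents, 5 features).
rng = np.random.RandomState(0)
X = rng.rand(20, 5)
square = squareform(pdist(X, metric="cosine"))
Z = checked_linkage(square)
print(Z.shape)  # (19, 4)
```

The sketch defaults to `method="average"` because the SciPy documentation notes that `'ward'`, `'centroid'` and `'median'` are correctly defined only for Euclidean distances, which cosine distances are not.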

Is there any way to solve this other than using fastcluster, and why does this happen? One row in the column is blank, with no text.
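
Since one row is blank, a cheap experiment is to remove empty documents before vectorizing and see whether the error persists. The following is a minimal sketch, not a confirmed fix: the helper `drop_blank_documents` is a hypothetical name, and the small `data_rd` frame is a toy stand-in for the real data.

```python
import pandas as pd


def drop_blank_documents(df, column="text"):
    """Return a copy of `df` without rows whose `column` is missing or whitespace-only.

    Hypothetical helper for illustration.
    """
    text = df[column].fillna("").astype(str).str.strip()
    return df.loc[text != ""].copy()


# Toy stand-in for the `data_rd` frame from the question.
data_rd = pd.DataFrame(
    {"text": ["red apple pie", "", "green apple tart", None, "   "]}
)
cleaned = drop_blank_documents(data_rd)
print(len(data_rd), len(cleaned))  # 5 2

# The cleaned list can then be passed to the vectorizer as before.
data_lst = cleaned["text"].values.tolist()
```

If the error disappears after this step, the blank row was at least a contributing factor; if not, the cause lies elsewhere.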

0 answers:

No answers yet