I am using Python to cluster text documents, which I have in a DataFrame. This is what I am doing:
from __future__ import division
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.cluster.hierarchy import dendrogram, linkage, ward, is_valid_linkage
import numpy as np
import pandas as pd
data_lst = data_rd['text'].values.tolist()
tfidf_vectorizer = TfidfVectorizer(max_features=200000, stop_words='english', use_idf=True,
                                   tokenizer=lambda x: x.split(' '), ngram_range=(1, 3))
tfidf_matrix = tfidf_vectorizer.fit_transform(data_lst)
print(tfidf_matrix.shape)
# (10193, 32757)
from sklearn.metrics.pairwise import cosine_similarity
dist = 1 - cosine_similarity(tfidf_matrix)
linkage_dist = ward(dist)
linkage_matrix = linkage(tfidf_matrix.todense(), 'ward')
dendrogram(linkage_matrix, truncate_mode="lastp", p=40,
           show_leaf_counts=True, leaf_rotation=60., leaf_font_size=8.,
           show_contracted=True)
is_valid_linkage(linkage_matrix)
is_valid_linkage(linkage_dist)
#False
#False
I get this error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.6/site-packages/scipy/cluster/hierarchy.py", line 2227, in dendrogram
    is_valid_linkage(Z, throw=True, name='Z')
  File "/usr/lib64/python2.6/site-packages/scipy/cluster/hierarchy.py", line 1421, in is_valid_linkage
    % name_str)
ValueError: Linkage 'Z' uses the same cluster more than once.
Is there any way to solve this other than using fastcluster, and why does it happen? One row in the column is blank and contains no text.
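
Since one row is blank, one thing that may be worth ruling out is whether an all-zero (or non-finite) feature row is involved. The following is only a minimal diagnostic sketch, not a confirmed fix: it uses a small toy array in place of the real TF-IDF matrix, and `drop_empty_rows` is a hypothetical helper name introduced here for illustration. It filters out all-zero and non-finite rows before calling `linkage`, then checks the result with `is_valid_linkage`.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, is_valid_linkage


def drop_empty_rows(features):
    """Hypothetical helper: drop rows that are all zero or contain NaN/inf.

    Returns the filtered array and the indices of the rows that were kept.
    """
    features = np.asarray(features, dtype=float)
    finite = np.isfinite(features).all(axis=1)
    # Treat non-finite entries as 0 here so the sum itself stays finite.
    magnitude = np.abs(np.where(np.isfinite(features), features, 0.0)).sum(axis=1)
    keep = finite & (magnitude > 0)
    return features[keep], np.flatnonzero(keep)


# Toy stand-in for a dense TF-IDF matrix (assumption: not the real data).
# Row 2 mimics a blank document whose feature vector is all zeros.
toy = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.1, 0.9],
    [0.0, 0.2, 0.8],
])

clean, kept = drop_empty_rows(toy)
Z = linkage(clean, method="ward")

print(kept)                 # indices of the rows that survived the filter
print(is_valid_linkage(Z))  # whether SciPy considers the linkage well-formed
```

With the real data, the same check could be applied to `tfidf_matrix.toarray()` before `linkage`; if the error persists after filtering, the blank row is probably not the cause.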