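I took a set of documents, computed tf * idf for every token across all of them, and built a vector for each document (each n-dimensional, where n is the number of unique words in the corpus). I can't figure out how to create clusters from these vectors using sklearn.cluster.MeanShift.

Answer 0 (score: 1)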
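TfidfVectorizer converts the documents into a "sparse matrix" of numbers. MeanShift requires the data passed to it to be "dense". Below, I show how to do the conversion inside a pipeline (credit). If memory allows, though, you can simply convert the sparse matrix to a dense one with toarray() or todense().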
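For the non-pipeline route mentioned above, here is a minimal sketch of the direct conversion. The variable names (`docs`, `X_sparse`, `X_dense`) are illustrative only. It uses toarray() rather than todense() because toarray() returns a plain NumPy array, whereas todense() returns a numpy.matrix, which newer scikit-learn releases refuse to accept.

```python
from scipy.sparse import issparse
from sklearn.cluster import MeanShift
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative corpus (same sentences as the pipeline example).
docs = ['this is document one',
        'this is document two',
        'document one is fun',
        'document two is mean',
        'document is really short',
        'how fun is document one?',
        'mean shift... what is that']

# fit_transform returns a SciPy sparse matrix of TF-IDF weights.
X_sparse = TfidfVectorizer().fit_transform(docs)
print(issparse(X_sparse), X_sparse.shape)

# Densify: fine for small corpora, but memory grows as n_docs * n_terms.
X_dense = X_sparse.toarray()

# MeanShift accepts the dense array; the bandwidth is estimated automatically.
labels = MeanShift().fit_predict(X_dense)
for label, doc in zip(labels, docs):
    print(label, doc)
```

For a large vocabulary the dense array can become very large, so this only suits corpora that fit comfortably in memory.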
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import MeanShift
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

documents = ['this is document one',
             'this is document two',
             'document one is fun',
             'document two is mean',
             'document is really short',
             'how fun is document one?',
             'mean shift... what is that']

pipeline = Pipeline(
    steps=[
        # Text -> sparse TF-IDF matrix
        ('tfidf', TfidfVectorizer()),
        # MeanShift needs dense input. toarray() returns a plain ndarray;
        # todense() returns numpy.matrix, which newer scikit-learn rejects.
        ('trans', FunctionTransformer(lambda x: x.toarray(), accept_sparse=True)),
        ('clust', MeanShift())
    ])

pipeline.fit(documents)
labels = pipeline.named_steps['clust'].labels_

# Pair each document with its cluster label and print them grouped by label
result = [(label, doc) for doc, label in zip(documents, labels)]
for label, doc in sorted(result):
    print(label, doc)
This prints:
0 document two is mean
0 this is document one
0 this is document two
1 document one is fun
1 how fun is document one?
2 mean shift... what is that
3 document is really short
You can tune the "hyperparameters", but this gave me a rough idea.
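As a sketch of what that tuning might look like: the main MeanShift hyperparameter is `bandwidth`, and scikit-learn provides `estimate_bandwidth` to derive one from the data. The quantile values below and the `cluster_counts` dictionary are arbitrary choices for illustration, not recommendations from the answer.

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ['this is document one',
             'this is document two',
             'document one is fun',
             'document two is mean',
             'document is really short',
             'how fun is document one?',
             'mean shift... what is that']

# Dense TF-IDF matrix, one row per document.
X = TfidfVectorizer().fit_transform(documents).toarray()

cluster_counts = {}  # quantile -> number of clusters found
for quantile in (0.3, 0.5, 0.8):  # illustrative values only
    # A larger quantile yields a larger bandwidth (a wider kernel).
    bandwidth = estimate_bandwidth(X, quantile=quantile)
    if bandwidth <= 0:
        # MeanShift requires a strictly positive bandwidth.
        continue
    labels = MeanShift(bandwidth=bandwidth).fit_predict(X)
    cluster_counts[quantile] = len(set(labels))
    print(f'quantile={quantile} bandwidth={bandwidth:.3f} '
          f'clusters={cluster_counts[quantile]}')
```

A wider bandwidth generally merges more documents into the same cluster, so comparing the cluster counts across a few values is a quick way to see how sensitive the grouping is.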