Document clustering in Python

Date: 2014-11-26 05:13:01

Tags: python-3.x scipy scikit-learn k-means

I'm new to Python and scikit-learn. I want to cluster text files (the bodies of news articles), and I'm using the following code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function
import nltk, sklearn, string, os
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Preprocessing text with NLTK package
token_dict = {}
stemmer = PorterStemmer()

def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems
###########################################################################
# Loading and preprocessing text data
print("\n Loading text dataset:")
path = 'n'  # directory containing the text files

for subdir, dirs, files in os.walk(path):
    for i, f in enumerate(files):
        if f != '.DS_Store':
            file_path = subdir + os.path.sep + f
            with open(file_path, 'r') as shakes:
                text = shakes.read()
            lowers = text.lower()
            # In Python 3, str.translate needs a translation table;
            # str.maketrans('', '', string.punctuation) deletes punctuation.
            no_punctuation = lowers.translate(
                str.maketrans('', '', string.punctuation))
            token_dict[f] = no_punctuation
###########################################################################
true_k = 3  # number of clusters
print("\n Performing stemming and tokenization...")
vectorizer = TfidfVectorizer(tokenizer=tokenize, encoding='latin-1',
                              stop_words='english')
X = vectorizer.fit_transform(token_dict.values())
print("n_samples: %d, n_features: %d" % X.shape)
print()
###############################################################################
# Do the actual clustering
km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
km.fit(X)
print(km)

print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(true_k):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()

This code gets the top terms per cluster. But how do I know which documents they come from, i.e. which of the original text files belong to cluster 0, cluster 1, or cluster 2?

1 Answer:

Answer 0 (score: 1)

To spell it out: you can store the cluster assignments with:

clusters = km.labels_.tolist()

This list is in the same order as the dict you passed to your vectorizer, i.e. the iteration order of token_dict.values().
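For example, to see which file landed in which cluster, you can zip the filenames (the keys of token_dict) with the labels. This is a minimal sketch, assuming token_dict and km are the objects from the question's code; files_by_cluster is just an illustrative name. It relies on token_dict.keys() and token_dict.values() iterating in the same order, which Python guarantees as long as the dict is not modified in between:

from collections import defaultdict

# km.labels_ is ordered like the rows of X, i.e. like token_dict.values()
clusters = km.labels_.tolist()

# Print each filename with its cluster label
for filename, cluster in zip(token_dict.keys(), clusters):
    print("%s -> cluster %d" % (filename, cluster))

# Or group the filenames by cluster
files_by_cluster = defaultdict(list)
for filename, cluster in zip(token_dict.keys(), clusters):
    files_by_cluster[cluster].append(filename)
for c in sorted(files_by_cluster):
    print("Cluster %d: %s" % (c, ", ".join(files_by_cluster[c])))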

I just put together a guide to document clustering that you might find useful. Let me know if I can explain anything in more detail: http://brandonrose.org/clustering