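I keep running into errors when using TfidfVectorizer for k-means clustering.
There are three cases:
First, I used the tokenizer parameter of TfidfVectorizer to customize the tokenization of my dataset. Here is my code: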
vectorizer = TfidfVectorizer(stop_words=stops,tokenizer=tokenize)
X = vectorizer.fit_transform(titles)
However, I got this error:
ValueError: empty vocabulary; perhaps the documents only contain stop words
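For reference, a minimal sketch of the contract the `tokenizer` argument is expected to follow: a callable that takes one document string and returns a list of token strings. The `tokenize` body, the `stops` list, and the sample `titles` below are hypothetical stand-ins, not the originals. If the callable returns no tokens for every document (or only stop words), the vocabulary ends up empty, which matches the message above.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for the real data (assumptions, not the original values).
titles = [
    "Rancang bangun aplikasi berbasis web",
    "Implementasi algoritma klasifikasi data",
    "Segmentasi citra menggunakan metode warna",
]
stops = ["menggunakan", "berbasis"]

def tokenize(text):
    """Return a list of lowercase word tokens for one document."""
    return re.findall(r"[a-z]+", text.lower())

# A tokenizer that returns a list of strings yields a non-empty vocabulary.
vectorizer = TfidfVectorizer(stop_words=stops, tokenizer=tokenize)
X = vectorizer.fit_transform(titles)
print(X.shape)

def broken_tokenize(text):
    # Deliberately wrong: producing no tokens leaves the vocabulary empty.
    return []

message = None
try:
    TfidfVectorizer(tokenizer=broken_tokenize).fit_transform(titles)
except ValueError as err:
    message = str(err)
    print(message)
```

Calling `tokenize(titles[0])` on its own and checking that it returns a non-empty list of strings is a quick way to rule this cause in or out.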
So I built a vocabulary containing the tokenization results, and the code became:
vectorizer = TfidfVectorizer(stop_words=stops,tokenizer=tokenize,vocabulary=vocab)
But then I got a new error:
ValueError: Vocabulary contains repeated indices.
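This message is raised when `vocabulary` is a mapping in which two terms share the same column index. A minimal sketch, assuming the vocabulary is built from tokenizer output: collect each term once, then number the terms consecutively. `tokenized_titles` is a hypothetical stand-in for the real tokenization results.

```python
# Hypothetical tokenizer output for three titles (not the original data).
tokenized_titles = [
    ["aplikasi", "web", "sistem"],
    ["sistem", "informasi", "web"],
    ["aplikasi", "android"],
]

# Collect each term once, then assign consecutive, unique column indices.
unique_terms = sorted({term for tokens in tokenized_titles for term in tokens})
vocab = {term: index for index, term in enumerate(unique_terms)}

# Every index must appear exactly once and cover 0..len(vocab)-1.
assert sorted(vocab.values()) == list(range(len(vocab)))
print(vocab)
```

A dict built this way can be passed as `vocabulary=vocab`; a plain list with no repeated terms is accepted as well.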
Finally, I removed the tokenizer and vocabulary parameters. The code became:
vectorizer = TfidfVectorizer(stop_words=stops)
X = vectorizer.fit_transform(titles)
terms = vectorizer.get_feature_names()
true_k = 8
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)
print "Top terms per cluster:"
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
for i in range(true_k):
    print "Cluster %d:" % i,
    for ind in order_centroids[i, :10]:
        print ' %s' % terms[ind],
    print
Well, the program runs successfully, but the clustering result looks like this:
Cluster 0: bangun, rancang, lunak, perangkat, aplikasi, berbasis, menggunakan, service, sistem, pembangunan,
Cluster 1: sistem, aplikasi, berbasis, web, menggunakan, pembuatan, mobile, informasi, teknologi, pengembangan,
Cluster 2: android, berbasis, aplikasi, perangkat, rancang, bangun, bergerak, mobile, sosial, menggunakan,
Cluster 3: implementasi, algoritma, menggunakan, klasifikasi, data, game, fuzzy, vector, support, machine,
Cluster 4: metode, menggunakan, video, penerapan, implementasi, steganografi, pengenalan, berbasis, file, analisis,
Cluster 5: citra, segmentasi, menggunakan, implementasi, metode, warna, tekstur, kembali, berwarna, temu,
Cluster 6: jaringan, tiruan, protokol, voip, syaraf, saraf, menggunakan, implementasi, kinerja, streaming,
Cluster 7: studi, kasus, its, informatika, teknik, sistem, informasi, data, surabaya, jurusan,
Some terms show up in more than one cluster; for example, the term data appears in both cluster 3 and cluster 7.
Could you tell me how to use TfidfVectorizer and KMeans properly? Your help is my happiness :))