Clustering with word2vec and KMeans

Time: 2018-07-30 09:07:08

Tags: python python-3.x cluster-analysis k-means word2vec

I am trying to cluster with word2vec and KMeans, but it is not working.

Here is a sample of my data:

demain fera chaud à paris pas marseille
mauvais exemple ce n est pas un cliché mais il faut comprendre pourquoi aussi
il y a plus de travail à Paris c est d ailleurs pour cette raison qu autant de gens",
mais s il y a plus de travail, il y a aussi plus de concurrence
s agglutinent autour de la capitale

The script:

import nltk
import pandas
import pprint
import numpy as np
import pandas as pd
from sklearn import cluster
from sklearn import metrics
from gensim.models import Word2Vec
from nltk.cluster import KMeansClusterer
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import NMF

dataset = pandas.read_csv('text.csv', encoding = 'utf-8')

comments = dataset['comments']

no_duplicate = comments.drop_duplicates()  # remove duplicate comments
verbatim_list = no_duplicate.values.tolist()

min_count = 2
size = 50
window = 4

model = Word2Vec(verbatim_list, min_count=min_count, size=size, window=window)

X = model[model.vocab]

clusters_number = 28
kclusterer = KMeansClusterer(clusters_number,  distance=nltk.cluster.util.cosine_distance, repeats=25)

assigned_clusters = kclusterer.cluster(X, assign_clusters=True)

words = list(model.vocab)
for i, word in enumerate(words):  
    print (word + ":" + str(assigned_clusters[i]))

kmeans = cluster.KMeans(n_clusters = clusters_number)
kmeans.fit(X)

labels = kmeans.labels_
centroids = kmeans.cluster_centers_

clusters = {}
for commentaire, label in zip(verbatim_list, labels):
    try:
        clusters[str(label)].append(commentaire)
    except KeyError:
        clusters[str(label)] = [commentaire]
pprint.pprint(clusters)

Output:

Traceback (most recent call last):
  File "kmwv.py", line 37, in <module>
    X = model[model.vocab]
AttributeError: 'Word2Vec' object has no attribute 'vocab'

I need clustering that works with word2vec, but every time I try I get this error. Is there a way to cluster with word2vec?

1 Answer:

Answer 0 (score: 2):

As Davide said, try this:

X = model[model.wv.vocab]
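
In gensim 3.x the vocabulary and vectors live on the model's wv attribute, which is why model.vocab raises the AttributeError above. Below is a minimal sketch of how the fix could slot into the original script, assuming gensim 3.x; the .split() tokenization is an assumption (Word2Vec expects lists of tokens, not raw strings), and the variable names are taken from the question:

from gensim.models import Word2Vec
from sklearn import cluster

# Word2Vec expects tokenized sentences: a list of lists of tokens
sentences = [line.split() for line in verbatim_list]
model = Word2Vec(sentences, min_count=2, size=50, window=4)

words = list(model.wv.vocab)    # the vocabulary now lives on model.wv
X = model.wv[model.wv.vocab]    # one 50-dimensional vector per vocabulary word

kmeans = cluster.KMeans(n_clusters=28)
labels = kmeans.fit_predict(X)

for word, label in zip(words, labels):
    print(word, ":", label)

Note that this clusters the word vectors, so each label corresponds to a vocabulary word, not to a whole comment from verbatim_list. In gensim 4.x, wv.vocab was removed as well: there you would use model.wv.key_to_index and model.wv.vectors instead, and the size parameter is called vector_size.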