How do I use TfidfVectorizer correctly?

Time: 2016-11-08 16:40:46

Tags: python scikit-learn k-means tf-idf

I keep running into errors when I use TfidfVectorizer for k-means clustering.

There are three cases:

  1. I pass the tokenizer parameter to TfidfVectorizer to customize how my dataset is tokenized. Here is my code:

         vectorizer = TfidfVectorizer(stop_words=stops, tokenizer=tokenize)
         X = vectorizer.fit_transform(titles)

     However, I got this error:

         ValueError: empty vocabulary; perhaps the documents only contain stop words
    
  2. I then built a vocabulary from the tokenization results, so the code became:

         vectorizer = TfidfVectorizer(stop_words=stops, tokenizer=tokenize, vocabulary=vocab)

     But then I ran into a new error:

         ValueError: Vocabulary contains repeated indices.
      
  3. Finally, I removed the tokenizer and vocabulary parameters. The code became:

         from sklearn.cluster import KMeans
         from sklearn.feature_extraction.text import TfidfVectorizer

         # stops and titles are defined elsewhere (not shown)
         vectorizer = TfidfVectorizer(stop_words=stops)
         X = vectorizer.fit_transform(titles)
         # On newer scikit-learn releases this method is get_feature_names_out()
         terms = vectorizer.get_feature_names()

         true_k = 8
         model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
         model.fit(X)

         print("Top terms per cluster:")
         # For each centroid, term indices sorted by descending weight
         order_centroids = model.cluster_centers_.argsort()[:, ::-1]
         for i in range(true_k):
             print("Cluster %d:" % i, end="")
             for ind in order_centroids[i, :10]:
                 print(" %s" % terms[ind], end="")
             print()

     Well, the program runs successfully, but the clustering results look like this:

         Cluster 0: bangun, rancang, lunak, perangkat, aplikasi, berbasis, menggunakan, service, sistem, pembangunan
         Cluster 1: sistem, aplikasi, berbasis, web, menggunakan, pembuatan, mobile, informasi, teknologi, pengembangan
         Cluster 2: android, berbasis, aplikasi, perangkat, rancang, bangun, bergerak, mobile, sosial, menggunakan
         Cluster 3: implementasi, algoritma, menggunakan, klasifikasi, data, game, fuzzy, vector, support, machine
         Cluster 4: metode, menggunakan, video, penerapan, implementasi, steganografi, pengenalan, berbasis, file, analisis
         Cluster 5: citra, segmentasi, menggunakan, implementasi, metode, warna, tekstur, kembali, berwarna, temu
         Cluster 6: jaringan, tiruan, protokol, voip, syaraf, saraf, menggunakan, implementasi, kinerja, streaming
         Cluster 7: studi, kasus, its, informatika, teknik, sistem, informasi, data, surabaya, jurusan

Some terms show up in more than one cluster. For example, the term data appears in both cluster 3 and cluster 7.
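One possible reading of this output, sketched under assumptions: KMeans groups documents rather than terms, and every centroid carries a weight for every term in the vocabulary, so the same term can rank among the top terms of several centroids. The toy centroid weights and the helper `top_terms` below are hypothetical and made up for illustration; they are not taken from the data above.

```python
# Hypothetical toy data: 2 centroids over a 4-term vocabulary.
terms = ["data", "klasifikasi", "studi", "kasus"]
centroids = [
    [0.40, 0.60, 0.05, 0.01],  # centroid of cluster 0
    [0.55, 0.02, 0.70, 0.50],  # centroid of cluster 1
]


def top_terms(centroid, terms, n):
    """Return the n terms with the largest weight in one centroid."""
    ranked = sorted(range(len(terms)), key=lambda i: centroid[i], reverse=True)
    return [terms[i] for i in ranked[:n]]


# Rank the terms of each centroid independently of the other centroids.
tops = [top_terms(c, terms, 2) for c in centroids]
for k, names in enumerate(tops):
    print("Cluster %d: %s" % (k, ", ".join(names)))

# Terms that rank in the top 2 of both centroids.
shared = set(tops[0]) & set(tops[1])
print("Shared:", sorted(shared))
```

Because each centroid is ranked on its own, nothing in this procedure prevents a term with a moderately high weight in two centroids from being listed twice.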

Could you tell me how to use TfidfVectorizer and KMeans correctly? Your help would make me very happy :))
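For the first two cases, a quick check outside scikit-learn might help narrow things down. The sketch below is an illustration under assumptions, not the original code: `tokenize`, `stops`, and `titles` here are hypothetical stand-ins for the real ones. It checks that the tokenizer returns at least one non-stop-word token per title, and it builds the vocabulary as a dict whose indices are unique and run from 0 to n-1, which is the shape `TfidfVectorizer` accepts when `vocabulary` is a mapping.

```python
import re

# Hypothetical stand-ins for the real data and helpers.
titles = [
    "Rancang Bangun Aplikasi Berbasis Web",
    "Implementasi Algoritma Klasifikasi Data",
]
stops = {"berbasis"}


def tokenize(text):
    """Return a list of lowercase word tokens (a list of strings)."""
    return re.findall(r"[a-z]+", text.lower())


# Check 1: every title should yield at least one token that is not a stop word.
tokens_per_title = [
    [t for t in tokenize(title) if t not in stops] for title in titles
]
empty_titles = [i for i, toks in enumerate(tokens_per_title) if not toks]
print("Titles with no usable tokens:", empty_titles)

# Check 2: build the vocabulary from unique terms, one index per term.
unique_terms = sorted({t for toks in tokens_per_title for t in toks})
vocab = {term: index for index, term in enumerate(unique_terms)}
print(len(vocab), "terms")

# Indices are unique and cover 0..n-1 with no gaps.
assert sorted(vocab.values()) == list(range(len(vocab)))
```

If both checks pass on the real data, the mapping could then be tried as `vocabulary=vocab`. If the first check reports titles with no usable tokens, the tokenizer or the stop-word list would be the first thing to look at.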

0 answers:

No answers yet