I am basically clustering some of my documents using the mini_batch_kmeans and kmeans algorithms. I am simply following the tutorial on the scikit-learn website, linked here: http://scikit-learn.org/stable/auto_examples/text/document_clustering.html
They use a few methods for vectorization, one of which is HashingVectorizer. For the HashingVectorizer, they create a pipeline with a TfidfTransformer():
# Perform an IDF normalization on the output of HashingVectorizer
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

hasher = HashingVectorizer(n_features=opts.n_features,
                           stop_words='english', non_negative=True,
                           norm=None, binary=False)
vectorizer = make_pipeline(hasher, TfidfTransformer())
Once I do that, the vectorizer I get back has no get_feature_names() method. But since I am using it for clustering, I need to get the "terms" with get_feature_names():
terms = vectorizer.get_feature_names()
for i in range(true_k):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()
How do I resolve this error? I am a bit stuck here; any help would be appreciated.
My entire code is shown below:
X_train_vecs, vectorizer = vector_bow.count_tfidf_vectorizer(_contents)
mini_kmeans_batch = MiniBatchKmeansTechnique()
# MiniBatchKMeans without the LSA dimensionality reduction
mini_kmeans_batch.mini_kmeans_technique(number_cluster=8, X_train_vecs=X_train_vecs,
                                        vectorizer=vectorizer, filenames=_filenames,
                                        contents=_contents, is_dimension_reduced=False)
The count vectorizer with a tfidf pipeline:
def count_tfidf_vectorizer(self, contents):
    count_vect = CountVectorizer()
    vectorizer = make_pipeline(count_vect, TfidfTransformer())
    X_train_vecs = vectorizer.fit_transform(contents)
    print("The count of bow : ", X_train_vecs.shape)
    return X_train_vecs, vectorizer
And the mini_batch_kmeans class is as follows:
from time import time

import pandas as pd
from sklearn.cluster import MiniBatchKMeans


class MiniBatchKmeansTechnique():
    def mini_kmeans_technique(self, number_cluster, X_train_vecs, vectorizer,
                              filenames, contents, svd=None, is_dimension_reduced=True):
        km = MiniBatchKMeans(n_clusters=number_cluster, init='k-means++', max_iter=100, n_init=10,
                             init_size=1000, batch_size=1000, verbose=True, random_state=42)
        print("Clustering sparse data with %s" % km)
        t0 = time()
        km.fit(X_train_vecs)
        print("done in %0.3fs" % (time() - t0))
        print()
        cluster_labels = km.labels_.tolist()
        print("List of the cluster names is : ", cluster_labels)
        data = {'filename': filenames, 'contents': contents, 'cluster_label': cluster_labels}
        frame = pd.DataFrame(data=data, index=[cluster_labels], columns=['filename', 'contents', 'cluster_label'])
        print(frame['cluster_label'].value_counts(sort=True, ascending=False))
        print()
        grouped = frame['cluster_label'].groupby(frame['cluster_label'])
        print(grouped.mean())
        print()
        print("Top Terms Per Cluster :")
        if is_dimension_reduced:
            if svd is not None:
                original_space_centroids = svd.inverse_transform(km.cluster_centers_)
                order_centroids = original_space_centroids.argsort()[:, ::-1]
        else:
            order_centroids = km.cluster_centers_.argsort()[:, ::-1]
        terms = vectorizer.get_feature_names()
        for i in range(number_cluster):
            print("Cluster %d:" % i, end=' ')
            for ind in order_centroids[i, :10]:
                print(' %s' % terms[ind], end=',')
            print()
            print("Cluster %d filenames:" % i, end='')
            for file in frame.loc[i]['filename'].values.tolist():
                print(' %s,' % file, end='')
            print()
A quick reply would be appreciated. Thanks for your time.
Answer 0 (score: 3)
Pipeline does not have a get_feature_names() method, because implementing this method for a Pipeline is not straightforward: all pipeline steps would need to be taken into account to compute the feature names. See https://github.com/scikit-learn/scikit-learn/issues/6424, https://github.com/scikit-learn/scikit-learn/issues/6425, etc. - there are many related tickets and multiple attempts to fix it.
If your pipeline is simple (a TfidfVectorizer followed by MiniBatchKMeans), then you can get the feature names from the TfidfVectorizer.
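A minimal sketch of that simple case (the documents and cluster count here are made up for illustration):

from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["some sample text", "more sample text", "other words entirely"]  # hypothetical corpus

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
km = MiniBatchKMeans(n_clusters=2, random_state=42).fit(X)

# The vectorizer (not km, and not a Pipeline wrapper) holds the vocabulary:
terms = tfidf.get_feature_names()  # renamed get_feature_names_out() in newer scikit-learn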
If you want to use HashingVectorizer, it is more complicated, because HashingVectorizer does not provide feature names by design. HashingVectorizer does not store a vocabulary and uses hashes instead - this means it can be applied in an online setting and that it needs no RAM for a vocabulary - but the tradeoff is exactly that you don't get feature names.
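A tiny sketch illustrating that tradeoff (the toy documents are assumptions):

from sklearn.feature_extraction.text import HashingVectorizer

docs = ["the cat sat on the mat", "the dog ate my homework"]  # hypothetical corpus

vec = HashingVectorizer(n_features=8, norm=None)
X = vec.transform(docs)              # no fit needed: the hasher is stateless
print(X.shape)                       # (2, 8) -- columns are hash buckets, not words
print(hasattr(vec, "vocabulary_"))   # False: no vocabulary is stored, so no names to look up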
It is still possible to get feature names from a HashingVectorizer though; to do this, you need to apply it to a sample of documents and store which hashes correspond to which words, i.e. learn what these hashes mean - what the feature names are. There may be collisions, so it is impossible to be 100% sure the feature names are correct, but usually this approach works OK. This approach is implemented in the eli5 library; see http://eli5.readthedocs.io/en/latest/tutorials/sklearn-text.html#debugging-hashingvectorizer for an example. You have to use InvertableHashingVectorizer and do something like this:
from eli5.sklearn import InvertableHashingVectorizer
ivec = InvertableHashingVectorizer(vec)  # vec is a HashingVectorizer instance
# content_sample is a sample from contents; you can use the
# whole contents array, or just e.g. every 10th element
ivec.fit(content_sample)
hashing_feat_names = ivec.get_feature_names()
You can then use hashing_feat_names as your feature names, because TfidfTransformer does not change the size of the input vector and only scales the same features.
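For example, a sketch of plugging the recovered names into the question's top-terms loop (it assumes ivec was fitted as above and that order_centroids and number_cluster are in scope, as in mini_kmeans_technique):

hashing_feat_names = ivec.get_feature_names()
for i in range(number_cluster):
    # Collisions may map several words to one bucket, so names can be composite.
    top = [str(hashing_feat_names[ind]) for ind in order_centroids[i, :10]]
    print("Cluster %d: %s" % (i, ", ".join(top)))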
Answer 1 (score: 2)
From the make_pipeline documentation:
This is a shorthand for the Pipeline constructor; it does not require, and
does not permit, naming the estimators. Instead, their names will be set
to the lowercase of their types automatically.
So, in order to access the feature names, after you have fitted the data, you can:
# Perform an IDF normalization on the output of HashingVectorizer
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer
from sklearn.pipeline import make_pipeline

hasher = HashingVectorizer(n_features=10,
                           stop_words='english', non_negative=True,
                           norm=None, binary=False)
tfidf = TfidfVectorizer()
vectorizer = make_pipeline(hasher, tfidf)
# ...
# fit to the data
# ...
# use the instance's class name, lowercased
terms = vectorizer.named_steps[tfidf.__class__.__name__.lower()].get_feature_names()
# or to be more precise, as used in `_name_estimators`:
# terms = vectorizer.named_steps[type(tfidf).__name__.lower()].get_feature_names()
# btw TfidfTransformer and HashingVectorizer do not have get_feature_names afaik
Hope this helps, good luck!
Edit: After seeing the updated question with the example you are following, @Vivek Kumar is correct - this code, terms = vectorizer.get_feature_names(), will not run for the pipeline, but only when the vectorizer is:
vectorizer = TfidfVectorizer(max_df=0.5, max_features=opts.n_features,
                             min_df=2, stop_words='english',
                             use_idf=opts.use_idf)
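Putting it together, a minimal end-to-end sketch of the TfidfVectorizer route (the corpus, cluster count, and parameter values below are made-up illustrations, not values from the question):

from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["machine learning model training", "deep learning model training",
        "cooking pasta recipes", "baking bread recipes"]  # hypothetical corpus

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)

km = MiniBatchKMeans(n_clusters=2, random_state=42).fit(X)
order_centroids = km.cluster_centers_.argsort()[:, ::-1]

terms = vectorizer.get_feature_names()  # works: a plain TfidfVectorizer, not a Pipeline
for i in range(2):
    print("Cluster %d:" % i, " ".join(terms[ind] for ind in order_centroids[i, :3]))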