Sklearn CountVectorizer: get the most frequently occurring features by class

Time: 2015-01-15 19:47:35

Tags: python numpy scipy scikit-learn

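I am using CountVectorizer to convert an array of obs into a document X feature matrix, where each document belongs to one of ~50 different classes. For each class, I would like to see the most frequently occurring features.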

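from sklearn.feature_extraction.text import CountVectorizer

# binary=True records presence/absence of a feature per document, not raw counts
cv = CountVectorizer(binary=True, tokenizer=lambda x: x.split())

# document X feature sparse matrix
vectored_sites = cv.fit_transform([' '.join(generator_features(site)) for site in sites])

# list of classes
# (`class` is a reserved word in Python, so `site.class` is a syntax error; use getattr)
document_classes = [getattr(site, 'class') for site in sites]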

# how to select rows from vectored_sites for each class
class_i_document_features = ??

# compute frequency of each column in class_i_document_features
feature_counts = class_i_document_features.sum(axis=0)
# divide by the number of documents (rows) in the class
feature_frequencies = feature_counts / float(class_i_document_features.shape[0])
# print something like {feature1: frequency1, feature2: frequency2, ...}

I am having trouble filtering down to a single class and then formatting the frequencies into a clear result.

2 answers:

Answer 0: (score: 0)

I would group them like this:

from collections import defaultdict

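# how to select rows from vectored_sites for each class:
# collect the 1 x n_features row vectors of each class in a list
class_i_document_features = defaultdict(list)
for site, vector in zip(sites, vectored_sites):
    # `class` is a reserved word, so the attribute is read with getattr
    class_i_document_features[getattr(site, 'class')].append(vector)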

Is this what you are trying to do?
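
To get from the grouped rows to per-class frequencies, the row vectors of each class can be stacked back into one sparse matrix and summed column-wise. The following is a minimal sketch, assuming numpy and scipy are available. The toy `matrix`, the `labels` list, and `feature_names` are made up for illustration and stand in for `vectored_sites`, the class of each site, and the vectorizer's vocabulary.

```python
from collections import defaultdict

import numpy as np
from scipy.sparse import csr_matrix, vstack

# Hypothetical stand-ins for vectored_sites, the site classes, and the vocabulary.
matrix = csr_matrix(np.array([[1, 0, 1],
                              [1, 1, 0],
                              [0, 1, 1],
                              [1, 0, 0]]))
labels = ["a", "a", "b", "a"]
feature_names = ["x", "y", "z"]

# Group the 1 x n_features row vectors by class.
rows_by_class = defaultdict(list)
for i, label in enumerate(labels):
    rows_by_class[label].append(matrix[i])

# Stack each group and compute the fraction of documents containing each feature.
frequencies_by_class = {}
for label, rows in rows_by_class.items():
    stacked = vstack(rows)
    counts = np.asarray(stacked.sum(axis=0)).ravel()
    fractions = (counts / stacked.shape[0]).tolist()
    frequencies_by_class[label] = dict(zip(feature_names, fractions))

print(frequencies_by_class)
```

Because the toy matrix is binary, each column sum is the number of documents in the class that contain the feature, so dividing by the row count gives a document frequency.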

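Answer 1: (score: 0)

It looks like I was running into a limitation of scipy sparse matrices, which do not support boolean mask indexing as of version 0.11.0, as outlined in: Slicing a scipy sparse matrix using a boolean mask. Given that, I switched to indexing by integer row indexes and arrived at the solution below.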

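import numpy as np

# Python 2 syntax (xrange, print statement)
# this would be in a for loop by class name
class_i = 'someclass'
# integer row indexes of the documents in this class
# (`class` is a reserved word, so the attribute is read with getattr)
class_i_indexes = [i for i in xrange(len(sites)) if getattr(sites[i], 'class') == class_i]
# pair each feature name with its document count in the class, keep the 10 largest
for word, total in sorted(zip(cv.get_feature_names(),
                              np.asarray(vectored_sites[class_i_indexes].sum(axis=0)).ravel()),
                          key=lambda x: -x[1])[:10]:
    print '%s: %d %f' % (word, total, float(total)/len(class_i_indexes))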