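I am using CountVectorizer to convert an array of objects into a document X feature matrix, where each document belongs to one of ~50 different classes. For each class, I would like to see the most frequently occurring features.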
cv = CountVectorizer(binary=True, tokenizer=lambda x: x.split())
# document X feature sparse matrix
vectored_sites = cv.fit_transform([' '.join([f for f in generator_features(site)]) for site in sites])
# list of classes (`class` is a reserved word, so `site.class` would be a syntax error)
document_classes = [getattr(site, 'class') for site in sites]
# how to select rows from vectored_sites for each class
class_i_document_features = ??
# compute frequency of each column in class_i_document_features
feature_counts = class_i_document_features.sum(axis=0)
feature_frequencies = feature_counts / float(class_i_document_features.shape[0])
# print something like {feature1: frequency1, feature2: frequency2, ...}
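I am having trouble filtering down to a single class and then formatting the frequencies into a readable result.
Answer 0 (score: 0)
I would group them like this: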
from collections import defaultdict
# how to select rows from vectored_sites for each class
class_i_document_features = defaultdict(list)
for site, vector in zip(sites, vectored_sites):
    # each vector is one 1 x n_features sparse row of vectored_sites
    class_i_document_features[getattr(site, 'class')].append(vector)
Is this what you are trying to do?
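To get from the grouped rows to per-class frequencies, one option is to stack each group back into a single sparse matrix and sum its columns. The following is only a sketch in Python 3, assuming numpy and scipy are available; the toy matrix, `feature_names`, and `document_classes` are hypothetical stand-ins for the `cv` output and the `sites` data in the question.

```python
from collections import defaultdict

import numpy as np
from scipy.sparse import csr_matrix, vstack

# Hypothetical toy data: 4 documents, 3 binary features, 2 classes.
feature_names = ["alpha", "beta", "gamma"]
vectored_sites = csr_matrix(np.array([
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
    [0, 1, 0],
]))
document_classes = ["a", "a", "b", "b"]

# Group the 1 x n_features sparse rows by class label.
rows_by_class = defaultdict(list)
for i, label in enumerate(document_classes):
    rows_by_class[label].append(vectored_sites[i])

# Stack each group into one matrix, then compute for every feature
# the fraction of that class's documents in which it occurs.
frequencies_by_class = {}
for label, rows in rows_by_class.items():
    class_matrix = vstack(rows)
    counts = np.asarray(class_matrix.sum(axis=0)).ravel()
    fractions = counts / class_matrix.shape[0]
    frequencies_by_class[label] = {
        name: float(value) for name, value in zip(feature_names, fractions)
    }
```

Because the vectorizer in the question uses binary=True, each column sum is the number of documents in the class that contain the feature, so dividing by the row count gives a document frequency.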
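Answer 1 (score: 0)
It looks like I was running into a limitation of scipy sparse matrices, which as of version 0.11.0 do not support indexing with a boolean mask, as outlined in: Slicing a scipy sparse matrix using a boolean mask. Given that, I switched to indexing with a list of integer row indexes and arrived at the solution below: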
import numpy as np
# this would be in a for loop by class name
class_i = 'someclass'
# integer row indexes of the documents that belong to class_i
class_i_indexes = [i for i in xrange(len(sites)) if getattr(sites[i], 'class') == class_i]
# print the 10 most common features with their count and per-document frequency
for word, total in sorted(zip(cv.get_feature_names(),
                              np.asarray(vectored_sites[class_i_indexes].sum(axis=0)).ravel()),
                          key=lambda x: -x[1])[:10]:
    print '%s: %d %f' % (word, total, float(total)/len(class_i_indexes))
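For readers on Python 3, the same integer-index approach might be wrapped in a small helper along these lines. This is a sketch rather than a drop-in replacement: `top_features` is a hypothetical name, the toy matrix and labels are made up, and `feature_names` stands in for whatever list of feature names the vectorizer provides. `np.flatnonzero` is used here to turn a boolean mask into integer row indexes, which avoids boolean-mask indexing on the sparse matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix


def top_features(matrix, feature_names, labels, target, n=10):
    """Return up to n (feature, count, frequency) tuples for one class."""
    # Convert the boolean mask into integer row indexes before slicing.
    row_indexes = np.flatnonzero(np.asarray(labels) == target)
    if len(row_indexes) == 0:
        return []
    counts = np.asarray(matrix[row_indexes].sum(axis=0)).ravel()
    # Stable sort on the negated counts: highest count first, ties in column order.
    order = np.argsort(-counts, kind="stable")[:n]
    return [
        (feature_names[j], int(counts[j]), float(counts[j]) / len(row_indexes))
        for j in order
    ]


# Hypothetical toy data: 4 documents, 3 binary features, 2 classes.
feature_names = ["alpha", "beta", "gamma"]
vectored_sites = csr_matrix(np.array([
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
    [0, 1, 0],
]))
document_classes = ["a", "a", "b", "b"]

for class_i in sorted(set(document_classes)):
    for word, total, frequency in top_features(
            vectored_sites, feature_names, document_classes, class_i):
        print("%s %s: %d %f" % (class_i, word, total, frequency))
```

Returning tuples instead of printing inside the helper keeps the formatting step separate, so the same result can be printed, turned into a dict, or written to a file.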