Question

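I am using CountVectorizer to build a sparse matrix representation of a co-occurrence matrix.

I have a list of sentences and a second list (vector) of "weights": the number of times I want each sentence's tokens to be counted.

I could build a list in which each sentence is repeated as many times as its weight, but that is very inefficient and un-Pythonic. Some of my weights are in the millions or higher.

How can I efficiently tell CountVectorizer to use my weight vector?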

Answer 1

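Since there is no way (that I could find) to apply a weight to each sentence passed to CountVectorizer, you can instead multiply the resulting sparse matrix by the weights.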

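import scipy.sparse
from sklearn.feature_extraction.text import CountVectorizer

# space_splitter, all_strings and df.weight are defined elsewhere (not shown).
cv = CountVectorizer(lowercase=False, min_df=0.001, tokenizer=space_splitter)
X = cv.fit_transform(all_strings)

# Put the weight (count) of each sentence on the diagonal of a sparse matrix.
counts = scipy.sparse.diags(df.weight, 0)
# Scale row i of X by weight i: the weighted document-term matrix.
Xw = counts @ X
# Co-occurrence matrix. Keep the unweighted X on one side so each sentence
# contributes weight times (weighting both sides would contribute weight squared).
Xc = X.T @ Xw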

Note that the matrix you multiply by must itself be a sparse matrix, with the weights on its diagonal.
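As a sanity check, here is a minimal sketch comparing the diagonal-weight approach against literally repeating each sentence. The sentences and weights are made-up toy data, not from the question, and the default CountVectorizer tokenizer is assumed rather than the custom one used above. It assumes the goal is to reproduce exactly what repetition would give, which means applying the weights once in the co-occurrence product.

```python
import numpy as np
import scipy.sparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data for illustration only.
sentences = ["red apple", "green apple", "red green apple"]
weights = np.array([3, 1, 2])

cv = CountVectorizer()
X = cv.fit_transform(sentences)  # unweighted document-term matrix (sparse)

# Weighted approach: sparse diagonal matrix of weights, applied once.
W = scipy.sparse.diags(weights, 0)
X_weighted = W @ X                           # row i scaled by weights[i]
cooc_weighted = (X.T @ X_weighted).toarray()

# Reference approach: repeat each sentence weights[i] times (inefficient).
repeated = [s for s, w in zip(sentences, weights) for _ in range(w)]
X_rep = cv.transform(repeated)
cooc_repeated = (X_rep.T @ X_rep).toarray()

# Per-token totals and co-occurrence counts should agree.
token_totals_weighted = np.asarray(X_weighted.sum(axis=0)).ravel()
token_totals_repeated = np.asarray(X_rep.sum(axis=0)).ravel()
same_counts = np.array_equal(token_totals_weighted, token_totals_repeated)
same_cooc = np.array_equal(cooc_weighted, cooc_repeated)
print(same_counts, same_cooc)
```

The weighted version never materializes the repeated rows, so its cost depends on the number of distinct sentences rather than on the size of the weights.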
