Question

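My dataset contains a block of text and a column with an aggregated count. It looks like this:

text, count (column names)

this is my home, 100

where am i, 10

this is a piece of cake, 2

This is the code I found on the internet for building unigrams: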

import sklearn as sk
import sklearn.feature_extraction.text  # makes sk.feature_extraction.text available

def get_top_n_words(corpus, n=None):
    vec = sk.feature_extraction.text.CountVectorizer().fit(corpus)
    bag_of_words = vec.transform(corpus)  # sparse document-term matrix
    sum_words = bag_of_words.sum(axis=0)  # 1 x n_terms matrix of total counts
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_words(df['text'], 20)

With the standard CountVectorizer, I get unigram counts like this:

this 2
is 2
my 1
where 1
am 1
i 1
a 1
piece 1
of 1
cake 1

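I would like the result to be weighted by the count column, since that column is an aggregated count, i.e.:

this 102
is 102
my 100
where 10
am 10
i 10
a 2
piece 2
of 2
cake 2

Is this possible?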
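
To make the target numbers concrete, here is a minimal plain-Python sketch of the weighting arithmetic. It does not use CountVectorizer; the `rows` list and the `weighted_unigrams` helper are hypothetical names introduced only for illustration, and the whitespace tokenizer is an assumption.

```python
from collections import Counter

# Sample rows from the question: (text, aggregated count).
rows = [
    ("this is my home", 100),
    ("where am i", 10),
    ("this is a piece of cake", 2),
]

def weighted_unigrams(rows):
    """Add a row's count to a word's total once per occurrence in that row."""
    totals = Counter()
    for text, count in rows:
        for word in text.split():  # naive whitespace tokenizer (assumption)
            totals[word] += count
    return totals

totals = weighted_unigrams(rows)
# "this" appears in the rows weighted 100 and 2, so 100 + 2 = 102.
print(totals["this"], totals["where"], totals["cake"])  # 102 10 2
```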

Answer 1

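One option is to call toarray on the result of transform, so that each row of the document-term matrix can be multiplied by its count value (an element-wise, broadcast multiplication rather than a true matrix product) before summing: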

from sklearn.feature_extraction.text import CountVectorizer

def get_top_n_words(corpus, count, n=None):  # count: per-row weights, e.g. df['count']
    vec = CountVectorizer().fit(corpus)
    # densify the document-term matrix and scale each row by its count (broadcast)
    bag_of_words = vec.transform(corpus).toarray() * count.values[:, None]
    sum_words = bag_of_words.sum(axis=0)
    # sum_words is now a 1-D array, so it is indexed with idx alone
    words_freq = [(word, int(sum_words[idx])) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_words(df['text'], df['count'], 20)
print(common_words)
[('this', 102),
 ('is', 102),
 ('my', 100),
 ('home', 100),
 ('where', 10),
 ('am', 10),
 ('piece', 2),
 ('of', 2),
 ('cake', 2)]
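
Two things stand out in the output above: 'i' and 'a' are missing, because CountVectorizer's default token_pattern only keeps tokens of two or more characters, and toarray builds a dense matrix, which may be costly for a large corpus. The following is a sketch of one possible variant, not part of the original answer: it keeps the matrix sparse and exposes the real CountVectorizer parameters ngram_range and token_pattern. The function name `weighted_top_ngrams` and the sample data are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def weighted_top_ngrams(texts, weights, n=None, ngram_range=(1, 1),
                        token_pattern=r"(?u)\b\w\w+\b"):
    """Return (term, weighted count) pairs sorted by weighted count, descending.

    token_pattern defaults to scikit-learn's own default, which drops
    single-character tokens; pass r"(?u)\b\w+\b" to keep them.
    """
    vec = CountVectorizer(ngram_range=ngram_range, token_pattern=token_pattern)
    X = vec.fit_transform(texts)              # sparse, shape (n_docs, n_terms)
    w = np.asarray(weights, dtype=np.int64)   # one weight per document
    # X.T @ w sums, for each term, (occurrences in doc) * (weight of doc).
    totals = np.asarray(X.T @ w).ravel()
    pairs = [(term, int(totals[idx])) for term, idx in vec.vocabulary_.items()]
    pairs.sort(key=lambda p: p[1], reverse=True)
    return pairs[:n]

# Sample data mirroring the question (assumption: same three rows).
texts = ["this is my home", "where am i", "this is a piece of cake"]
weights = [100, 10, 2]

unigrams = dict(weighted_top_ngrams(texts, weights))
with_short = dict(weighted_top_ngrams(texts, weights, token_pattern=r"(?u)\b\w+\b"))
bigrams = dict(weighted_top_ngrams(texts, weights, ngram_range=(2, 2)))
```

With a DataFrame, the call would be weighted_top_ngrams(df['text'], df['count'], 20). Passing ngram_range=(1, 3) would count unigrams, bigrams and trigrams together.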

How can I use CountVectorizer with weights (taken from a column's values) instead of raw counts to build weighted unigrams/bigrams/trigrams?
