Question

How can I get the term frequency (TF) of each term in the vocabulary created by sklearn.feature_extraction.text.CountVectorizer and put the values in a list or a dict?

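All the values associated with the keys in the vocabulary appear to be integers smaller than the max_features I set manually when initializing CountVectorizer, rather than TFs, which should be floats. Can anyone help me?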

from sklearn.feature_extraction.text import CountVectorizer

CV = CountVectorizer(ngram_range=(ngram_min_file_opcode, ngram_max_file_opcode),
                     decode_error="ignore", max_features=max_features_file_re,
                     token_pattern=r'\b\w+\b', min_df=1, max_df=1.0)
x = CV.fit_transform(x).toarray()

Answer 1

If you expect float values, you are probably looking for TF-IDF. In that case, use either sklearn.feature_extraction.text.TfidfVectorizer, or sklearn.feature_extraction.text.CountVectorizer followed by sklearn.feature_extraction.text.TfidfTransformer.

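If you really only want TF, you can still use TfidfVectorizer, or CountVectorizer followed by TfidfTransformer. Just make sure to set the use_idf argument of the TfidfVectorizer/TfidfTransformer to False and the norm (normalization) argument to 'l1' or 'l2'. This normalizes the TF counts.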

From the SKLearn documentation:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> print(vectorizer.get_feature_names())
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
>>> print(X.toarray())
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]

The row [0 1 1 1 0 0 1 0 1] corresponds to the first document. The first element is the number of times 'and' occurs in that document, the second is the count for 'document', the third for 'first', and so on.
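To get the counts into a dict, note that vocabulary_ maps each term to its column index in the count matrix, not to a count, which is why the integers in the question are all below max_features. A rough sketch on the same corpus, where term_counts and term_freq are names made up for this example and "TF" is taken to mean the count over the whole corpus:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Sum every column of the sparse matrix: one total count per term.
totals = np.asarray(X.sum(axis=0)).ravel()

# vocabulary_ is {term: column index}; use the index to look up the total.
term_counts = {term: int(totals[idx])
               for term, idx in vectorizer.vocabulary_.items()}

# Relative frequency over the whole corpus (floats that sum to 1).
term_freq = {term: count / totals.sum()
             for term, count in term_counts.items()}

print(term_counts)
print(term_freq)
```

For per-document values, index a single row of X.toarray() with the same column indices instead of summing over the columns.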
