Question

I'm using

singleTFIDF = TfidfVectorizer(analyzer='char_wb', ngram_range=(4, 6),
                              stop_words=my_stop_words, max_features=50).fit([text])

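and I'm wondering why some of my features (such as "chat room") contain spaces.

How can I avoid this? Do I need to do the tokenization and preprocessing myself?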

Answer 1

Use analyzer='word'.

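With char_wb, the vectorizer does not tokenize the text into word features. It builds character n-grams inside each word and pads the word with a space on either side, so the n-grams taken at a word's edges include that space.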

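According to the documentation:

analyzer : string, {'word', 'char', 'char_wb'} or callable

Whether the feature should be made of word or character n-grams. Option 'char_wb' creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.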
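To make the padding rule concrete, here is a rough pure-Python sketch of what the documentation describes. The helper name char_wb_ngrams is made up for illustration and this is not scikit-learn's actual implementation; in particular, it simply skips n-gram sizes longer than the padded word, and the library may treat such short words differently.

```python
def char_wb_ngrams(text, min_n, max_n):
    """Illustrative approximation of 'char_wb' (hypothetical helper)."""
    ngrams = []
    for word in text.lower().split():
        # Each word is padded with one space on both sides.
        padded = ' ' + word + ' '
        for n in range(min_n, max_n + 1):
            if len(padded) < n:
                break  # simplification: skip sizes longer than the padded word
            # Slide a window of width n over the padded word only,
            # so n-grams never span two words.
            for start in range(len(padded) - n + 1):
                ngrams.append(padded[start:start + n])
    return ngrams


print(char_wb_ngrams('this is', 4, 6))
# [' thi', 'this', 'his ', ' this', 'this ', ' this ', ' is ']
```

Every n-gram that touches the start or end of a word carries the padding space, which is where the spaces in the feature names come from.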

The following example shows this behaviour:

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
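vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(4, 6))
X = vectorizer.fit_transform(corpus)
# Print each feature with its length so the padding spaces are easy to spot.
# get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names().
print([(len(w), w) for w in vectorizer.get_feature_names_out()])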

Output:

[(4, ' and'), (5, ' and '), (4, ' doc'), (5, ' docu'), (6, ' docum'), (4, ' fir'), (5, ' firs'), (6, ' first'), (4, ' is '), (4, ' one'), (5, ' one.'), (6, ' one. '), (4, ' sec'), (5, ' seco'), (6, ' secon'), (4, ' the'), (5, ' the '), (4, ' thi'), (5, ' thir'), (6, ' third'), (5, ' this'), (6, ' this '), (4, 'and '), (4, 'cond'), (5, 'cond '), (4, 'cume'), (5, 'cumen'), (6, 'cument'), (4, 'docu'), (5, 'docum'), (6, 'docume'), (4, 'econ'), (5, 'econd'), (6, 'econd '), (4, 'ent '), (4, 'ent.'), (5, 'ent. '), (4, 'ent?'), (5, 'ent? '), (4, 'firs'), (5, 'first'), (6, 'first '), (4, 'hird'), (5, 'hird '), (4, 'his '), (4, 'ird '), (4, 'irst'), (5, 'irst '), (4, 'ment'), (5, 'ment '), (5, 'ment.'), (6, 'ment. '), (5, 'ment?'), (6, 'ment? '), (4, 'ne. '), (4, 'nt. '), (4, 'nt? '), (4, 'ocum'), (5, 'ocume'), (6, 'ocumen'), (4, 'ond '), (4, 'one.'), (5, 'one. '), (4, 'rst '), (4, 'seco'), (5, 'secon'), (6, 'second'), (4, 'the '), (4, 'thir'), (5, 'third'), (6, 'third '), (4, 'this'), (5, 'this '), (4, 'umen'), (5, 'ument'), (6, 'ument '), (6, 'ument.'), (6, 'ument?')]
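For comparison, here is a minimal sketch of the analyzer='word' suggestion on a corpus like the one in this answer. It assumes the default ngram_range of (1, 1): with the word analyzer, ngram_range counts words rather than characters, so a range such as (4, 6) would produce multi-word features that are themselves joined by spaces. The variable names are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Word-level features; ngram_range is left at its default of (1, 1).
word_vectorizer = TfidfVectorizer(analyzer='word')
word_vectorizer.fit(corpus)

# get_feature_names_out() exists in scikit-learn >= 1.0;
# fall back to get_feature_names() on older releases.
get_names = getattr(word_vectorizer, 'get_feature_names_out', None)
if get_names is None:
    get_names = word_vectorizer.get_feature_names
word_features = list(get_names())

print(word_features)
```

Each feature is now a whole lowercase token, so none of them contains a padding space. The trade-off is that the sub-word information of character n-grams is lost.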

TF-IDF vectorizer with char_wb has spaces in its feature words?
