TF-IDF vectorizer has whitespace in feature words with char_wb?

Time: 2019-01-22 13:01:50

Tags: python scikit-learn tfidfvectorizer

I am using

singleTFIDF = TfidfVectorizer(analyzer='char_wb', ngram_range=(4, 6),
                              stop_words=my_stop_words, max_features=50).fit([text])

and I would like to know why my features (for example "chat room") contain spaces.

How can I avoid this? Do I need to handle it with my own preprocessing?

1 Answer:

Answer 0: (score: 1)

Use analyzer='word'.

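With char_wb, the vectorizer pads each word with a space on both sides before extracting character n-grams, so the n-grams at word edges contain whitespace. The features are built from characters, not from word tokens.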

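According to the documentation:

analyzer : string, {'word', 'char', 'char_wb'} or callable

Whether the feature should be made of word or character n-grams. Option 'char_wb' creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.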
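The padding behaviour described in the documentation can be imitated in a few lines of plain Python. This is a simplified sketch, not scikit-learn's own implementation: char_wb_ngrams is a hypothetical helper, and it assumes lowercasing and splitting on whitespace only.

```python
def char_wb_ngrams(text, min_n=4, max_n=6):
    """Simplified sketch of 'char_wb'-style n-grams (not scikit-learn's code)."""
    ngrams = []
    for word in text.lower().split():
        padded = " " + word + " "  # one space of padding on each side
        for n in range(min_n, max_n + 1):
            if n >= len(padded):
                # A padded word no longer than n is emitted once, whole.
                ngrams.append(padded)
                break
            for start in range(len(padded) - n + 1):
                ngrams.append(padded[start:start + n])
    return ngrams


print(char_wb_ngrams("is the"))
# [' is ', ' the', 'the ', ' the ']
```

Every n-gram that touches the start or end of a word carries the padding space, which is where the spaces in the feature names come from.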

The following example illustrates the behavior:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
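# char_wb pads each word with a space on both sides, then takes 4- to 6-character n-grams
vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(4, 6))
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it (1.0+)
print([(len(w), str(w)) for w in vectorizer.get_feature_names_out()])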

Output:

  

[(4, ' and'), (5, ' and '), (4, ' doc'), (5, ' docu'), (6, ' docum'),
 (4, ' fir'), (5, ' firs'), (6, ' first'), (4, ' is '), (4, ' one'),
 (5, ' one.'), (6, ' one. '), (4, ' sec'), (5, ' seco'), (6, ' secon'),
 (4, ' the'), (5, ' the '), (4, ' thi'), (5, ' thir'), (6, ' third'),
 (5, ' this'), (6, ' this '), (4, 'and '), (4, 'cond'), (5, 'cond '),
 (4, 'cume'), (5, 'cumen'), (6, 'cument'), (4, 'docu'), (5, 'docum'),
 (6, 'docume'), (4, 'econ'), (5, 'econd'), (6, 'econd '), (4, 'ent '),
 (4, 'ent.'), (5, 'ent. '), (4, 'ent?'), (5, 'ent? '), (4, 'firs'),
 (5, 'first'), (6, 'first '), (4, 'hird'), (5, 'hird '), (4, 'his '),
 (4, 'ird '), (4, 'irst'), (5, 'irst '), (4, 'ment'), (5, 'ment '),
 (5, 'ment.'), (6, 'ment. '), (5, 'ment?'), (6, 'ment? '), (4, 'ne. '),
 (4, 'nt. '), (4, 'nt? '), (4, 'ocum'), (5, 'ocume'), (6, 'ocumen'),
 (4, 'ond '), (4, 'one.'), (5, 'one. '), (4, 'rst '), (4, 'seco'),
 (5, 'secon'), (6, 'second'), (4, 'the '), (4, 'thir'), (5, 'third'),
 (6, 'third '), (4, 'this'), (5, 'this '), (4, 'umen'), (5, 'ument'),
 (6, 'ument '), (6, 'ument.'), (6, 'ument?')]
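If character n-grams are still wanted but without the padding spaces, one option is to do the tokenizing yourself, as the question suggests. The sketch below is a hypothetical helper, inner_char_ngrams, that takes n-grams strictly inside each word. It assumes a word is a run of \w characters, so punctuation is dropped as well, and words shorter than min_n produce no features.

```python
import re


def inner_char_ngrams(text, min_n=4, max_n=6):
    """Character n-grams inside each word, with no space padding (hypothetical helper)."""
    ngrams = []
    # Assumption: a "word" is a run of \w characters; punctuation is discarded.
    for word in re.findall(r"\w+", text.lower()):
        for n in range(min_n, max_n + 1):
            # Words shorter than n give an empty range and contribute nothing.
            for start in range(len(word) - n + 1):
                ngrams.append(word[start:start + n])
    return ngrams


print(inner_char_ngrams("Chat room."))
# ['chat', 'room']
```

Since the documentation lists a callable as a valid analyzer value, a function like this could be passed as analyzer=inner_char_ngrams. In that case the callable is responsible for its own lowercasing and filtering, which is why the helper lowercases the text itself.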