I am using
singleTFIDF = TfidfVectorizer(analyzer='char_wb', ngram_range=(4, 6),
                              stop_words=my_stop_words, max_features=50).fit([text])
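and I am wondering why there are spaces in my features (for example, "chat room").
How can I avoid this? Do I need to tokenize and preprocess the text myself?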
Answer 0 (score: 1)
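Use analyzer='word'.
With char_wb, the vectorizer pads each word with a space on both sides before extracting character n-grams, so the n-grams taken at the start or end of a word contain that space. Keep in mind that with analyzer='word', ngram_range=(4, 6) refers to sequences of 4 to 6 words rather than characters, and that stop_words only takes effect with analyzer='word'.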
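From the documentation:
analyzer : string, {'word', 'char', 'char_wb'} or callable
Whether the feature should be made of word or character n-grams. Option 'char_wb' creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.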
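To make the padding concrete, here is a minimal pure-Python sketch of what the documentation describes. The function name `char_wb_ngrams` is hypothetical, and the code only approximates the behaviour (lowercasing, splitting on whitespace, one space of padding per side); it is not scikit-learn's actual implementation.

```python
def char_wb_ngrams(text, min_n, max_n):
    """Approximate 'char_wb': character n-grams inside space-padded words."""
    ngrams = []
    for word in text.lower().split():
        padded = " " + word + " "  # one space on each side of the word
        for n in range(min_n, max_n + 1):
            if n >= len(padded):
                # A padded word no longer than n is emitted once, whole.
                ngrams.append(padded)
                break
            # Slide a window of width n across the padded word.
            for start in range(len(padded) - n + 1):
                ngrams.append(padded[start:start + n])
    return ngrams


print(char_wb_ngrams("Is this", 4, 6))
# [' is ', ' thi', 'this', 'his ', ' this', 'this ', ' this ']
```

Every n-gram that touches the first or last character of a word picks up the padding space, which is where the spaces in the feature names come from.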
Look at the following example to see how this works:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]
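# 'char_wb' builds character n-grams of length 4 to 6 inside word boundaries
vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(4, 6))
X = vectorizer.fit_transform(corpus)
# get_feature_names_out() replaces get_feature_names() in scikit-learn >= 1.0;
# .tolist() converts the NumPy strings to plain str so they print cleanly
print([(len(w), w) for w in vectorizer.get_feature_names_out().tolist()])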
Output:
[(4, ' and'), (5, ' and '), (4, ' doc'), (5, ' docu'), (6, ' docum'),
 (4, ' fir'), (5, ' firs'), (6, ' first'), (4, ' is '), (4, ' one'),
 (5, ' one.'), (6, ' one. '), (4, ' sec'), (5, ' seco'), (6, ' secon'),
 (4, ' the'), (5, ' the '), (4, ' thi'), (5, ' thir'), (6, ' third'),
 (5, ' this'), (6, ' this '), (4, 'and '), (4, 'cond'), (5, 'cond '),
 (4, 'cume'), (5, 'cumen'), (6, 'cument'), (4, 'docu'), (5, 'docum'),
 (6, 'docume'), (4, 'econ'), (5, 'econd'), (6, 'econd '), (4, 'ent '),
 (4, 'ent.'), (5, 'ent. '), (4, 'ent?'), (5, 'ent? '), (4, 'firs'),
 (5, 'first'), (6, 'first '), (4, 'hird'), (5, 'hird '), (4, 'his '),
 (4, 'ird '), (4, 'irst'), (5, 'irst '), (4, 'ment'), (5, 'ment '),
 (5, 'ment.'), (6, 'ment. '), (5, 'ment?'), (6, 'ment? '), (4, 'ne. '),
 (4, 'nt. '), (4, 'nt? '), (4, 'ocum'), (5, 'ocume'), (6, 'ocumen'),
 (4, 'ond '), (4, 'one.'), (5, 'one. '), (4, 'rst '), (4, 'seco'),
 (5, 'secon'), (6, 'second'), (4, 'the '), (4, 'thir'), (5, 'third'),
 (6, 'third '), (4, 'this'), (5, 'this '), (4, 'umen'), (5, 'ument'),
 (6, 'ument '), (6, 'ument.'), (6, 'ument?')]
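If character n-grams are still wanted but without the padded spaces, one possible approach is a custom analyzer. The sketch below is an assumption rather than part of the original answer: `unpadded_char_ngrams` is a hypothetical helper that lowercases the text, splits on whitespace, and takes n-grams strictly inside each word. Words shorter than `min_n` produce no features at all, which differs from char_wb.

```python
def unpadded_char_ngrams(text, min_n=4, max_n=6):
    """Character n-grams taken inside each word, with no space padding."""
    ngrams = []
    for word in text.lower().split():
        for n in range(min_n, max_n + 1):
            # range() is empty when the word is shorter than n,
            # so short words simply contribute nothing for that n.
            for start in range(len(word) - n + 1):
                ngrams.append(word[start:start + n])
    return ngrams


print(unpadded_char_ngrams("Is this the first"))
# ['this', 'firs', 'irst', 'first']
```

Since the analyzer parameter is documented as accepting a callable, this function could presumably be passed as TfidfVectorizer(analyzer=unpadded_char_ngrams). In that case the vectorizer's own ngram_range, lowercase, and stop_words settings would not be applied, so any such handling has to live inside the function.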