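I am using scikit-learn's TfidfVectorizer for text classification, and my documents contain spaces. For my classification task, the space character is part of my vocabulary. My question is: how can I include the space character in the vocabulary?
Code example: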
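from sklearn.feature_extraction.text import TfidfVectorizer

# Characters to use as features, including the space character
vocab = [' ', '<', '>', '"', '#', '\'']
# Character-level analyzer restricted to the fixed vocabulary above
vectorizer = TfidfVectorizer(analyzer='char', vocabulary=set(vocab))
X = vectorizer.fit_transform(df['x'])  # df['x'] holds the documents
y = df['y']
print(vectorizer.vocabulary_)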
It raises the following error:
Traceback (most recent call last):
  File "/empty/path/script.py", line 158, in <module>
    tf_idf_analysis(http_df);
  File "/empty/path/script.py", line 96, in tf_idf_analysis
    X = vectorizer.fit_transform(df['x']);
  File "/usr/local/lib/python3.6/dist-packages/sklearn/feature_extraction/text.py", line 1381, in fit_transform
    X = super(TfidfVectorizer, self).fit_transform(raw_documents)
  File "/usr/local/lib/python3.6/dist-packages/sklearn/feature_extraction/text.py", line 869, in fit_transform
    self.fixed_vocabulary_)
  File "/usr/local/lib/python3.6/dist-packages/sklearn/feature_extraction/text.py", line 792, in _count_vocab
    for feature in analyze(doc):
  File "/usr/local/lib/python3.6/dist-packages/sklearn/feature_extraction/text.py", line 255, in <lambda>
    return lambda doc: self._char_ngrams(preprocess(self.decode(doc)))
  File "/usr/local/lib/python3.6/dist-packages/sklearn/feature_extraction/text.py", line 158, in _char_ngrams
    text_document = self._white_spaces.sub(" ", text_document)
TypeError: expected string or bytes-like object
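
A note on the traceback, offered as a guess rather than a confirmed diagnosis: the failing call is a regular-expression substitution on `text_document`, and `TypeError: expected string or bytes-like object` is what `re` raises when it is handed something that is not a string (for example `None`, a number, or a NaN from a missing cell). That would point at the contents of `df['x']` rather than at the space in the vocabulary. The sketch below uses only the standard library to reproduce that failure. The pattern is assumed to match scikit-learn's internal `_white_spaces`, and `clean_documents` is a hypothetical helper, not part of scikit-learn.

```python
import re

# Assumption: this mirrors the `_white_spaces` pattern used in
# sklearn/feature_extraction/text.py to collapse runs of whitespace.
WHITE_SPACES = re.compile(r"\s\s+")


def collapse_whitespace(doc):
    """Mimic the line that fails in the traceback."""
    return WHITE_SPACES.sub(" ", doc)


def clean_documents(docs):
    """Hypothetical helper: map missing values to '' and everything else to str."""
    cleaned = []
    for doc in docs:
        # `doc != doc` is True only for NaN, which is how missing cells often appear
        if doc is None or doc != doc:
            cleaned.append("")
        else:
            cleaned.append(str(doc))
    return cleaned


# Made-up sample documents: one real string plus three non-string values
raw = ["GET  /a b", float("nan"), None, 42]

# Passing a non-string to the regex reproduces the TypeError
try:
    collapse_whitespace(raw[1])
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__

# After cleaning, every document is a string and the substitution succeeds
cleaned = clean_documents(raw)
collapsed = [collapse_whitespace(doc) for doc in cleaned]

print(error_name)
print(collapsed)
```

If this is the cause, passing cleaned strings (for instance `vectorizer.fit_transform(clean_documents(df['x']))`) should get past this line. Whether the space then shows up as a feature can be checked with `print(vectorizer.vocabulary_)`.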