I have the following code that extracts features from a set of files (the folder name is the category name) for text classification.
import sklearn.datasets
from sklearn.feature_extraction.text import TfidfVectorizer
train = sklearn.datasets.load_files('./train', description=None, categories=None, load_content=True, shuffle=True, encoding=None, decode_error='strict', random_state=0)
print len(train.data)
print train.target_names
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train.data)
It throws the following stack trace:
Traceback (most recent call last):
File "C:\EclipseWorkspace\TextClassifier\main.py", line 16, in <module>
X_train = vectorizer.fit_transform(train.data)
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 1285, in fit_transform
X = super(TfidfVectorizer, self).fit_transform(raw_documents)
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 804, in fit_transform
self.fixed_vocabulary_)
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 739, in _count_vocab
for feature in analyze(doc):
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 236, in <lambda>
tokenize(preprocess(self.decode(doc))), stop_words)
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 113, in decode
doc = doc.decode(self.encoding, self.decode_error)
File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 32054: invalid start byte
I am running Python 2.7. How can I get this to work?
EDIT
I just found that this works fine for files encoded in UTF-8 (my files are ANSI encoded). Is there any way I can get sklearn.datasets.load_files() to work with ANSI encoding?
Answer 0 (score: 0)
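Plain ASCII is a strict subset of UTF-8, so files containing only ASCII characters should work fine. However, the stack trace shows that your input contains the byte 0xFF somewhere, which is not a valid ASCII character and can never appear in valid UTF-8. In a Windows "ANSI" code page such as cp1252 it is a legitimate character, so the files have to be decoded with that code page rather than as UTF-8.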
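To see how the byte 0xFF behaves under each codec, here is a small sketch. It assumes "ANSI" means the Windows-1252 code page (cp1252), which is a guess about the asker's files, and the sample bytes are invented for illustration.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Hypothetical sample: the bytes a cp1252 ("ANSI") editor would write
# for the text "naive" with a diaeresis on the i, followed by " y-diaeresis".
raw = b"na\xefve \xff"

# Decoding with the code page the bytes were written in succeeds.
text = raw.decode("cp1252")

# Strict UTF-8 decoding rejects the same bytes.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as exc:
    utf8_ok = False
    print("utf-8 failed:", exc)

# errors="ignore" avoids the exception but silently drops the bad bytes.
lossy = raw.decode("utf-8", "ignore")

print(repr(text))
print(repr(lossy))
```

Note that the lossy result has lost both non-ASCII characters, which is the trade-off of ignoring decode errors instead of naming the right encoding.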
Answer 1 (score: 0)
I solved the problem by changing the decode error setting from 'strict' to 'ignore':
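from sklearn.feature_extraction.text import CountVectorizer

# doc_str_list_train is assumed to be a list of raw byte strings, one per document.
# decode_error='ignore' makes the vectorizer drop bytes that are not valid UTF-8.
vectorizer = CountVectorizer(binary=True, decode_error='ignore')
word_tokenizer = vectorizer.build_tokenizer()
doc_terms_list_train = [word_tokenizer(doc_str.decode('utf-8', 'ignore')) for doc_str in doc_str_list_train]
doc_train_vec = vectorizer.fit_transform(doc_str_list_train)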
See the scikit-learn documentation for a detailed explanation of the CountVectorizer function.
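As an alternative to ignoring errors, load_files() itself accepts an encoding argument, which addresses the follow-up in the question. The sketch below is only an illustration: it assumes the files are cp1252, and the temporary folder, category names and sample texts are made up so that the example runs on its own.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

import io
import os
import tempfile

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer

# Build a tiny throwaway corpus: one sub-folder per category,
# each file saved in cp1252 (hypothetical stand-in for ./train).
root = tempfile.mkdtemp()
samples = {
    "pos": u"na\xefve but good \xff",
    "neg": u"bad and boring",
}
for category, sample_text in samples.items():
    os.mkdir(os.path.join(root, category))
    path = os.path.join(root, category, "doc.txt")
    with io.open(path, "w", encoding="cp1252") as fh:
        fh.write(sample_text)

# Assumption: the files are cp1252. Use the code page your files really have.
# With encoding set, load_files returns decoded text instead of raw bytes.
train = load_files(root, encoding="cp1252", decode_error="strict", random_state=0)

# The vectorizer now receives text, so it has nothing left to decode.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train.data)

print(train.target_names)
print(X_train.shape)
```

Leaving load_files() at encoding=None and passing encoding='cp1252' to the vectorizer instead should have the same effect; either way no characters are thrown away, unlike with decode_error='ignore'.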