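I am trying to use soundex to convert each word in a line into its hashed (phonetic) form, and then run some machine learning on the result with scikit-learn.
My code is as follows: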
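from sklearn.feature_extraction.text import CountVectorizer

# soundex() is a helper whose definition is not shown here
train = []
for line in text:
    a = ' '
    sound = []
    # encode every whitespace-separated word of the line
    for word in line.split():
        sound.append(soundex(word))
    # rebuild the line from the encoded words
    a = ' '.join(sound)
    train.append(a)

# build the bag-of-words count matrix
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(real_train)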
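For anyone who wants to reproduce the encoding step without my soundex helper, here is a minimal sketch of what it does. It assumes the standard American Soundex rules (first letter kept, three digits, padded with zeros); the names `CODES` and `encode_line` are made up for this illustration and are not part of my actual code or of any library.

```python
# Consonant groups used by American Soundex; vowels, y, h and w get no digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return a 4-character Soundex code, or '' if the word has no letters."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    encoded = first.upper()
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            # h and w do not separate two consonants with the same code
            continue
        code = CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        # vowels reset prev, so a repeated code after a vowel is kept
        prev = code
    return (encoded + "000")[:4]


def encode_line(line):
    """Replace every whitespace-separated word in a line by its Soundex code."""
    return " ".join(soundex(word) for word in line.split())


print(encode_line("Robert Rupert"))  # R163 R163
```

Each encoded line is a plain space-separated string, which is the kind of input CountVectorizer expects in a list of documents.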
However, when I run this code, the fit_transform call fails with the following error:
    X_train_counts = count_vect.fit_transform(real_train)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py", line 780, in fit_transform
    vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py", line 710, in _count_vocab
    analyze = self.build_analyzer()
  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py", line 226, in build_analyzer
    tokenize = self.build_tokenizer()
  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py", line 203, in build_tokenizer
    token_pattern = re.compile(self.token_pattern)
  File "/usr/lib/python2.7/re.py", line 190, in compile
    return _compile(pattern, flags)
  File "/usr/lib/python2.7/re.py", line 244, in _compile
    raise error, v # invalid expression
sre_constants.error: unexpected end of pattern
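The last frames show the failure happening inside `re.compile(self.token_pattern)`, so one thing that may be worth checking is whether the vectorizer's token pattern compiles on its own. Below is a minimal sketch of such a check using only the standard library. `check_token_pattern` is a hypothetical helper name, and `"(?"` is just an illustrative malformed pattern, not necessarily the one involved here; the first pattern is scikit-learn's documented default `token_pattern`.

```python
import re


def check_token_pattern(pattern):
    """Return None if the pattern compiles, otherwise the regex error message."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None


# scikit-learn's documented default token_pattern compiles without error
print(check_token_pattern(r"(?u)\b\w\w+\b"))

# an incomplete group such as "(?" is reported as a malformed pattern
print(check_token_pattern("(?"))
```

Passing the vectorizer's actual pattern (for example `count_vect.token_pattern`) to the same helper would show whether that regex is the part that fails.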