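I'm using scikit-learn for text classification. It works fine with a single feature, but introducing multiple features gives me errors. I think the problem is that I'm not formatting the data the way the classifier expects.
For example, this works fine: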
data = np.array(df['feature1'])
classes = label_encoder.transform(np.asarray(df['target']))
X_train, X_test, Y_train, Y_test = train_test_split(data, classes)
classifier = Pipeline(...)
classifier.fit(X_train, Y_train)
But this:
data = np.array(df[['feature1', 'feature2']])
classes = label_encoder.transform(np.asarray(df['target']))
X_train, X_test, Y_train, Y_test = train_test_split(data, classes)
classifier = Pipeline(...)
classifier.fit(X_train, Y_train)
dies with
Traceback (most recent call last):
File "/Users/jed/Dropbox/LegalMetric/LegalMetricML/motion_classifier.py", line 157, in <module>
classifier.fit(X_train, Y_train)
File "/Library/Python/2.7/site-packages/sklearn/pipeline.py", line 130, in fit
Xt, fit_params = self._pre_transform(X, y, **fit_params)
File "/Library/Python/2.7/site-packages/sklearn/pipeline.py", line 120, in _pre_transform
Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 780, in fit_transform
vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 715, in _count_vocab
for feature in analyze(doc):
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 229, in <lambda>
tokenize(preprocess(self.decode(doc))), stop_words)
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 195, in <lambda>
return lambda x: strip_accents(x.lower())
AttributeError: 'numpy.ndarray' object has no attribute 'lower'
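during the preprocessing stage, after classifier.fit() is called. I think the problem is the way I'm formatting the data, but I can't figure out how to get it right.
feature1 and feature2 are both English text strings, as is the target. I'm using LabelEncoder() to encode the target, which seems to work fine.
Here is a sample of what print data returns, to give you an idea of its current format.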
[['some short english text'
'a paragraph of english text']
['some more short english text'
'a second paragraph of english text']
['some more short english text'
'a third paragraph of english text']]
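Answer 0 (score: 2)
The specific error message makes it look as though your code somewhere expects a str (so that .lower can be called on it) but is instead receiving a whole array (probably a whole array of strs).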
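To illustrate that mismatch, here is a minimal sketch that uses only NumPy. It does not call scikit-learn; it just mimics the .lower() call from the traceback on one row of the sample data from the question. The names first_doc, error_message, and first_cell are made up for illustration.

```python
import numpy as np

# Sample rows copied from the question's printed data.
data = np.array([
    ["some short english text", "a paragraph of english text"],
    ["some more short english text", "a second paragraph of english text"],
])

# Iterating over a 2-D array yields one row at a time, and each row is
# itself an ndarray. A text vectorizer would treat that row as one "document".
first_doc = data[0]
print(type(first_doc))

# Mimic the preprocessing step from the traceback: call .lower() on the document.
try:
    first_doc.lower()
    error_message = None
except AttributeError as exc:
    error_message = str(exc)
print(error_message)

# A single column, by contrast, yields individual strings that do have .lower().
first_cell = data[:, 0][0]
print(first_cell.lower())
```

If this matches what is happening in the pipeline, each document handed to the vectorizer has to be a single string, not a row of strings.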
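Could you edit the question to describe the data better and post the full traceback, not just the small part that names the error?
In the meantime, you could also try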
data = df[['feature1', 'feature2']].values
and
df['target'].values
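instead of explicitly casting to np.ndarray yourself.
It looks to me as though an array is being created that is 1x1, where the single element of the "array" is itself an ndarray.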
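One way to check that guess is to look at the shapes directly. The sketch below assumes a small DataFrame built from the sample rows in the question (the real df is not shown there) and compares the .values route with np.array. The names via_values, via_array, and single are hypothetical.

```python
import numpy as np
import pandas as pd

# Stand-in for the question's DataFrame, built from its sample output.
df = pd.DataFrame({
    "feature1": ["some short english text",
                 "some more short english text",
                 "some more short english text"],
    "feature2": ["a paragraph of english text",
                 "a second paragraph of english text",
                 "a third paragraph of english text"],
})

via_values = df[["feature1", "feature2"]].values
via_array = np.array(df[["feature1", "feature2"]])

# Both routes give a 2-D array with one row per sample and one column per feature.
print(via_values.shape, via_array.shape)

# Each row of that 2-D array is an ndarray, not a str.
print(type(via_values[0]))

# Selecting a single column gives a 1-D sequence of plain strings.
single = df["feature1"].values
print(single.shape, type(single[0]))
```

On this stand-in data the two routes produce the same 2-D shape, so the thing to look at is whether each element passed to the text step is a string or a whole row.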
Answer 1 (score: 0)
If the text columns share the same encoder/transformer, merge the columns together.
data = np.append(df.feature1, df.feature2)
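Note that np.append joins its two arguments end to end, so the result has twice as many entries as there are rows (and labels). If the goal is one document per row, a row-wise join may be closer to what is wanted. This is a sketch under that assumption, using a stand-in DataFrame built from the question's sample; the space separator and the names appended and combined are my own choices, not something from the question.

```python
import numpy as np
import pandas as pd

# Stand-in for the question's DataFrame, built from its sample output.
df = pd.DataFrame({
    "feature1": ["some short english text",
                 "some more short english text",
                 "some more short english text"],
    "feature2": ["a paragraph of english text",
                 "a second paragraph of english text",
                 "a third paragraph of english text"],
})

# End-to-end concatenation: all of feature1 followed by all of feature2.
appended = np.append(df.feature1, df.feature2)
print(len(appended))

# Row-wise join: one combined string per row, so the length still matches the labels.
combined = (df["feature1"] + " " + df["feature2"]).to_numpy()
print(len(combined))
print(combined[0])
```

The combined array is 1-D and holds one string per sample, which is the same layout as the single-feature case that already works.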