I'm working through an example that uses ColumnTransformer and LabelEncoder to preprocess the well-known Titanic dataset X:
    Age Embarked     Fare     Sex
0  22.0        S   7.2500    male
1  38.0        C  71.2833  female
2  26.0        S   7.9250  female
3  35.0        S  53.1000  female
4  35.0        S   8.0500    male
Calling the transformer like this:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder
ColumnTransformer(
    transformers=[
        ("label-encode categorical", LabelEncoder(), ["Sex", "Embarked"])
    ]
).fit(X).transform(X)
results in:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-54-fd5a05b7e47e> in <module>
4 ("label-encode categorical", LabelEncoder(), ["Sex", "Embarked"])
5 ]
----> 6 ).fit(X).transform(X)
~/anaconda3/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py in fit(self, X, y)
418 # we use fit_transform to make sure to set sparse_output_ (for which we
419 # need the transformed data) to have consistent output type in predict
--> 420 self.fit_transform(X, y=y)
421 return self
422
~/anaconda3/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py in fit_transform(self, X, y)
447 self._validate_remainder(X)
448
--> 449 result = self._fit_transform(X, y, _fit_transform_one)
450
451 if not result:
~/anaconda3/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py in _fit_transform(self, X, y, func, fitted)
391 _get_column(X, column), y, weight)
392 for _, trans, column, weight in self._iter(
--> 393 fitted=fitted, replace_strings=True))
394 except ValueError as e:
395 if "Expected 2D array, got 1D array instead" in str(e):
~/anaconda3/lib/python3.7/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
915 # remaining jobs.
916 self._iterating = False
--> 917 if self.dispatch_one_batch(iterator):
918 self._iterating = self._original_iterator is not None
919
~/anaconda3/lib/python3.7/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self, iterator)
757 return False
758 else:
--> 759 self._dispatch(tasks)
760 return True
761
~/anaconda3/lib/python3.7/site-packages/sklearn/externals/joblib/parallel.py in _dispatch(self, batch)
714 with self._lock:
715 job_idx = len(self._jobs)
--> 716 job = self._backend.apply_async(batch, callback=cb)
717 # A job can complete so quickly than its callback is
718 # called before we get here, causing self._jobs to
~/anaconda3/lib/python3.7/site-packages/sklearn/externals/joblib/_parallel_backends.py in apply_async(self, func, callback)
180 def apply_async(self, func, callback=None):
181 """Schedule a func to be run"""
--> 182 result = ImmediateResult(func)
183 if callback:
184 callback(result)
~/anaconda3/lib/python3.7/site-packages/sklearn/externals/joblib/_parallel_backends.py in __init__(self, batch)
547 # Don't delay the application, to avoid keeping the input
548 # arguments in memory
--> 549 self.results = batch()
550
551 def get(self):
~/anaconda3/lib/python3.7/site-packages/sklearn/externals/joblib/parallel.py in __call__(self)
223 with parallel_backend(self._backend, n_jobs=self._n_jobs):
224 return [func(*args, **kwargs)
--> 225 for func, args, kwargs in self.items]
226
227 def __len__(self):
~/anaconda3/lib/python3.7/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0)
223 with parallel_backend(self._backend, n_jobs=self._n_jobs):
224 return [func(*args, **kwargs)
--> 225 for func, args, kwargs in self.items]
226
227 def __len__(self):
~/anaconda3/lib/python3.7/site-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, **fit_params)
612 def _fit_transform_one(transformer, X, y, weight, **fit_params):
613 if hasattr(transformer, 'fit_transform'):
--> 614 res = transformer.fit_transform(X, y, **fit_params)
615 else:
616 res = transformer.fit(X, y, **fit_params).transform(X)
TypeError: fit_transform() takes 2 positional arguments but 3 were given
What is the problem with **fit_params here? To me this looks like a bug in sklearn, or at least an incompatibility.
Answer 0 (score: 2)
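I think this is actually an issue with LabelEncoder. The LabelEncoder.fit method accepts only self and y as arguments (which is odd, since most transformer objects follow the fit(X, y=None, **fit_params) paradigm). In any case, inside a pipeline the transformer is called with fit_params regardless of what you passed. In this particular case, LabelEncoder.fit_transform receives X and y (None here) as positional arguments, plus an empty fit_params dictionary {} that unpacks to nothing. Since the method takes only self and y, the extra positional argument raises the error.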
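From my point of view this is a bug in LabelEncoder, but you should take it up with the sklearn maintainers, since they may have reasons for implementing the fit method differently.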
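The signature mismatch described above can be reproduced without ColumnTransformer. The following is a minimal sketch, assuming a tiny made-up list `sex` in place of the real column; it inspects the parameters of LabelEncoder.fit_transform and then calls it the way the pipeline helper does, with X and y as two positional arguments.

```python
import inspect

from sklearn.preprocessing import LabelEncoder

# Parameter names of LabelEncoder.fit_transform: there is no X, only y.
label_params = list(inspect.signature(LabelEncoder.fit_transform).parameters)
print(label_params)

# Hypothetical stand-in for the "Sex" column.
sex = ["male", "female", "female"]

# Mimic the call made in _fit_transform_one: transformer.fit_transform(X, y)
try:
    LabelEncoder().fit_transform(sex, None)
    raised = False
except TypeError as exc:
    raised = True
    print(exc)  # same kind of message as at the bottom of the traceback
```

The TypeError therefore comes from the extra positional argument, not from the contents of fit_params.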
Answer 1 (score: 2)
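There are two main reasons why this will not work for your purpose.
1. LabelEncoder() is designed to be used on the target variable (y). That is why you get the positional argument error when ColumnTransformer tries to feed it X, y=None, fit_params={}. From the documentation:
Encode target labels with value between 0 and n_classes-1.
fit(y)
Fit label encoder.
Parameters:
y : array-like of shape (n_samples,)
Target values.
2. LabelEncoder() also cannot take a 2D array (basically multiple features at once), because it expects only 1D y values.
Short answer: we should not use LabelEncoder() for input features.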
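The second point can be checked directly. This is a small sketch, assuming two NumPy arrays typed in by hand from the five rows shown in the question: a single 1D column encodes fine, while two columns stacked into a 2D array are rejected.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Values copied from the five sample rows in the question.
sex = np.array(["male", "female", "female", "female", "male"])
embarked = np.array(["S", "C", "S", "S", "S"])

# A 1D array works: classes are sorted, so female -> 0 and male -> 1.
encoded_sex = LabelEncoder().fit_transform(sex)
print(encoded_sex)

# Two columns at once form a (5, 2) array, which LabelEncoder refuses.
two_columns = np.column_stack([sex, embarked])
try:
    LabelEncoder().fit_transform(two_columns)
    failed_on_2d = False
except ValueError as exc:
    failed_on_2d = True
    print(exc)
```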
Now, what is the solution for encoding input features?
Use OrdinalEncoder() if the features are ordinal, or OneHotEncoder() if they are nominal.
Example:
>>> import numpy as np
>>> from sklearn.compose import ColumnTransformer
>>> from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
>>> ct = ColumnTransformer(
...     [("ordinal", OrdinalEncoder(), [0, 1]),
...      ("nominal", OneHotEncoder(), [2, 3])])
>>> X = np.array([[0., 1., 'apple', 'green'],
...               [1., 1., 'orange', 'blue']])
>>> ct.fit_transform(X)
array([[0., 0., 1., 0., 0., 1.],
       [1., 0., 0., 1., 1., 0.]])
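Applied to the data from the question, the same idea might look like the sketch below. The DataFrame is rebuilt by hand from the five rows shown there, and the choices of OneHotEncoder for both columns, remainder="passthrough", and sparse_threshold=0 (to force a dense array) are assumptions made for illustration rather than part of the original answer.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# The five sample rows from the question, typed in by hand.
X = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Embarked": ["S", "C", "S", "S", "S"],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Sex": ["male", "female", "female", "female", "male"],
})

ct = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(), ["Sex", "Embarked"])],
    remainder="passthrough",  # keep Age and Fare unchanged
    sparse_threshold=0,       # always return a dense array
)
encoded = ct.fit_transform(X)

# Columns: Sex=female, Sex=male, Embarked=C, Embarked=S, Age, Fare
print(encoded.shape)
print(encoded[0])
```

The encoded columns come first, followed by the passthrough columns in their original order.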
Answer 2 (score: 0)
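It is called a label encoder because it is meant to be used with the labels of the dataset, i.e. the y values. This class confused me until I realized that.
The naming is confusing, though, because in the literature we either one-hot encode or label encode our features. In that sense, sklearn is not friendly to newcomers.
Use OrdinalEncoder instead, which is designed to work with features.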
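As a rough illustration of that suggestion, here is a sketch that runs OrdinalEncoder on a hand-typed 2D array holding the Sex and Embarked values from the question. Unlike LabelEncoder, it accepts several feature columns at once and keeps one category list per column.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Sex and Embarked values from the question's five rows, as a (5, 2) array.
features = np.array([
    ["male", "S"],
    ["female", "C"],
    ["female", "S"],
    ["female", "S"],
    ["male", "S"],
])

enc = OrdinalEncoder()
codes = enc.fit_transform(features)

print(codes)            # one integer-valued code per cell
print(enc.categories_)  # one sorted category array per column
```

By default the codes follow the sorted order of the categories; if the features have a meaningful order, it can be passed explicitly through the categories parameter.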