Question

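Similar: Pipeline doesn't work with Label Encoder

I would like a single object that handles label encoding (LabelEncoder in my case), transformation, and estimation. It is important to me that all of these functions can be performed through one object only.

I tried using a pipeline this way: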

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

# mock training dataset
X = np.random.rand(1000, 100)
y = np.concatenate([["label1"] * 300, ["label2"] * 300, ["label3"] * 400])

le = LabelEncoder()
ss = StandardScaler()
clf = MyClassifier()
pl = Pipeline([('encoder', le),
               ('scaler', ss),
               ('clf', clf)])
pl.fit(X, y)

Which gives:

File "sklearn/pipeline.py", line 581, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
TypeError: fit_transform() takes exactly 2 arguments (3 given)

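Clarifications:

X and y are my training dataset, X being the values and y the target labels.
X is a numpy.ndarray of shape (n_sample, n_features) and of type float, with values ranging from 0 to 1.
y is a numpy.ndarray of shape (n_sample,) and of type string.
I expect LabelEncoder to encode y, not X.
I need y only for the final estimator, and I need it encoded as integers for MyClassifier to work.

After some thought, and facing the error above, I feel it was naive to think that Pipeline could handle this. I found that Pipeline handles my transformer and classifier together very well, but the label encoding part fails.

What is the correct way to achieve what I want? By correct, I mean something that allows reuse and keeps some consistency with Pipeline. Is there a class in the sklearn library that does what I need?

I am surprised I did not find an answer by browsing the web, because I feel that what I am doing is not uncommon. I am probably missing something here.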

Answer 1

I believe this is not possible.

First, all transformers inherit from sklearn.base.TransformerMixin. Its fit_transform method takes X and an optional y argument, but returns only X_new. scikit-learn was not designed with this kind of transformation in mind.

Second, LabelEncoder will fail in a pipeline because its fit and transform methods accept only a single argument, y, rather than X, y.
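The signature mismatch described above can be checked directly. The following is a minimal sketch, assuming a reasonably recent scikit-learn install; it only inspects the method signatures and does not reproduce the Pipeline error itself.

```python
import inspect

from sklearn.preprocessing import LabelEncoder, StandardScaler

# Parameter names of each bound fit_transform method (self is excluded).
encoder_params = list(inspect.signature(LabelEncoder().fit_transform).parameters)
scaler_params = list(inspect.signature(StandardScaler().fit_transform).parameters)

print(encoder_params)  # the target y is the only parameter
print(scaler_params)   # X comes first, followed by an optional y
```

Pipeline calls fit_transform(X, y) on every intermediate step, so a step whose method accepts only y receives one positional argument too many.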

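In the end, I wrote a function that does a lookup in a mapping from string labels to integer labels. At least that way the transformation is done in code and can be tracked with version control.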
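The answer does not show its lookup function, so the following is only a sketch of what such a helper might look like. The names LABEL_TO_INT, encode_labels and decode_labels are hypothetical, and the label values are taken from the mock dataset in the question.

```python
# Hypothetical mapping; the label names come from the question's mock dataset.
LABEL_TO_INT = {"label1": 0, "label2": 1, "label3": 2}
INT_TO_LABEL = {code: label for label, code in LABEL_TO_INT.items()}


def encode_labels(labels):
    """Map string labels to integers; raises KeyError on an unknown label."""
    return [LABEL_TO_INT[label] for label in labels]


def decode_labels(codes):
    """Map integer codes back to their string labels."""
    return [INT_TO_LABEL[code] for code in codes]


encoded = encode_labels(["label1", "label3", "label2"])
print(encoded)
print(decode_labels(encoded))
```

Keeping the mapping as a module-level constant means any change to the label set shows up as a diff in version control.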

Answer 2

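Vivek Kumar wrote in a comment:

When you call clf.fit(), LabelEncoder is called on y automatically, so you don't need to worry about it. y can hold integers or strings as classes, and these are handled correctly by the estimators in scikit. So there is no need to include LabelEncoder in the pipeline to work on y.

Here is the solution to my problem: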

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# mock training dataset
X = np.random.rand(1000, 100)
y = np.concatenate([["label1"] * 300, ["label2"] * 300, ["label3"] * 400])

ss = StandardScaler()
clf = MyClassifier()  # my own classifier
pl = Pipeline([('scaler', ss),
               ('clf', clf)])
pl.fit(X, y)

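The only difference is that pl.predict(X) now returns an array of strings containing the values "label1", "label2" or "label3" (which makes sense, since that is what we need).

If needed, the LabelEncoder that sklearn.pipeline uses automatically can be recovered as follows: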

from sklearn.preprocessing import LabelEncoder
# LabelEncoder() takes no constructor arguments, so fit it on the learned classes
le = LabelEncoder().fit(pl.classes_)

This gives me a copy of the label encoder used by the pipeline pl.
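Since MyClassifier is not shown, here is a runnable sketch of the same idea with LogisticRegression swapped in as a stand-in (an assumption, not the classifier used above). It fits the pipeline on string labels, predicts, and rebuilds an equivalent encoder from the learned classes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Small mock dataset with string labels, in the spirit of the question.
rng = np.random.default_rng(0)
X = rng.random((300, 5))
y = np.concatenate([["label1"] * 100, ["label2"] * 100, ["label3"] * 100])

# LogisticRegression is a stand-in for MyClassifier (an assumption).
pl = Pipeline([("scaler", StandardScaler()),
               ("clf", LogisticRegression(max_iter=1000))])
pl.fit(X, y)

pred = pl.predict(X[:5])               # string labels, not integers
le = LabelEncoder().fit(pl.classes_)   # equivalent encoder built from the learned classes
codes = le.transform(pred)             # integer codes for the predictions
print(pl.classes_, pred, codes)
```

Pipeline exposes classes_ by delegating to its final step, so this only works when that step is a fitted classifier with a classes_ attribute.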

Answer 3

I implemented the categorical encoding with pandas, and I used SGDClassifier as the classifier. Your code above calls MyClassifier(), but it is not defined in the code itself.

The output is the fitted pipeline object:

import numpy as np
import pandas as pd
# from sklearn.preprocessing import LabelEncoder # No longer used
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier

X = np.random.randn(1000, 10)

y_initial = np.concatenate([["label1"] * 300, ["label2"] * 300, ["label3"] * 400])

df = pd.DataFrame({'y':y_initial})
df['y'] = df['y'].astype('category') # Categorical dtype; its .cat.codes match LabelEncoder's integer output

ss = StandardScaler()
clf = SGDClassifier()

y = df['y']

pl = Pipeline([('scaler', ss),
               ('clf', clf)])

pl.fit(X,y)
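To see how the pandas categorical column relates to LabelEncoder, the sketch below compares the integer codes behind the category dtype with the encoder's output. It assumes the default category ordering (sorted unique values), which is what astype('category') produces for plain strings.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

y_initial = np.concatenate([["label1"] * 3, ["label2"] * 3, ["label3"] * 4])

# The categorical column still displays strings; the integers live in .cat.codes.
y_cat = pd.Series(y_initial).astype("category")
pandas_codes = y_cat.cat.codes.to_numpy()

# LabelEncoder also assigns integers in sorted order of the class names.
sklearn_codes = LabelEncoder().fit_transform(y_initial)

print(pandas_codes)
print(sklearn_codes)
```

The two arrays hold the same values, although pandas typically stores its codes in a smaller integer dtype.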

Handle label encoding, transformation and estimation in one object
