Question

I'm working with the Titanic dataset.

I want to apply LabelBinarizer to several columns, and I'd like to avoid a for loop.

I tried using a lambda function, but it doesn't work:

from sklearn.preprocessing import LabelBinarizer 

pp = LabelBinarizer()

X = df['sex', 'embarked', 'alive']
df.apply(lambda X: pp.fit_transform())

and:

df[['sex','embarked','alive']]= df[['sex','embarked','alive']].apply(lambda x: pp.fit_transform(x))

Can anyone point me in the right direction?

Answer 1

I think the problem is that, because you are assigning to three columns on the left-hand side, sklearn gets confused.
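
To make the shape issue concrete, here is a minimal sketch using toy lists in place of the real Titanic columns (the values are illustrative assumptions, not the actual data). LabelBinarizer returns a 2-D array: one column for a two-class input, and one column per class otherwise, which is awkward to write back into a single DataFrame column.

```python
from sklearn.preprocessing import LabelBinarizer

# Toy stand-ins for two Titanic columns (illustrative values only).
sex = ["male", "female", "female", "male"]
embarked = ["S", "C", "Q", "S"]

binary_out = LabelBinarizer().fit_transform(sex)
multi_out = LabelBinarizer().fit_transform(embarked)

print(binary_out.shape)  # two classes: a single 0/1 column
print(multi_out.shape)   # three classes: one indicator column per class
```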

Alternative

But as @unutbu said, there is no practical difference between df.apply and a for loop, so I would just use this:

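# LabelBinarizer yields one output column per class, so this single-column
# assignment is only suited to columns with two classes.
for col in ['sex','embarked','alive']:
    df[col] = pp.fit_transform(df[col])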
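
If some of the columns have more than two classes, one possible variation of the loop is sketched below. It assumes a hypothetical miniature frame and a hypothetical helper named binarize_columns (neither comes from the original answer), and it expands multi-class columns into one indicator column per class instead of assigning back to a single column.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Hypothetical miniature frame standing in for the Titanic data.
df = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "embarked": ["S", "C", "Q", "S"],
    "age": [22, 38, 26, 35],
})

def binarize_columns(frame, columns):
    """Replace each listed column with its LabelBinarizer indicator column(s)."""
    out = frame.copy()
    for col in columns:
        lb = LabelBinarizer()
        encoded = lb.fit_transform(out[col])
        if encoded.shape[1] == 1:
            # Binary column: a single 0/1 vector, keep the original name.
            out[col] = encoded.ravel()
        else:
            # Multi-class column: one new column per class, e.g. embarked_C.
            names = [f"{col}_{cls}" for cls in lb.classes_]
            block = pd.DataFrame(encoded, columns=names, index=out.index)
            out = pd.concat([out.drop(columns=col), block], axis=1)
    return out

encoded_df = binarize_columns(df, ["sex", "embarked"])
print(encoded_df)
```

The loop is still there, but it is hidden behind a single call, and the original frame is left untouched because the helper works on a copy.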

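However, if you really want a one-liner, this is how you would do it (warning: massive overkill):

(Formatting note: the fit, transform, and fit_transform methods must be indented to the same level as the def __init__ method.)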

from sklearn.preprocessing import LabelBinarizer

class MultiColumnLabelBinarizer:
    def __init__(self, columns=None):
        self.columns = columns  # list of column names to encode

    def fit(self, X, y=None):
        return self  # nothing to learn up front; fitting happens in transform

    def transform(self, X):
        '''
        Transforms columns of X specified in self.columns using
        LabelBinarizer(). If no columns specified, transforms all
        columns in X.

        Note: LabelBinarizer yields one output column per class, so the
        single-column assignment below is only suited to two-class columns.
        '''
        output = X.copy()
        if self.columns is not None:
            for col in self.columns:
                output[col] = LabelBinarizer().fit_transform(output[col])
        else:
            # DataFrame.iteritems() was removed in pandas 2.0; items() replaces it
            for colname, col in output.items():
                output[colname] = LabelBinarizer().fit_transform(col)
        return output

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

df = MultiColumnLabelBinarizer(columns=['embarked', 'alive']).fit_transform(df)
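
If the underlying goal is simply indicator columns without writing a loop, pandas' own get_dummies may be enough. This swaps in a different technique from LabelBinarizer and is only a sketch on a hypothetical miniature frame: note that get_dummies produces one column per category even for two-class columns (pass drop_first=True to drop the first level).

```python
import pandas as pd

# Hypothetical miniature frame standing in for the Titanic data.
df = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "embarked": ["S", "C", "Q", "S"],
    "age": [22, 38, 26, 35],
})

# One call encodes every listed column; untouched columns are kept as-is.
dummies = pd.get_dummies(df, columns=["sex", "embarked"], dtype=int)
print(dummies)
```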

Source: Label encoding across multiple columns in scikit-learn

Scikit-learn preprocessing: LabelBinarizer with a lambda function