Question

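I am using pandas and sklearn in Python, and I am trying out the new and very convenient sklearn-pandas.

I have a large data frame and need to transform multiple columns in a similar way.

I have several column names in the variable other. The documentation explicitly states that it is possible to transform multiple columns with the same transformation, but the following code does not behave as expected: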

from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn_pandas import DataFrameMapper

mapper = DataFrameMapper([[other[0],other[1]],LabelEncoder()])
mapper.fit_transform(df.copy())

I get the following error:

raise ValueError("bad input shape {0}".format(shape))
ValueError: ['EFW', 'BPD']: bad input shape (154, 2)

When I use the following code instead, it works fine:

cols = [(other[i], LabelEncoder()) for i,col in enumerate(other)]
mapper = DataFrameMapper(cols)
mapper.fit_transform(df.copy())

As I understand it, both should work and produce the same result. What am I doing wrong here?

Thanks!

Answer 1

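The problem you are running into is that the two snippets build completely different data structures.

cols = [(other[i], LabelEncoder()) for i,col in enumerate(other)] builds a list of tuples. Note that you can shorten this line to: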

cols = [(col, LabelEncoder()) for col in other]

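In any case, the first snippet, [[other[0],other[1]],LabelEncoder()], produces a list with two elements: a list and a LabelEncoder instance. The documentation does say that you can transform multiple columns, by specifying them as follows:

Transformations may require multiple input columns. In these cases, the column names can be specified in a list: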


mapper2 = DataFrameMapper([
    (['children', 'salary'], sklearn.decomposition.PCA(1))
])

This is a list whose elements are tuples of the form (list, object), not a flat list of the form [list, object].
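
The difference can be checked without sklearn-pandas at all. The sketch below uses a placeholder object in place of a real transformer (an assumption made purely for illustration) and a hypothetical helper, is_valid_features, that tests only the outer shape described here. It is not the library's own validation.

```python
def is_valid_features(features):
    """Return True if `features` is a list whose items are all tuples
    of the form (column selector, transformer[, options])."""
    return isinstance(features, list) and all(
        isinstance(item, tuple) and len(item) in (2, 3) for item in features
    )


transformer = object()  # placeholder standing in for e.g. LabelEncoder()

# Shape from the question: a list holding a list and a transformer.
wrong = [["EFW", "BPD"], transformer]

# Shape from the documentation: a list holding one (list, transformer) tuple.
right = [(["EFW", "BPD"], transformer)]

print(is_valid_features(wrong))  # False
print(is_valid_features(right))  # True
```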

If we take a look at the source code itself,

class DataFrameMapper(BaseEstimator, TransformerMixin):
    """
    Map Pandas data frame column subsets to their own
    sklearn transformation.
    """

    def __init__(self, features, default=False, sparse=False, df_out=False,
                 input_df=False):
        """
        Params:
        features    a list of tuples with features definitions.
                    The first element is the pandas column selector. This can
                    be a string (for one column) or a list of strings.
                    The second element is an object that supports
                    sklearn's transform interface, or a list of such objects.
                    The third element is optional and, if present, must be
                    a dictionary with the options to apply to the
                    transformation. Example: {'alias': 'day_of_week'}

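The class definition also states explicitly that the features argument of DataFrameMapper must be a list of tuples, where the elements of each tuple may themselves be lists.

One last point, about why you actually get the error message: the LabelEncoder transformer in sklearn is meant for encoding labels in a 1D array. It therefore cannot handle 2 columns at once and will raise an exception. So if you want to use LabelEncoder, you have to build N tuples, each holding one column name and a transformer, where N is the number of columns you want to transform.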
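
A minimal sketch of building those N tuples follows. The helper name per_column_features and its make_transformer argument are hypothetical, introduced only for illustration; the point is that each column gets its own freshly created transformer rather than sharing one instance.

```python
def per_column_features(columns, make_transformer):
    """Build one (column_name, transformer) tuple per column.

    `make_transformer` is any zero-argument callable, e.g. the
    LabelEncoder class itself, so every column gets a new instance.
    """
    return [(col, make_transformer()) for col in columns]


# Stand-in factory (dict) so the sketch runs without scikit-learn installed.
features = per_column_features(["EFW", "BPD"], dict)
print([col for col, _ in features])      # ['EFW', 'BPD']
print(features[0][1] is features[1][1])  # False: separate instances
```

With scikit-learn and sklearn-pandas installed, passing LabelEncoder as the factory, as in DataFrameMapper(per_column_features(other, LabelEncoder)), should build the same structure as the list comprehension shown earlier.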
