What causes this strange behavior in ColumnTransformer? [Python / sklearn]

Asked: 2019-10-18 20:49:39

Tags: python scikit-learn

I have some code below whose result I can't explain.

I'm trying to work with ColumnTransformer, but I'm running into problems and can't get the results I expect.

My example is a bit odd, but it's the simplest reproducible one I could come up with. Every time I tried to simplify it further, the problem disappeared, so I apologize if the example is larger than strictly necessary.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
import numpy as np

samples = np.array([[0, 0, 0, 'Education', 7432, 2008.0, np.nan, 25.0, 6.0, 20.0, np.nan, 1019.7, 0.0, 0.0, 0.0, 1.0,
                     0.5406408174555976, 0.8412535328311812, -0.8660254037844385, -0.5000000000000004, 0.0, 1.0, 2016,
                     0.0, 1.0],
                    [9, 0, 0, 'Office', 27000, 2010.0, np.nan, 25.0, 6.0, 20.0, np.nan, 1019.7, 0.0, 0.0, 0.0, 1.0,
                     0.5406408174555976, 0.8412535328311812, -0.8660254037844385, -0.5000000000000004, 0.0, 1.0, 2016,
                     0.0, 1.0],
                    [3, 0, 0, 'Education', 23685, 2002.0, np.nan, 25.0, 6.0, 20.0, np.nan, 1019.7, 0.0, 0.0, 0.0, 1.0,
                     0.5406408174555976, 0.8412535328311812, -0.8660254037844385, -0.5000000000000004, 0.0, 1.0, 2016,
                     0.0, 1.0],
                    [7, 0, 0, 'Education', 121074, 1989.0, np.nan, 25.0, 6.0, 20.0, np.nan, 1019.7, 0.0, 0.0, 0.0, 1.0,
                     0.5406408174555976, 0.8412535328311812, -0.8660254037844385, -0.5000000000000004, 0.0, 1.0, 2016,
                     0.0, 1.0],
                    [4, 0, 0, 'Education', 116607, 1975.0, np.nan, 25.0, 6.0, 20.0, np.nan, 1019.7, 0.0, 0.0, 0.0, 1.0,
                     0.5406408174555976, 0.8412535328311812, -0.8660254037844385, -0.5000000000000004, 0.0, 1.0, 2016,
                     0.0, 1.0],
                    [1, 0, 0, 'Education', 2720, 2004.0, np.nan, 25.0, 6.0, 20.0, np.nan, 1019.7, 0.0, 0.0, 0.0, 1.0,
                     0.5406408174555976, 0.8412535328311812, -0.8660254037844385, -0.5000000000000004, 0.0, 1.0, 2016,
                     0.0, 1.0],
                    [2, 0, 0, 'Education', 5376, 1991.0, np.nan, 25.0, 6.0, 20.0, np.nan, 1019.7, 0.0, 0.0, 0.0, 1.0,
                     0.5406408174555976, 0.8412535328311812, -0.8660254037844385, -0.5000000000000004, 0.0, 1.0, 2016,
                     0.0, 1.0],
                    [6, 0, 0, 'Lodging/residential', 27926, 1981.0, np.nan, 25.0, 6.0, 20.0, np.nan, 1019.7, 0.0, 0.0,
                     0.0, 1.0, 0.5406408174555976, 0.8412535328311812, -0.8660254037844385, -0.5000000000000004, 0.0,
                     1.0, 2016, 0.0, 1.0]])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

scaler = ColumnTransformer(transformers=[("cat", categorical_transformer, [0, 1, 3, 4])])
print(scaler.fit_transform(samples[:5]))
print(scaler.fit_transform(samples[:6]))

For the samples[:5] subset, I get the following result:

[[1. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 1. 1. 0. 1. 0. 0. 0. 1. 0.]
 [0. 1. 0. 0. 0. 1. 1. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0. 1. 1. 0. 0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0. 1. 1. 0. 1. 0. 0. 0. 0.]]

This is what I expected.

However, when I transform the samples[:6] subset, I get:

  (0, 0)    1.0
  (0, 6)    1.0
  (0, 7)    1.0
  (0, 14)   1.0
  (1, 5)    1.0
  (1, 6)    1.0
  (1, 8)    1.0
  (1, 12)   1.0
  (2, 2)    1.0
  (2, 6)    1.0
  (2, 7)    1.0
  (2, 11)   1.0
  (3, 4)    1.0
  (3, 6)    1.0
  (3, 7)    1.0
  (3, 10)   1.0
  (4, 3)    1.0
  (4, 6)    1.0
  (4, 7)    1.0
  (4, 9)    1.0
  (5, 1)    1.0
  (5, 6)    1.0
  (5, 7)    1.0
  (5, 13)   1.0

I have no idea what this data format is. I'm trying to figure out why my scaler returns data like this when I add one extra sample.

1 Answer:

Answer 0 (score: 0)

As mentioned in the comments, it is the same answer, just in sparse representation: the `(row, column)  value` triplets you see are how scipy prints a sparse matrix. ColumnTransformer has a `sparse_threshold` parameter (default 0.3): if the overall density of the stacked output falls below this value, the result is returned as a sparse matrix. With five samples the one-hot output has 20 ones in a 5×13 matrix (density ≈ 0.31, just above the threshold), so you get a dense array; with six samples it has 24 ones in a 6×15 matrix (density ≈ 0.27), so the output flips to sparse. One option is to convert the sparse matrix to a dense array with `todense()` (or `toarray()`).
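The density-dependent flip can be reproduced with toy data (the column values here are made up for illustration): four rows with three distinct categories give one-hot density 4/12 ≈ 0.33, which is not below the default threshold of 0.3, so the output stays dense; four distinct categories give density 4/16 = 0.25, which drops below it, so the output becomes sparse.

```python
import numpy as np
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown='ignore'), [0])],
    sparse_threshold=0.3)  # 0.3 is the default; shown explicitly

# 3 categories over 4 rows: density 4/12 ~ 0.33 >= 0.3 -> dense ndarray
X_three_cats = np.array([['a'], ['b'], ['a'], ['c']])
out1 = ct.fit_transform(X_three_cats)

# 4 categories over 4 rows: density 4/16 = 0.25 < 0.3 -> sparse matrix
X_four_cats = np.array([['a'], ['b'], ['d'], ['c']])
out2 = ct.fit_transform(X_four_cats)

print(type(out1))  # a dense numpy array
print(type(out2))  # a scipy sparse matrix
```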

Another option is to set sparse=False in the OneHotEncoder itself:

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse=False))])

scaler = ColumnTransformer(transformers=[("cat", categorical_transformer, [0, 1, 3, 4])])

Keep an eye on the transformed output, though, since a dense representation carries higher computational and memory costs. If you are working with large datasets, it is advisable to keep the output as a sparse matrix.
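If downstream code must handle both cases, a common pattern is to guard the conversion with `scipy.sparse.issparse` and densify only when actually needed. A minimal sketch (toy data, column values made up for illustration):

```python
import numpy as np
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# 4 distinct categories over 4 rows -> density 0.25, so the output is sparse
X = np.array([['a'], ['b'], ['d'], ['c']])
ct = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown='ignore'), [0])])
result = ct.fit_transform(X)

# Densify only if the transformer actually returned a sparse matrix;
# .toarray() yields an ndarray (.todense() would yield an np.matrix)
if sparse.issparse(result):
    dense = result.toarray()
else:
    dense = result
print(dense)
```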