Question

from sklearn.preprocessing import LabelEncoder as LE, OneHotEncoder as OHE
import numpy as np

a = np.array([[0,1,100],[1,2,200],[2,3,400]])


# Note: categorical_features was removed in scikit-learn 0.22, so this call needs an older release
oh = OHE(categorical_features=[0,1])
a = oh.fit_transform(a).toarray()

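Assume the first and second columns are categorical. This code performs one-hot encoding, but for a regression problem I want to drop the first column of each categorical feature after encoding. In this example there are only two, so I could do it by hand. But how would you handle this if you had many categorical features?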

Answer 1

For this I use a wrapper like the following, which can also be used in a pipeline:

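from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder

class DummyEncoder(BaseEstimator, TransformerMixin):

    def __init__(self, n_values='auto'):
        # n_values belongs to the pre-0.22 OneHotEncoder API
        self.n_values = n_values

    def fit(self, X, y=None, **fit_params):
        # Fit the encoder once, so transform() reuses the same levels on new data
        self.ohe_ = OneHotEncoder(sparse=False, n_values=self.n_values)
        self.ohe_.fit(X)
        return self

    def transform(self, X):
        # Drops only the last encoded column, which suits a single categorical feature
        return self.ohe_.transform(X)[:,:-1]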
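
For readers on a newer scikit-learn, here is a minimal sketch of the same idea without a custom class. Since release 0.21, OneHotEncoder accepts drop='first', which removes the first level of every encoded feature. This is not part of the answer above: the names cat, rest and result are made up for illustration, and the categorical columns are assumed to be the first two.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

a = np.array([[0, 1, 100], [1, 2, 200], [2, 3, 400]])

# Assumption: columns 0 and 1 are categorical, column 2 is numeric.
cat, rest = a[:, :2], a[:, 2:]

# drop='first' removes the first level of each categorical feature.
encoder = OneHotEncoder(drop='first')
encoded = encoder.fit_transform(cat).toarray()  # sparse by default, so densify

# Put the untouched numeric column back on the right.
result = np.hstack([encoded, rest])
print(result)
```

Each three-level feature ends up with two indicator columns, so result has five columns instead of seven.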

Answer 2

You can use numpy's indexing and slice off the first column:

>>> a
array([[   1.,    0.,    0.,    1.,    0.,    0.,  100.],
       [   0.,    1.,    0.,    0.,    1.,    0.,  200.],
       [   0.,    0.,    1.,    0.,    0.,    1.,  400.]])
>>> a[:, 1:]
array([[   0.,    0.,    1.,    0.,    0.,  100.],
       [   1.,    0.,    0.,    1.,    0.,  200.],
       [   0.,    1.,    0.,    0.,    1.,  400.]])

If you have a list of columns to delete, proceed as follows:

>>> idx_to_delete = [0, 3]
>>> indices = [i for i in range(a.shape[-1]) if i not in idx_to_delete]
>>> indices
[1, 2, 4, 5, 6]
>>> a[:, indices]
array([[   0.,    0.,    0.,    0.,  100.],
       [   1.,    0.,    1.,    0.,  200.],
       [   0.,    1.,    0.,    1.,  400.]])
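
As a possible shortcut, np.delete removes the listed columns in a single call. The sketch below reuses the encoded array from above; the name trimmed is only illustrative.

```python
import numpy as np

a = np.array([[1., 0., 0., 1., 0., 0., 100.],
              [0., 1., 0., 0., 1., 0., 200.],
              [0., 0., 1., 0., 0., 1., 400.]])

idx_to_delete = [0, 3]

# axis=1 deletes columns; np.delete returns a new array and leaves `a` untouched.
trimmed = np.delete(a, idx_to_delete, axis=1)
print(trimmed.shape)
```

The result should match the a[:, indices] output shown above.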

Answer 3

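To automate this, we obtain the list of indices to drop before applying one-hot encoding, by identifying the most common level of each categorical feature. The most common level serves best as the base level, since it allows the importance of the other levels to be evaluated.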

After applying one-hot encoding, we build the list of indices to keep and use it to drop the columns identified earlier.

from sklearn.preprocessing import OneHotEncoder as OHE
import numpy as np
import pandas as pd

a = np.array([[0,1,100],[1,2,200],[2,3,400]])

def get_indices_to_drop(X_before_OH, categorical_indices_list):
    # Returns the list of column indices to drop after one-hot encoding.
    # The most common level of each categorical variable is dropped,
    # because it serves best as the base level and allows the
    # importance of the other levels to be evaluated.
    indices_to_drop = []
    indices_accum = 0
    for i in categorical_indices_list:
        levels = np.unique(X_before_OH[:,i])
        most_common = pd.Series(X_before_OH[:,i]).value_counts().index[0]
        # Offset of the most common level inside this variable's block of columns
        indices_to_drop.append(indices_accum + int(np.searchsorted(levels, most_common)))
        # Each variable contributes one encoded column per level
        indices_accum += len(levels)
    return indices_to_drop

indices_to_drop = get_indices_to_drop(a, [0, 1])

oh = OHE(categorical_features=[0,1])
a = oh.fit_transform(a).toarray()

def get_indices_to_keep(X_after_OH, index_to_drop_list):
    return [i for i in range(X_after_OH.shape[-1]) if i not in index_to_drop_list]

indices_to_keep = get_indices_to_keep(a, indices_to_drop)
a = a[:, indices_to_keep]
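
To make the index arithmetic easier to check, here is a small numpy-only sketch of the same idea. It builds the indicator columns by hand instead of calling OneHotEncoder, so the toy data and the names cat, blocks, drop_idx and reduced are assumptions for illustration only.

```python
import numpy as np

# Toy data (made up): both columns are categorical.
cat = np.array([[0, 1], [1, 1], [1, 3], [1, 2]])

blocks, drop_idx, offset = [], [], 0
for j in range(cat.shape[1]):
    levels, counts = np.unique(cat[:, j], return_counts=True)
    # One indicator column per level, in sorted level order.
    blocks.append((cat[:, [j]] == levels).astype(float))
    # Position of the most frequent level inside the full encoded matrix.
    drop_idx.append(offset + int(np.argmax(counts)))
    # The next variable's block starts after all levels of this one.
    offset += len(levels)

encoded = np.hstack(blocks)
reduced = np.delete(encoded, drop_idx, axis=1)
print(drop_idx)
```

The offset grows by the full number of levels of each variable, because the columns are only removed after every block has been laid out.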

Answer 4

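This is one of the limitations of the one-hot encoder in sklearn when building models. If you have several categorical variables, the best approach is to first use LabelEncoder to identify the unique labels of each categorical variable, and then use them to generate the indices to delete. For example, if your data is in a numpy array X and the columns FIRST_IDX, SECOND_IDX and THIRD_IDX contain categorical variables, first encode them with LabelEncoder.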

labelencoder_X_1 = LabelEncoder()
X[:, FIRST_IDX] = labelencoder_X_1.fit_transform(X[:, FIRST_IDX])

labelencoder_X_2 = LabelEncoder()
X[:, SECOND_IDX] = labelencoder_X_2.fit_transform(X[:, SECOND_IDX])

labelencoder_X_3 = LabelEncoder()
X[:, THIRD_IDX] = labelencoder_X_3.fit_transform(X[:, THIRD_IDX])

Then apply OneHotEncoder, which places the representation of all categorical variables at the beginning of the array, one after another.

onehotencoder = OneHotEncoder(categorical_features=[FIRST_IDX, SECOND_IDX, THIRD_IDX])

X = onehotencoder.fit_transform(X).toarray()

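Finally, eliminate the first entry of each categorical variable by using the number of unique values of each variable and numpy's cumulative sum (here, the cumulative sums are the indices of the first entry of each categorical variable).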

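# Start index of each variable's block of columns; the grand total is left out
# because it would point at the first column after the encoded blocks
index_to_delete = np.cumsum([0,
               len(labelencoder_X_1.classes_),
               len(labelencoder_X_2.classes_)
               ])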
index_to_keep = [i for i in range(X.shape[1]) if i not in index_to_delete]

X = X[:, index_to_keep]

X now contains data ready to be used in any modeling task.
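
A short worked sketch of the cumulative-sum step, with made-up level counts, may help. One detail worth checking: the final cumulative sum is the index of the first column after the encoded blocks, so it should not end up in the delete list. The names n_levels, block_starts and n_columns are hypothetical.

```python
import numpy as np

# Hypothetical level counts for three categorical variables.
n_levels = [3, 2, 4]

# Start index of each block: 0, 3, 5. The total (9) is left out because it
# points at the first column after the encoded blocks.
block_starts = np.cumsum([0] + n_levels[:-1])

# Assume one extra numeric column after the encoded blocks, for illustration.
n_columns = sum(n_levels) + 1
index_to_keep = [i for i in range(n_columns) if i not in block_starts]
print(index_to_keep)
```

Column 9, the numeric column in this sketch, stays in index_to_keep.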

Dropping columns with sklearn's OneHotEncoder
