Question

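I'm using OneHotEncoder to encode one of the categorical features in my data, but I can't understand one of its parameters. Could you help me understand what it does? The parameter is: categorical_features = [0]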

from sklearn.preprocessing import OneHotEncoder
onehotencoder = OneHotEncoder(categorical_features= [0])
X = onehotencoder.fit_transform(X).toarray()

Answer 1

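The purpose of OneHotEncoder is to encode categorical integer features as a one-hot numeric array. As stated in the docs, the categorical_features parameter is used to:

Specify what features are treated as categorical.

This is useful when we want to feed all features (both categorical and numerical) directly to the encoder and tell it which subset to treat as categorical. Here is an example of how to use it: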

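import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'col1':[4,5,6], 'col2':[1,2,3]})

# Boolean mask: one-hot encode col1 only (needs scikit-learn < 0.22)
onehotencoder = OneHotEncoder(categorical_features=[True, False])
onehotencoder.fit_transform(df.values).toarray()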

array([[1., 0., 0., 1.],
       [0., 1., 0., 2.],
       [0., 0., 1., 3.]])

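In this case we passed a mask indicating which features should be one-hot encoded, here the first one. The second column is left as it is and appears as the rightmost column of the output. categorical_features also accepts an array of indices, so categorical_features=[0] yields the same result.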
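To make the equivalence between the mask form and the index form concrete, here is a minimal NumPy sketch. It only illustrates how the two forms select the same columns. It is not scikit-learn's internal implementation, and the variable names are made up for this example.

```python
import numpy as np

# Same toy data as in the example above.
X = np.array([[4, 1], [5, 2], [6, 3]])

mask = np.array([True, False])  # boolean mask, one entry per column
indices = np.array([0])         # explicit column indices

# A boolean mask can be converted to the equivalent index array.
indices_from_mask = np.flatnonzero(mask)

# Both forms pick out the same column(s) of X.
selected_by_mask = X[:, mask]
selected_by_indices = X[:, indices]

print(indices_from_mask)
print(np.array_equal(selected_by_mask, selected_by_indices))
```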

Answer 2

This parameter is explained in detail in the documentation:

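Specify what features are treated as categorical.

'all': All features are treated as categorical.

array of indices: Array of categorical feature indices.

mask: Array of length n_features and with dtype=bool.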

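By the way, this parameter is deprecated.

Deprecated since version 0.20: The categorical_features keyword was deprecated in version 0.20 and will be removed in 0.22. You can use the ColumnTransformer instead.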
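As a rough sketch of what the ColumnTransformer replacement could look like, the snippet below one-hot encodes column 0 and passes the other column through unchanged. The toy data and the transformer name "onehot" are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy data: column 0 is categorical, column 1 is numeric.
X = np.array([[4, 1], [5, 2], [6, 3]])

# One-hot encode column 0 and pass the remaining columns through unchanged.
ct = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(), [0])],
    remainder="passthrough",
)

encoded = ct.fit_transform(X)
# The result can be sparse or dense depending on its density; densify if needed.
if hasattr(encoded, "toarray"):
    encoded = encoded.toarray()

print(encoded)
```

The encoded columns come first and the passed-through column is appended on the right, which mirrors the layout the old categorical_features=[0] call produced.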

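What are features in machine learning (ML)? They are the independent variables: the inputs, or our measurements. The number of features is something we define. We one-hot encode the features to indicate that they are independent.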

Answer 3

categorical_features : 'all' or array of indices or mask, default='all'
    Specify what features are treated as categorical.
    - 'all': All features are treated as categorical.
    - array of indices: Array of categorical feature indices.
    - mask: Array of length n_features and with dtype=bool.
    Non-categorical features are always stacked to the right of the matrix.
    .. deprecated:: 0.20
        The `categorical_features` keyword was deprecated in version
        0.20 and will be removed in 0.22.
        You can use the ``ColumnTransformer`` instead.
Attributes
----------
categories_ : list of arrays
    The categories of each feature determined during fitting
    (in order of the features in X and corresponding with the output
    of ``transform``). This includes the category specified in ``drop``
    (if any).
drop_idx_ : array of shape (n_features,)
    ``drop_idx_[i]`` is the index in ``categories_[i]`` of the category to
    be dropped for each feature. None if all the transformed features will
    be retained.
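The attributes described in the docstring above can be inspected after fitting. The following is a small sketch, assuming a scikit-learn version that supports the drop parameter. The data and the variable names are invented for this example.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A single categorical column with three distinct values.
X = np.array([[4], [5], [6], [4]])

# Default encoder: one output column per category.
enc = OneHotEncoder()
encoded = enc.fit_transform(X).toarray()
print(enc.categories_)  # one array of categories per input feature

# With drop="first", the first category of each feature is dropped.
enc_drop = OneHotEncoder(drop="first")
encoded_drop = enc_drop.fit_transform(X).toarray()
print(enc_drop.drop_idx_)  # index into categories_[i] of the dropped category
```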

If you want to dig deeper into this, you can refer to this GitHub link.

One-hot encoder parameters
