Question

I have the following dataset, in which the Direccion del viento (Pos) column holds categorical values.

In total, Direccion del viento (Pos) has 8 categories:

SO-Sur oeste
SE-Sur este
S-Sur
N-Norte
NO-Nor oeste
NE-Nor este
O-Oeste
E-Este

I then convert this dataframe to a numpy array and get:

direccion_viento_pos
dtype: bool
[['S']
 ['S']
 ['S']
 ...
 ['SO']
 ['NO']
 ['SO']]

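Because these are string values and I want numeric ones, I need to encode the categorical variable, that is, turn the text into numbers.

I then perform two steps:

I use LabelEncoder() to encode the values as numbers according to the categories I have.

Label encoding simply converts each value in a column to a number: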

from sklearn.preprocessing import LabelEncoder
labelencoder_direccion_viento_pos = LabelEncoder()
direccion_viento_pos[:, 0] = labelencoder_direccion_viento_pos.fit_transform(direccion_viento_pos[:, 0])
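
As a minimal sketch of what this step produces, the snippet below runs LabelEncoder on a small made-up sample (the values in `sample` are an assumption, not the real dataset). It shows that the integer codes are just indices into the sorted list of labels stored in `classes_`.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of wind directions (not the real dataset)
sample = np.array(["S", "S", "SO", "NO", "E"])

encoder = LabelEncoder()
codes = encoder.fit_transform(sample)

# classes_ holds the sorted unique labels; each code is an index into it
print(encoder.classes_)
print(codes)

# inverse_transform maps the integer codes back to the original strings
restored = encoder.inverse_transform(codes)
print(restored)
```

Because the codes follow alphabetical order of the labels, they carry no meaning about the wind directions themselves.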

I use OneHotEncoding to turn each category value into a new column and assign a 1 or 0 (true/false) value to that column:

That is:

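from sklearn.preprocessing import OneHotEncoder
# Note: categorical_features was deprecated in scikit-learn 0.20 and removed in 0.22
onehotencoder = OneHotEncoder(categorical_features = [0])
direccion_viento_pos = onehotencoder.fit_transform(direccion_viento_pos).toarray()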

This works, because I get these new values:

direccion_viento_pos
array([[0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])

I then convert this direccion_viento_pos array to a dataframe so I can inspect it more easily:

# Turn array to dataframe with columns indexes
cols = ['E', 'N', 'NE', 'NO', 'O', 'S', 'SE', 'SO']
df_direccion_viento = pd.DataFrame(direccion_viento_pos, columns=cols)

This gives me one new column per category value, with a 1 or 0 (true/false) value assigned in that column.

If I use the pandas.get_dummies() function, I get the same result.
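
For comparison, here is a short sketch of the pandas.get_dummies() route on a made-up sample (the five values are an assumption for illustration only). It produces one indicator column per category present in the data.

```python
import pandas as pd

# Hypothetical sample; the real data would come from 'Direccion del viento.csv'
df = pd.DataFrame({"Direccion del viento (Pos)": ["S", "S", "SO", "NO", "E"]})

# One column per category found in the data, filled with 1.0 / 0.0
dummies = pd.get_dummies(df["Direccion del viento (Pos)"], dtype=float)
print(dummies)
```

Note that get_dummies only creates columns for categories that actually occur in the input, so a category missing from a sample simply has no column.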

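My question is: is this the best way to handle these categorical variables? When a machine learning algorithm is applied, could having one column per category, with several of those columns mostly zero, bias the data or add noise?

I recently started reading about this in this article, but I would appreciate any guidance.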

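Update

I have been reading about newer ways of handling categorical variables like the one above, and I found the following:

In this Jupyter notebook of exercises (cell number 59) from the book Hands-on Machine Learning with Scikit-Learn and TensorFlow, the author says the following about LabelEncoder:

Warning: earlier versions of the book used the LabelEncoder class or Pandas' Series.factorize() method to encode string categorical attributes as integers. However, the OrdinalEncoder class that is planned to be introduced in Scikit-Learn 0.20 (see PR #10521) is preferable, since it is designed for input features (X instead of labels y)

This means LabelEncoder is meant for encoding the dependent variable, not input features. My direccion_viento categorical variable is an input feature.

Originally, the scikit-learn 0.20 development version had a CategoricalEncoder class. I copied this class into a categorical_encoder.py file and applied it: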

from __future__ import unicode_literals
import pandas as pd

# I import the Categorical Encoder locally from my project environment
from notebooks.DireccionDelViento.sklearn.preprocessing.categorical_encoder import CategoricalEncoder

# Read the dataset
direccion_viento = pd.read_csv('Direccion del viento.csv', )

# No null values
print(direccion_viento.isnull().any())
direccion_viento.isnull().values.any()

# We select only the first  Direccion Viento (pos) column
direccion_viento = direccion_viento[['Direccion del viento (Pos)']]

encoder = CategoricalEncoder(encoding='onehot-dense', handle_unknown='ignore')
dir_viento_encoder = encoder.fit_transform(direccion_viento[['Direccion del viento (Pos)']])
print(" These are the categories", encoder.categories_)

cols = ['E', 'N', 'NE', 'NO', 'O', 'S','SE','SO']
df_direccion_viento = pd.DataFrame(dir_viento_encoder, columns=cols)

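The resulting dataset is similar to the one I got with LabelEncoding followed by OneHotEncoding.

The difference between using OneHotEncoder() and CategoricalEncoder() is that with CategoricalEncoder() I don't have to apply LabelEncoder() first, because CategoricalEncoder handles strings directly and I don't need to convert the values to integers beforehand.

This means CategoricalEncoder is equivalent to OneHotEncoder, or at least that applying them gives practically the same result...

Regarding the CategoricalEncoder() class: after reading and searching, I found that Aurélien Géron says in his book that CategoricalEncoder will be deprecated in the stable scikit-learn 0.20 release.

In fact, the scikit-learn team notes in their current master branch that CategoricalEncoder():

CategoricalEncoder briefly existed in 0.20dev. Its functionality has been rolled into the OneHotEncoder and OrdinalEncoder.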

This pull request, titled "Rethinking the CategoricalEncoder API?", also describes the workflow for deprecating CategoricalEncoder().
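
As a rough sketch of what that consolidation means in practice, and assuming scikit-learn 0.20 or later, OneHotEncoder can be fitted on the string column directly without a LabelEncoder step. The sample values below are made up. The sketch calls `.toarray()` on the sparse result rather than asking for dense output, because the name of the dense-output parameter has changed between scikit-learn releases.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample; the real data would come from 'Direccion del viento.csv'
df = pd.DataFrame({"Direccion del viento (Pos)": ["S", "S", "SO", "NO", "E"]})

# handle_unknown='ignore' mirrors the CategoricalEncoder call used earlier
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(df).toarray()

# Column names are taken from the fitted categories instead of a hand-written list
cols = list(encoder.categories_[0])
df_encoded = pd.DataFrame(encoded, columns=cols)
print(df_encoded)

# A category unseen during fit is encoded as an all-zero row
unseen = encoder.transform(
    pd.DataFrame({"Direccion del viento (Pos)": ["N"]})
).toarray()
print(unseen)
```

Reading the column names from `categories_` avoids a mismatch between a hard-coded `cols` list and the categories actually present in the data.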

Following the above, I applied OrdinalEncoder, and the result is the same as when I applied only LabelEncoder:

from __future__ import unicode_literals
# from .future_encoders import OrdinalEncoder
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

# Read the dataset
direccion_viento = pd.read_csv('Direccion del viento.csv', )

# No null values
print(direccion_viento.isnull().any())
direccion_viento.isnull().values.any()

# We select only the first column Direccion Viento (pos)
direccion_viento = direccion_viento[['Direccion del viento (Pos)']]
print(direccion_viento.head(10))

ordinal_encoder = OrdinalEncoder()
direccion_viento_cat_encoded = ordinal_encoder.fit_transform(direccion_viento)

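I then get this array, which is similar to what I got with LabelEncoder():

What is the difference between OrdinalEncoder and LabelEncoder, taking their descriptions as a reference:

LabelEncoder() simply encodes values as numbers according to the categories I have. Label encoding just converts each value in a column to a number.

and

OrdinalEncoder: Encode categorical features as an integer array. The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features. The features are converted to ordinal integers. This results in a single column of integers (0 to n_categories - 1) per feature.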
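
To make the two descriptions concrete, here is a small sketch on made-up values (the `values` list is an assumption). Both encoders assign the same integer to each category; the practical difference is the expected input shape: LabelEncoder takes a 1-D array of labels (y), while OrdinalEncoder takes a 2-D feature matrix (X) and can encode several columns at once.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical sample of wind directions
values = ["S", "S", "SO", "NO", "E"]

# LabelEncoder works on a 1-D array of labels (the target y)
label_codes = LabelEncoder().fit_transform(values)
print(label_codes.shape, label_codes)

# OrdinalEncoder works on a 2-D array of shape (n_samples, n_features)
X = np.array(values).reshape(-1, 1)
ordinal_codes = OrdinalEncoder().fit_transform(X)
print(ordinal_codes.shape, ordinal_codes.ravel())
```
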

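Can I choose either the dataset produced by OneHotEncoding or the one produced by OrdinalEncoder? Which is more appropriate?

I think it is necessary to distinguish between ordinal and nominal features. Ordinal features can be understood as categorical values that can be sorted or ordered.

Sebastian Raschka gives this example regarding categorical data:

For example, T-shirt size would be an ordinal feature, because we can define an order XL > L > M. In contrast, nominal features don't imply any order and, to continue with the previous example, we could think of T-shirt color as a nominal feature, since it typically doesn't make sense to say that red is larger than blue.

My direccion_viento values ('E', 'N', 'NE', 'NO', 'O', 'S', 'SE', 'SO') have no order, and no value is greater or less than another. So it makes no sense to treat them as ordinal, right?

In this sense, so far I think OneHotEncoding is the best option for my direccion_viento input feature.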
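
One way to see the concern, sketched here with plain NumPy: with integer codes, some pairs of directions end up numerically "closer" than others purely because of alphabetical order, whereas with one-hot vectors every pair of distinct categories is the same distance apart. The helper names `ordinal_distance` and `one_hot_distance` are hypothetical, introduced only for this illustration.

```python
import numpy as np

cols = ['E', 'N', 'NE', 'NO', 'O', 'S', 'SE', 'SO']

# Integer code per category, in the same alphabetical order an encoder would use
ordinal = {c: float(i) for i, c in enumerate(cols)}

# One-hot vector per category: row i of the identity matrix
identity = np.eye(len(cols))
one_hot = {c: identity[i] for i, c in enumerate(cols)}


def ordinal_distance(a, b):
    """Absolute difference between the integer codes of two categories."""
    return abs(ordinal[a] - ordinal[b])


def one_hot_distance(a, b):
    """Euclidean distance between the one-hot vectors of two categories."""
    return float(np.linalg.norm(one_hot[a] - one_hot[b]))


# With integer codes, 'E' looks much closer to 'N' than to 'SO'
print(ordinal_distance('E', 'N'), ordinal_distance('E', 'SO'))

# With one-hot vectors, both pairs are equally far apart
print(one_hot_distance('E', 'N'), one_hot_distance('E', 'SO'))
```

This matters most for models that rely on distances or on linear weights over the raw feature values, such as clustering and linear regression.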

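Someone previously told me the following:

It depends on what you intend to do with the data. There are various ways to work with categorical variables. You need to choose the most appropriate one by checking whether the approach you take suits the model or situation you are working on.

I will be using models such as clustering, linear regression, and neural networks.

How can I know whether OrdinalEncoder or OneHotEncoder is the most appropriate?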

Answer 1

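In short: yes, this is a common way to transform categorical variables.

As to whether this method introduces more noise: the amount of information is the same, so on its own it has no effect. If you are worried about columns that now contain only 0 values, that is a matter of data and sampling quality. If you have no instances of (for example) Este, the algorithm will simply ignore that column; in that case you may want to find some instances to include.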
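
A quick way to check for that situation, sketched with pandas on a made-up sample that deliberately lacks some directions (the `observed` values are an assumption): declare the full list of eight categories, one-hot encode, and list the columns that never receive a 1.

```python
import pandas as pd

all_categories = ['E', 'N', 'NE', 'NO', 'O', 'S', 'SE', 'SO']

# Hypothetical sample that contains no 'E', 'NE', 'O' or 'SE' readings
observed = pd.Series(["S", "S", "SO", "NO", "N"])

# Declaring the categories up front keeps a column even for unseen values
as_categorical = pd.Series(pd.Categorical(observed, categories=all_categories))
dummies = pd.get_dummies(as_categorical, dtype=float)

# Columns whose total is 0 correspond to categories with no instances
totals = dummies.sum(axis=0).to_numpy()
empty_columns = [str(name) for name, total in zip(dummies.columns, totals) if total == 0]
print(empty_columns)
```
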

You may also want to search Google for "imbalanced classes".

Answer 2

Try CatBoost (https://catboost.ai, https://github.com/catboost/catboost), a gradient boosting library that handles categorical features.

Handling categorical variables - looking for recommendations
