Question

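I'm trying to understand how to use categorical data as features in sklearn.linear_model's LogisticRegression.

I understand, of course, that I need to encode it.

What I don't understand is how to pass the encoded feature to the logistic regression so that it is processed as a categorical feature, rather than having the int value it received during encoding interpreted as a standard quantifiable feature.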
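(Less important) Can somebody explain the difference between using preprocessing.LabelEncoder(), DictVectorizer.vocabulary_, or just encoding the categorical data yourself with a simple dict? Alex A.'s comment here touches on the subject, but not very deeply.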

Especially the first one!

Answer 1

You can create indicator variables for the different categories. For example:

animal_names = {'mouse';'cat';'dog'}

Indicator_cat = strcmp(animal_names,'cat')
Indicator_dog = strcmp(animal_names,'dog')

Then we have:

                [0                         [0
Indicator_cat =  1        Indicator_dog =   0
                 0]                         1]

You can concatenate these onto your original data matrix:

X_with_indicator_vars = [X, Indicator_cat, Indicator_dog]
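
Since the question is about scikit-learn, here is a rough Python equivalent of the MATLAB lines above, as a minimal sketch using NumPy. The numeric matrix X and its values are made up for illustration; only the indicator construction mirrors the answer.

```python
import numpy as np

# Hypothetical data: one categorical column and one numeric feature column.
animal_names = np.array(["mouse", "cat", "dog"])
X = np.array([[0.5], [1.2], [3.4]])

# Elementwise comparison plays the role of strcmp: 1 where the name matches.
indicator_cat = (animal_names == "cat").astype(int)
indicator_dog = (animal_names == "dog").astype(int)

# Append the indicators as extra columns; "mouse" is the category left
# without an indicator of its own.
X_with_indicator_vars = np.column_stack([X, indicator_cat, indicator_dog])
print(X_with_indicator_vars)
```

Each indicator is then an ordinary 0/1 column, so a model sees "is a cat" and "is a dog" as separate features rather than one integer code.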

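Remember, though, to leave one category without an indicator if a constant term is included in your data matrix! Otherwise, your data matrix won't have full column rank (or, in econometric terms, you have multicollinearity).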

[1  1  0  0         Notice how a constant term, an indicator for mouse,
 1  0  1  0         an indicator for cat, and an indicator for dog
 1  0  0  1]        lead to a less than full column rank matrix:
                    the first column is the sum of the last three.
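
The rank argument can be checked numerically. The sketch below uses a made-up sample of six animals (an assumption, so that the matrix has more rows than columns) and compares the rank with and without the mouse indicator.

```python
import numpy as np

# Hypothetical sample with two observations per category.
animals = np.array(["mouse", "cat", "dog", "cat", "dog", "mouse"])
categories = ["mouse", "cat", "dog"]

# One indicator column per category, in the order listed above.
indicators = np.column_stack([(animals == c).astype(float) for c in categories])
const = np.ones((len(animals), 1))

# Constant plus all three indicators: 4 columns, but they are linearly dependent.
full = np.hstack([const, indicators])
rank_full = np.linalg.matrix_rank(full)

# Constant plus cat and dog only: mouse becomes the reference category.
reduced = np.hstack([const, indicators[:, 1:]])
rank_reduced = np.linalg.matrix_rank(reduced)

print(full.shape[1], rank_full)        # number of columns vs. rank
print(reduced.shape[1], rank_reduced)  # number of columns vs. rank
```

With all three indicators the rank stays below the number of columns; after dropping one indicator the two numbers match.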

Answer 2

The standard approach to converting categorical features into numerical ones is one-hot encoding (OneHotEncoder).
These are completely different classes:

DictVectorizer.vocabulary_

A dictionary mapping feature names to feature indices.

That is, after fit(), DictVectorizer has seen all possible feature names, so it knows in which particular column it will place a particular value of a feature. So DictVectorizer.vocabulary_ contains the indices of the features, but not their values.

LabelEncoder, in contrast, maps each possible label (a label can be a string or an integer) to some integer value, and returns a 1D vector of these integer values.
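
To make the contrast concrete, here is a small sketch that runs the three encoders on the same toy data. The animal names and weights are invented for illustration; the classes and attributes are the scikit-learn ones named in this answer.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical toy data.
animals = ["mouse", "cat", "dog", "cat"]
weights = [0.02, 4.0, 20.0, 5.0]

# LabelEncoder: one integer per label, returned as a 1D array.
# Labels are sorted, so cat -> 0, dog -> 1, mouse -> 2.
le = LabelEncoder()
codes = le.fit_transform(animals)
print(codes)  # [2 0 1 0]

# OneHotEncoder: one 0/1 column per category (expects a 2D input).
ohe = OneHotEncoder()
X_onehot = ohe.fit_transform([[a] for a in animals]).toarray()
print(ohe.categories_[0])  # ['cat' 'dog' 'mouse']

# DictVectorizer: takes one dict per sample; string values are one-hot
# encoded under names like "animal=cat", numeric values are passed through.
dv = DictVectorizer(sparse=False)
rows = [{"animal": a, "weight": w} for a, w in zip(animals, weights)]
X_dict = dv.fit_transform(rows)

# vocabulary_ only says which column each feature name lives in.
print(dv.vocabulary_)
# {'animal=cat': 0, 'animal=dog': 1, 'animal=mouse': 2, 'weight': 3}
```

The LabelEncoder output is a single column of integer codes, which a linear model would read as an ordered quantity; the other two produce separate indicator columns.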

Answer 3

Suppose the dtype of each categorical variable is "object". First, you can create a pandas Index of the categorical column names:

import pandas as pd    
catColumns = df.select_dtypes(['object']).columns

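Then, you can create the indicator variables using the for loop below. For binary categorical variables, use LabelEncoder() to convert them to 0 and 1. For categorical variables with more than two categories, use pd.get_dummies() to obtain the indicator variables and then drop one category (to avoid the multicollinearity issue).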

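from sklearn import preprocessing

le = preprocessing.LabelEncoder()

for col in catColumns:
    n = len(df[col].unique())
    if n > 2:
        # More than two categories: build indicator columns and drop the
        # first one as the reference category (avoids multicollinearity).
        # The prefix keeps column names unique across variables.
        X = pd.get_dummies(df[col], prefix=col)
        X = X.drop(X.columns[0], axis=1)
        df = pd.concat([df, X], axis=1)
        df.drop(col, axis=1, inplace=True)  # drop the original categorical variable (optional)
    else:
        # Binary variable: LabelEncoder maps its two labels to 0 and 1.
        le.fit(df[col])
        df[col] = le.transform(df[col])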
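
Once the categorical columns are indicator columns, they can be passed to LogisticRegression like any other numeric feature. As an alternative to the manual loop, the sketch below does the encoding inside a scikit-learn pipeline. The DataFrame, its column names, and the target are all hypothetical, and this is one possible arrangement rather than the approach of any of the answers.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one categorical feature, one numeric feature, one target.
df = pd.DataFrame({
    "animal": ["mouse", "cat", "dog", "cat", "dog", "mouse", "cat", "dog"],
    "weight": [0.02, 4.0, 20.0, 5.0, 25.0, 0.03, 3.5, 18.0],
    "is_pet": [0, 1, 1, 1, 1, 0, 1, 1],
})
features = df[["animal", "weight"]]
target = df["is_pet"]

# One-hot encode "animal", dropping the first category as the reference;
# pass the numeric "weight" column through unchanged.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(drop="first"), ["animal"])],
    remainder="passthrough",
)

# The logistic regression receives two 0/1 indicator columns plus the weight.
model = make_pipeline(preprocess, LogisticRegression())
model.fit(features, target)
pred = model.predict(features)
```

Keeping the encoder in the pipeline means the same category-to-column mapping learned in fit() is reused when predicting on new data.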

Using categorical data as features in sklearn LogisticRegression

3 answers: