Question

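I am trying to run a logistic regression in Python. My data contains both numeric and categorical columns. I want to predict whether someone likes cats based on their gender, age, and food preference.

I think I need to one-hot encode Food_preference (see below), but I am not sure how to do it. Could you help? Thanks!

Original dataframe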

Name  Gender  Age  Like_cats  Food_preference
John  M       30   Yes        Apple
John  M       30   Yes        Orange
John  M       30   Yes        Steak
Amy   F       20   No         Apple
Amy   F       20   No         Grape

Desired dataframe

Name  Gender  Age  Like_cats  Apple  Orange  Steak  Grape
John  M       30   Yes        1      1       1      0
Amy   F       20   No         1      0       0      1

Answer 1

You can use LabelEncoder to convert string features into numeric features.

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html#sklearn.preprocessing.LabelEncoder
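
As a minimal sketch of what LabelEncoder does, here it is applied to the Gender values from the question. It maps each distinct string to an integer code, with codes assigned in sorted order of the classes:

```python
from sklearn.preprocessing import LabelEncoder

# Gender values copied from the question's original dataframe.
gender = ["M", "M", "M", "F", "F"]

le = LabelEncoder()
codes = le.fit_transform(gender)  # learn the distinct classes, then encode

print(le.classes_.tolist())  # ['F', 'M']
print(codes.tolist())        # [1, 1, 1, 0, 0]
```

Note that the scikit-learn documentation describes LabelEncoder as intended for target values (y) rather than input features. Integer codes also imply an ordering between categories, which may not be what you want for a column such as Food_preference.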

Here is working code on data with the same kind of structure as yours (one string column and one numeric column):

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# Toy data: column 0 is a string feature, column 1 is numeric.
X = pd.DataFrame([['a', 0], ['b', 1], ['a', 5], ['b', 100]])
y = [0, 1, 0, 1]

X_n = []  # one encoded array per column
for c in X.columns:
    if isinstance(X[c].iloc[0], str):  # string feature: encode it as integers
        le = LabelEncoder()
        X_n.append(le.fit_transform(X[c]))
    else:  # feature is already numeric
        X_n.append(X[c].to_numpy())

X_n = np.array(X_n).T  # transpose so each row is a sample and each column a feature
print(X_n)

clf = LogisticRegression(random_state=0).fit(X_n, y)
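
The code above label-encodes each string column but does not produce the one-row-per-person layout the question asks for. One possible way to get there is sketched below, assuming pandas' `get_dummies` followed by a `groupby`. The names `person_cols`, `food_cols`, `wide`, and `Gender_M` are made up for this illustration:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Data copied from the question.
df = pd.DataFrame({
    "Name": ["John", "John", "John", "Amy", "Amy"],
    "Gender": ["M", "M", "M", "F", "F"],
    "Age": [30, 30, 30, 20, 20],
    "Like_cats": ["Yes", "Yes", "Yes", "No", "No"],
    "Food_preference": ["Apple", "Orange", "Steak", "Apple", "Grape"],
})

person_cols = ["Name", "Gender", "Age", "Like_cats"]
# Foods in order of first appearance, to match the column order in the question.
food_cols = df["Food_preference"].unique().tolist()

# One 0/1 indicator column per food; still one row per (person, food) pair.
dummies = pd.get_dummies(df["Food_preference"], dtype=int)

# Collapse to one row per person: max() is 1 if the person listed that food at all.
wide = (
    pd.concat([df[person_cols], dummies], axis=1)
    .groupby(person_cols, sort=False)
    .max()
    .reset_index()
)
wide = wide[person_cols + food_cols]
print(wide)

# Build a numeric feature matrix and a 0/1 target for the model.
X = wide[["Age"] + food_cols].assign(Gender_M=(wide["Gender"] == "M").astype(int))
y = (wide["Like_cats"] == "Yes").astype(int)

# Two rows are far too few for a meaningful model; this only shows that the shapes fit.
clf = LogisticRegression(random_state=0).fit(X, y)
```

Grouping on all four person-level columns assumes they are constant for each person, as they are in the sample data. Name is left out of the feature matrix because it identifies the person rather than describing them.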

Transforming a dataframe for logistic regression (one-hot encoding)
