One-hot encoding multiple columns in sklearn and naming the columns

Time: 2019-03-18 20:09:42

Tags: python python-3.x pandas scikit-learn one-hot-encoding

I have the following code to one-hot encode two of my columns.

# encode city labels using one-hot encoding scheme
city_ohe = OneHotEncoder(categories='auto')
city_feature_arr = city_ohe.fit_transform(df[['city']]).toarray()
city_feature_labels = city_ohe.categories_
city_features = pd.DataFrame(city_feature_arr, columns=city_feature_labels)

phone_ohe = OneHotEncoder(categories='auto')
phone_feature_arr = phone_ohe.fit_transform(df[['phone']]).toarray()
phone_feature_labels = phone_ohe.categories_
phone_features = pd.DataFrame(phone_feature_arr, columns=phone_feature_labels)

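What I'd like to know is how to do this in 4 lines of code while still getting properly named columns in the output. That is, I can create a correct one-hot encoded array by passing both column names to fit_transform, but when I try to name the columns of the resulting DataFrame, it tells me the shapes of the values and the indices do not match: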

ValueError: Shape of passed values is (6, 50000), indices imply (3, 50000)

For background, both phone and city have 3 values.

    city   phone
0   CityA  iPhone
1   CityB  Android
2   CityB  iPhone
3   CityA  iPhone
4   CityC  Android

3 Answers:

Answer 0: (score: 2)

You're almost there... As you said, you can pass all the columns you want to encode to fit_transform directly.

ohe = OneHotEncoder(categories='auto')
feature_arr = ohe.fit_transform(df[['phone','city']]).toarray()
feature_labels = ohe.categories_

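Then you just need to flatten the labels (flattening this way relies on every column having the same number of categories, as is the case here):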

feature_labels = np.array(feature_labels).ravel()

This lets you name the columns as needed:

features = pd.DataFrame(feature_arr, columns=feature_labels)
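
As a hedged variation on the steps above: `categories_` is a list holding one array per encoded column, so the labels can also be flattened with a plain comprehension. This does not depend on the columns having equal category counts, and prefixing each label with its column name keeps the names unique. The sketch below uses only the five sample rows from the question, and the `column_category` naming scheme is an assumption chosen for illustration, not something the answer prescribes.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Sample frame from the question (only the five rows shown there).
df = pd.DataFrame({
    "city": ["CityA", "CityB", "CityB", "CityA", "CityC"],
    "phone": ["iPhone", "Android", "iPhone", "iPhone", "Android"],
})

cols = ["phone", "city"]
ohe = OneHotEncoder(categories="auto")
feature_arr = ohe.fit_transform(df[cols]).toarray()

# categories_ has one array per input column, in the same order as cols.
# Flatten it and prefix each category with its column name (assumed scheme).
feature_labels = [
    f"{col}_{cat}"
    for col, cats in zip(cols, ohe.categories_)
    for cat in cats
]

features = pd.DataFrame(feature_arr, columns=feature_labels, index=df.index)
print(features.columns.tolist())
```

In this sample, phone has two observed values and city has three, so the frame ends up with five columns, one per (column, category) pair.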

Answer 1: (score: 1)

Why not take a look at pd.get_dummies? Here is how to do the encoding:

df['city'] = df['city'].astype('category')
df['phone'] = df['phone'].astype('category')
df = pd.get_dummies(df)
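
To show what this produces, here is a minimal sketch run on the five sample rows from the question. As far as I know, `pd.get_dummies` already encodes string columns by default, so the `astype('category')` conversion is skipped here; that omission is my assumption, not part of the answer.

```python
import pandas as pd

# Sample frame from the question (only the five rows shown there).
df = pd.DataFrame({
    "city": ["CityA", "CityB", "CityB", "CityA", "CityC"],
    "phone": ["iPhone", "Android", "iPhone", "iPhone", "Android"],
})

# Each string column is replaced by one indicator column per distinct value,
# named "<column>_<value>".
encoded = pd.get_dummies(df)
print(encoded.columns.tolist())
```

The column names come out already prefixed with the source column, which is the naming the question was after.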

Answer 2: (score: 0)

cat_features = [
    "gender", "cholesterol", "gluc", "smoke", "alco"
]

data = pd.get_dummies(data, columns = cat_features)
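
One likely reason for passing `columns` explicitly: by default `pd.get_dummies` only encodes object, string, and category columns, so integer-coded categoricals are left alone unless they are named. The sketch below uses a small hypothetical frame; the values and the `age` column are invented for illustration and do not come from the answer's dataset.

```python
import pandas as pd

# Hypothetical data: integer-coded categoricals plus a plain numeric column.
data = pd.DataFrame({
    "age": [50, 55, 48],
    "gender": [1, 2, 1],
    "cholesterol": [1, 3, 2],
})

# Without columns=, integer columns are treated as numeric and kept as-is.
default = pd.get_dummies(data)

# With columns=, the named columns are encoded regardless of dtype.
explicit = pd.get_dummies(data, columns=["gender", "cholesterol"])

print(default.columns.tolist())
print(explicit.columns.tolist())
```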