Question

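I have a dataframe with two categorical columns that contain the same set of strings, and I want to one-hot encode them. The set of strings the columns can contain is known in advance, and the one-hot encoding must be consistent between the two columns. Both columns contain all possible values, possibly more than once.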

In the example below, I fit the encoder on a list containing the set of strings the columns can contain, then transform the dataframe's columns.

Question 1: Does this make sense?

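Question 2: How can I give the columns returned by one-hot encoding the two columns different names? Right now I can add the columns to the dataframe, but they have the same names. Is that a problem? How can I avoid it?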

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

#list of values
all_stuff = ['Boat','Bike']

#create dataframe
data = {'Stuff': ['Bike', 'Boat'], 'More Stuff': ['Boat', 'Bike']}
index = range(len(data['Stuff']))
columns = ['Stuff','More Stuff']
df = pd.DataFrame(data,  index=index, columns=columns)
df

#label encoder
label_encoder = LabelEncoder()
label_encoder.fit(all_stuff)
df['Stuff'] = label_encoder.transform(df['Stuff'])
df

df['More Stuff'] = label_encoder.transform(df['More Stuff'])
df

#one-hot encoding on first column (fit and transform)
enc = OneHotEncoder(handle_unknown='ignore')
stuff_cols = enc.fit(df['Stuff'].values.reshape(-1, 1))

stuff_cols = enc.transform(df['Stuff'].values.reshape(-1, 1)).toarray()
stuff_cols

df = pd.concat([df, pd.DataFrame(stuff_cols, columns=enc.get_feature_names_out())], axis=1)
df

#one hot enc on second column (ONLY transform)
more_stuff_cols = enc.transform(df['More Stuff'].values.reshape(-1, 1)).toarray()
more_stuff_cols

df = pd.concat([df, pd.DataFrame(more_stuff_cols, columns=enc.get_feature_names_out())], axis=1)
df

#the column names are the same!!

Answer 1

I think you can use the pandas get_dummies function for this:

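import pandas as pd

df = pd.DataFrame({'Stuff': ['Bike', 'Boat'], 'More Stuff': ['Boat', 'Bike']})
#dtype=int keeps the 0/1 output shown below (newer pandas defaults to booleans)
pd.get_dummies(df, dtype=int)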

Output:

   Stuff_Bike  Stuff_Boat  More Stuff_Bike  More Stuff_Boat
0           1           0                0                1
1           0           1                1                0
