Question

With one-hot encoding, where each row of a column holds a single value (say, a "color" column), pandas get_dummies does the job:

df = pd.DataFrame({'f1': ['red', 'yellow']})
df
Out[24]: 
       f1
0     red
1  yellow

pd.get_dummies(df)
Out[25]: 
   f1_red  f1_yellow
0       1          0
1       0          1

A "many-hot encoding" problem arises when a cell may hold a list of colors, as in the following example:

df = pd.DataFrame({'f1': ['red', ['yellow', 'blue']]})
df
Out[27]: 
               f1
0             red
1  [yellow, blue]

Is there an elegant, smart, Pythonic way (hopefully supported in pandas) to get the following result:

   f1_red  f1_yellow  f1_blue
0       1          0        0
1       0          1        1

Answer 1

You can join each list with | and then use str.get_dummies:

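# Join list values into one '|'-delimited string; leave plain strings as they are
s = df['f1'].apply(lambda x: '|'.join(x) if isinstance(x, list) else x)

# str.get_dummies splits on '|' by default
df = s.str.get_dummies()
print(df)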

   blue  red  yellow
0     0    1       0
1     1    0       1
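
For anyone who wants the f1_ prefixed column names from the question, here is a minimal self-contained sketch of the same approach. The helper name multi_hot_str is made up for illustration, and the sketch assumes that no label contains the separator character.

```python
import pandas as pd


def multi_hot_str(series, prefix, sep="|"):
    """Hypothetical helper: multi-hot encode a column of strings or lists of strings.

    Assumes no label contains `sep`.
    """
    # Join list cells into one delimited string; leave scalar strings alone.
    joined = series.apply(lambda x: sep.join(x) if isinstance(x, list) else x)
    # str.get_dummies splits on `sep` and builds one indicator column per label.
    dummies = joined.str.get_dummies(sep=sep)
    # Prepend the source column name, as pd.get_dummies would.
    return dummies.add_prefix(prefix + "_")


df = pd.DataFrame({"f1": ["red", ["yellow", "blue"]]})
encoded = multi_hot_str(df["f1"], "f1")
print(encoded)
```

The columns come back in alphabetical order (f1_blue, f1_red, f1_yellow); select them in a different order afterwards if the layout from the question matters.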

Another solution, if performance matters:

from sklearn.preprocessing import MultiLabelBinarizer

# Wrap scalar values in a list so that every row holds a list of labels
s = df['f1'].apply(lambda x: x if isinstance(x, list) else [x])

mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(s), columns=mlb.classes_)
print(df)
   blue  red  yellow
0     0    1       0
1     1    0       1
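
If the frame has other columns or a non-default index, the encoded columns probably need to be lined up with the original rows. The following is a rough sketch of one way to do that with the same MultiLabelBinarizer approach. The extra column "other" and the index labels "a" and "b" are invented for the example.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# "other" and the index labels are made-up illustration data.
df = pd.DataFrame(
    {"f1": ["red", ["yellow", "blue"]], "other": [10, 20]},
    index=["a", "b"],
)

# Wrap scalar cells so every row is a list of labels.
labels = df["f1"].apply(lambda x: x if isinstance(x, list) else [x])

mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(
    mlb.fit_transform(labels),
    columns=["f1_" + c for c in mlb.classes_],
    index=df.index,  # reuse the original row labels so the join lines up
)

# Replace the list column with its indicator columns.
result = df.drop(columns="f1").join(encoded)
print(result)
```

Passing index=df.index matters here: without it the new frame gets a fresh 0..n-1 index and the join would not match the rows labelled "a" and "b".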
