I would like to know how to one-hot encode a column that contains lists of strings.
I am trying to get from df to df2:
import pandas as pd
# This is the original data frame
df = pd.DataFrame({'menu': [['Italian', 'Greek'], ['Japanese'],
['Italian','Greek', 'Japanese']], 'price': ['$$', '$$', '$']})
df.head()
# This is the desired result
df2 = pd.DataFrame({'menu': [['Italian', 'Greek'], ['Japanese'],
['Italian','Greek', 'Japanese']],
'price': ['$$', '$$', '$'],
'Italian': [1,0,1],
'Greek': [1,0,1],
'Japanese': [0,1,1]
})
df2.head()
Answer 0 (score: 5)
Use MultiLabelBinarizer together with join:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = df.join(pd.DataFrame(mlb.fit_transform(df['menu']), columns=mlb.classes_))
print(df)
menu price Greek Italian Japanese
0 [Italian, Greek] $$ 1 1 0
1 [Japanese] $$ 0 0 1
2 [Italian, Greek, Japanese] $ 1 1 1
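One caveat worth keeping in mind: join aligns rows on the index, and a frame built directly from mlb.fit_transform gets a fresh 0..n-1 index. That matches the example df, but it would not match a df with any other index. A minimal sketch, assuming df carries a non-default index (the labels 'a', 'b', 'c' are made up for illustration), passes index=df.index so the rows still line up:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame(
    {'menu': [['Italian', 'Greek'], ['Japanese'],
              ['Italian', 'Greek', 'Japanese']],
     'price': ['$$', '$$', '$']},
    index=['a', 'b', 'c'],  # hypothetical non-default index, for illustration only
)

mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(
    mlb.fit_transform(df['menu']),  # 0/1 matrix, one column per distinct label
    columns=mlb.classes_,           # labels in sorted order
    index=df.index,                 # reuse df's index so join aligns row by row
)
result = df.join(encoded)
print(result)
```

With the default RangeIndex of the question's df, this should behave the same as the one-liner above.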
Answer 1 (score: 4)
You can use pd.get_dummies, Series.apply, DataFrame.join and Series.stack:
# sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
df.join(pd.get_dummies(df.menu.apply(pd.Series).stack()).groupby(level=0).sum())
Output:
menu price Greek Italian Japanese
0 [Italian, Greek] $$ 1 1 0
1 [Japanese] $$ 0 0 1
2 [Italian, Greek, Japanese] $ 1 1 1
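As a variant of the same idea (not the answerer's code), Series.explode can stand in for the apply(pd.Series).stack() step. The sketch below assumes pandas 0.25 or later, where explode is available, and uses the variable names items, dummies and result purely for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    {'menu': [['Italian', 'Greek'], ['Japanese'],
              ['Italian', 'Greek', 'Japanese']],
     'price': ['$$', '$$', '$']}
)

# One row per menu item; each row keeps the index label of its source row
items = df['menu'].explode()

# One indicator column per distinct item, then collapse the repeated
# index labels back to a single row per original record
dummies = pd.get_dummies(items).groupby(level=0).sum()

result = df.join(dummies)
print(result)
```

Because the grouping is done on the original index labels, the join lines up with df without any extra index handling.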