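I'm trying to one-hot encode my dataframe. It is multi-dimensional (some cells hold lists of values), and I don't know how to do this. The dataframe might look like this: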
df = pd.DataFrame({'menu': [['Italian', 'Greek'], ['Japanese'], ['Italian','Greek', 'Japanese']], 'price': ['$$', '$$', '$'], 'location': [['NY', 'CA','MI'], 'CA', ['NY', 'CA','MA']]})
The output I want looks like this:
df2 = pd.DataFrame({'menu': [[1,1,0], [0,0,1], [1,1,1]], 'price': [[1,0], [1,0], [0,1]], 'location': [[1,1,1,0], [0,1,0,0], [1,1,0,1]]})
I'm not sure how to do this with pd.get_dummies or scikit-learn. Can anyone help?
Answer 0 (score: 4)
You can use:
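# wrap scalar values in one-item lists so that every cell holds a list
# (on pandas >= 2.1, DataFrame.map replaces the deprecated applymap)
df = df.applymap(lambda x: x if isinstance(x, list) else [x])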
print (df)
location menu price
0 [NY, CA, MI] [Italian, Greek] [$$]
1 [CA] [Japanese] [$$]
2 [NY, CA, MA] [Italian, Greek, Japanese] [$]
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
#create Series for each column by list comprehension
vals = [pd.Series(mlb.fit_transform(df[x]).tolist()) for x in df.columns]
#concat to df
df2 = pd.concat(vals, keys=df.columns, axis=1)
print (df2)
location menu price
0 [1, 0, 1, 1] [1, 1, 0] [0, 1]
1 [1, 0, 0, 0] [0, 0, 1] [0, 1]
2 [1, 1, 0, 1] [1, 1, 1] [1, 0]
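Note that the bit order in the output above differs from the order in the question, because MultiLabelBinarizer sorts the labels when it learns them from the data (for example CA, MA, MI, NY for location). If the exact layout from the question matters, one option is to pass an explicit label order through the `classes` parameter. The following is a minimal, self-contained sketch under that assumption; the helper `first_seen` is a hypothetical name introduced here, and "order of first appearance" is only a guess at the ordering the question intends.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({
    'menu': [['Italian', 'Greek'], ['Japanese'], ['Italian', 'Greek', 'Japanese']],
    'price': ['$$', '$$', '$'],
    'location': [['NY', 'CA', 'MI'], 'CA', ['NY', 'CA', 'MA']],
})


def first_seen(lists):
    """Hypothetical helper: distinct labels in order of first appearance."""
    return list(dict.fromkeys(label for labels in lists for label in labels))


encoded = {}
for col in df.columns:
    # wrap scalar cells in one-item lists so every cell is a list of labels
    lists = [v if isinstance(v, list) else [v] for v in df[col]]
    # fix the column order of the indicator matrix instead of the sorted default
    mlb = MultiLabelBinarizer(classes=first_seen(lists))
    encoded[col] = mlb.fit_transform(lists).tolist()

# each cell of df2 holds the indicator vector for that row and column
df2 = pd.DataFrame(encoded)
print(df2)
```

With a fixed label order, position i of every vector in a column always refers to the same label, which you can read back from `mlb.classes_` if you keep one binarizer per column.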