I have a DataFrame with a column of genre lists:
import pandas as pd

df = pd.DataFrame({'genres': [['Drama'], ['Music', 'Drama', 'Romance'],
                              ['Action', 'Adventure', 'Comedy'],
                              ['Thriller', 'Romance', 'Drama'],
                              ['Adventure', 'Family']]
                   })
print(df)
genres = ['Action', 'Adventure', 'Comedy', 'Drama', 'Family', 'Music', 'Romance', 'Thriller']  # list of all genres
Data:
                        genres
0                      [Drama]
1      [Music, Drama, Romance]
2  [Action, Adventure, Comedy]
3   [Thriller, Romance, Drama]
4          [Adventure, Family]
I would like the output to look like this, with one 0/1 indicator column per genre:
                        genres  Action  Adventure  Comedy  Drama  Family  \
0                      [Drama]       0          0       0      1       0
1      [Music, Drama, Romance]       0          0       0      1       0
2  [Action, Adventure, Comedy]       1          1       1      0       0
3   [Thriller, Romance, Drama]       0          0       0      1       0
4          [Adventure, Family]       0          1       0      0       1

   Music  Romance  Thriller
0      0        0         0
1      1        1         0
2      0        0         0
3      0        1         1
4      0        0         0
Answer 0 (score: 6)
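from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# fit_transform returns a 0/1 array with one column per genre;
# mlb.classes_ holds the matching column names
df1 = pd.DataFrame(mlb.fit_transform(df['genres']), columns=mlb.classes_, index=df.index)
df = df.join(df1)
print(df)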
                        genres  Action  Adventure  Comedy  Drama  Family  \
0                      [Drama]       0          0       0      1       0
1      [Music, Drama, Romance]       0          0       0      1       0
2  [Action, Adventure, Comedy]       1          1       1      0       0
3   [Thriller, Romance, Drama]       0          0       0      1       0
4          [Adventure, Family]       0          1       0      0       1

   Music  Romance  Thriller
0      0        0         0
1      1        1         0
2      0        0         0
3      0        1         1
4      0        0         0
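To see what the join is built from, it can help to look at the intermediate pieces. The following is a minimal sketch, assuming scikit-learn is installed; the variable name `encoded` is introduced here for illustration and is not part of the answer above.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({'genres': [['Drama'], ['Music', 'Drama', 'Romance'],
                              ['Action', 'Adventure', 'Comedy'],
                              ['Thriller', 'Romance', 'Drama'],
                              ['Adventure', 'Family']]})

mlb = MultiLabelBinarizer()
# `encoded` is an illustrative name: a 2-D 0/1 array, one row per row of df
encoded = mlb.fit_transform(df['genres'])

print(encoded.shape)        # (rows, distinct genres)
print(list(mlb.classes_))   # genre labels, in sorted order, one per column
print(encoded[0])           # indicator row for ['Drama']
```

Because `classes_` is sorted, the new columns come out in alphabetical order regardless of the order in which genres first appear in the data.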
If you want to keep only the genres in a given list, add reindex:
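# run this on the original df, in place of the plain join above
genres = ['Action', 'Adventure', 'Comedy', 'Drama']
df1 = pd.DataFrame(mlb.fit_transform(df['genres']), columns=mlb.classes_, index=df.index)
# reindex keeps only the listed columns; a listed genre missing from the data becomes all 0
df = df.join(df1.reindex(columns=genres, fill_value=0))
print(df)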
                        genres  Action  Adventure  Comedy  Drama
0                      [Drama]       0          0       0      1
1      [Music, Drama, Romance]       0          0       0      1
2  [Action, Adventure, Comedy]       1          1       1      0
3   [Thriller, Romance, Drama]       0          0       0      1
4          [Adventure, Family]       0          1       0      0
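The role of `fill_value=0` is easier to see when the list contains a genre that never occurs in the data. Here is a small self-contained sketch under that assumption; the `'Horror'` entry and the names `wanted` and `subset` are hypothetical and chosen only for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({'genres': [['Drama'], ['Music', 'Drama', 'Romance'],
                              ['Action', 'Adventure', 'Comedy'],
                              ['Thriller', 'Romance', 'Drama'],
                              ['Adventure', 'Family']]})

# Hypothetical filter list: 'Horror' does not occur anywhere in df
wanted = ['Drama', 'Horror', 'Action']

mlb = MultiLabelBinarizer()
df1 = pd.DataFrame(mlb.fit_transform(df['genres']),
                   columns=mlb.classes_, index=df.index)

# reindex selects and orders the columns as in `wanted`;
# columns that do not exist in df1 are created and filled with 0
subset = df1.reindex(columns=wanted, fill_value=0)
print(subset)
```

Without `fill_value=0`, the missing column would be filled with NaN instead, which would also turn it into a float column.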