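I have been looking into splitting a column of lists into separate columns. I have a solution, but it is slow.
I have the following pandas DataFrame: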
|basket |
|['two apple','A banana'] |
|['Red pear','A banana'] |
|['two apple','A banana','Red pear']|
I want to convert it into the following DataFrame:
|basket |two apple|A banana|Red pear|
|['two apple','A banana'] |1 |1 |0 |
|['Red pear','A banana'] |0 |1 |1 |
|['two apple','A banana','Red pear']|1 |1 |1 |
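After creating the columns I need, I have the following code:
for index, row in enumerate(df.basket):
    # Report progress every 10,000 rows
    if index > 0 and index % 10000 == 0:
        print(100 * index / len(df.basket), ' percent complete')
    for col in df.columns:
        for pattern in row:
            if col == pattern:
                # Set the indicator for this row (assumes the default 0..n-1 index)
                df.loc[index, col] = 1
                break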
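As the number of rows grows, this takes forever. I am hoping to find a more efficient way to fill the columns, even if I have to create them from the column of lists.
Answer 0 (score: 4)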
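import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# fit_transform returns a 0/1 array with one column per distinct item (mlb.classes_)
df = df.join(pd.DataFrame(mlb.fit_transform(df['basket']),
                          columns=mlb.classes_,
                          index=df.index))
print(df)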
                            basket  A banana  Red pear  two apple
0            [two apple, A banana]         1         0          1
1             [Red pear, A banana]         1         1          0
2  [two apple, A banana, Red pear]         1         1          1
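
Note that the indicator columns in this output come out sorted (A banana, Red pear, two apple) rather than in the order the question asked for. If scikit-learn is not available, or the first-appearance column order matters, a pandas-only variant along the following lines may also work. This is only a sketch: it assumes pandas 0.25 or later (for Series.explode), rebuilds the sample frame from the question, and the names items, order, indicators and result are made up for illustration.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "basket": [
        ["two apple", "A banana"],
        ["Red pear", "A banana"],
        ["two apple", "A banana", "Red pear"],
    ]
})

# One row per (original row, item) pair; the original index label is repeated
items = df["basket"].explode()

# Distinct items in order of first appearance (hypothetical helper name)
order = list(dict.fromkeys(items.dropna()))

# One-hot encode every item, then collapse back to one row per original index.
# max() keeps the result 0/1 even if an item is repeated within a basket.
indicators = (
    pd.get_dummies(items)
    .groupby(level=0)
    .max()
    .astype(int)[order]
)

result = df.join(indicators)
print(result)
```

On the three sample rows this should give the columns in the order two apple, A banana, Red pear. It avoids the Python-level loop over rows and columns, though for very large frames it may be worth timing it against MultiLabelBinarizer before picking one.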