Optimizing the split of a column of lists into separate columns

Time: 2018-07-19 04:43:25

Tags: python pandas dataframe

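I have been working on splitting a column whose cells are lists into separate indicator columns. I have a solution, but it is very slow.

I have the following pandas DataFrame: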

|basket                             |
|['two apple','A banana']           |
|['Red pear','A banana']            |
|['two apple','A banana','Red pear']|

I would like to convert it into the following DataFrame, with one 0/1 column per distinct item:

|basket                             |two apple|A banana|Red pear|
|['two apple','A banana']           |1        |1       |0       |
|['Red pear','A banana']            |0        |1       |1       |
|['two apple','A banana','Red pear']|1        |1       |1       |
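To reproduce the example, here is a minimal sketch of the sample frame. It assumes each cell of basket holds an actual Python list, not the string representation of one; the question does not say which.

```python
import pandas as pd

# Sample data from the question; each cell is assumed to be a real list.
df = pd.DataFrame({
    "basket": [
        ["two apple", "A banana"],
        ["Red pear", "A banana"],
        ["two apple", "A banana", "Red pear"],
    ]
})
print(df)
```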

After creating the columns I need, I fill them with the following code:

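# Assumes the indicator columns already exist (filled with 0)
# and that df has the default 0..n-1 integer index.
for index, row in enumerate(df.basket):
    if index > 0 and index % 10000 == 0:
        print(100 * index / len(df.basket), ' percent complete')
    for col in df.columns:
        for pattern in row:
            if col == pattern:
                # Set one cell: row label first, then column label
                df.loc[index, col] = 1
                break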

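As the number of rows grows, this takes forever. I am hoping to find a more efficient way to populate the columns, even if that means creating them from the list column in the first place.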

1 Answer:

Answer 0: (score: 4)

Use MultiLabelBinarizer:

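import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# fit_transform returns a 0/1 matrix with one column per distinct item;
# mlb.classes_ holds the item names in sorted order.
df = df.join(pd.DataFrame(mlb.fit_transform(df['basket']),
                          columns=mlb.classes_,
                          index=df.index))
print(df)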
                            basket  A banana  Red pear  two apple
0            [two apple, A banana]         1         0          1
1             [Red pear, A banana]         1         1          0
2  [two apple, A banana, Red pear]         1         1          1
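If adding scikit-learn as a dependency is not an option, a pandas-only variant may also work. The sketch below is not part of the original answer and has not been benchmarked against it. It assumes every cell is a list of strings and that no item contains the "|" character used as the separator.

```python
import pandas as pd

df = pd.DataFrame({
    "basket": [
        ["two apple", "A banana"],
        ["Red pear", "A banana"],
        ["two apple", "A banana", "Red pear"],
    ]
})

# Join each list into one delimited string, then let pandas
# split on the separator and build one 0/1 column per item.
# Assumption: "|" never occurs inside an item name.
indicators = df["basket"].str.join("|").str.get_dummies(sep="|")

# Attach the indicator columns to the original frame by index.
result = df.join(indicators)
print(result)
```

Both approaches avoid the per-cell Python loop from the question and build all indicator columns in one pass.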