I want to convert this DataFrame:
pd.DataFrame({"l1": [["fr en","en"]],
"l2": [["fr en","in","it"]],
"l3": [["he","es","fi"]],
"l4": [["es"]]}).T
>> l1 [fr en, en]
...
l4 [es]
into this document-term matrix (DTM):
data = [[1,1,0,0,0,0,0], [1,0,1,1,0,0,0], [0,0,0,0,1,1,1], [0,0,0,0,0,1,0]]
pd.DataFrame(index=["l1","l2","l3","l4"], data=data, columns=["fr en","en","in","it","he","es","fi"])
>>     fr en  en  in  it  he  es  fi
   l1      1   1   0   0   0   0   0
   ...
My inefficient approach is to first use chain to collect every possible value, and then build one indicator row per list:
from itertools import chain
langs = sorted(set(chain(*df["lang"])))  # sorted list gives a deterministic column order
pd.DataFrame(data=df["lang"].apply(lambda x: [1 if lang in x else 0 for lang in langs]).tolist(), columns=langs)
PS: I don't want to " ".join() these lists, because a label can itself contain a space, as you can see with "fr en", so joining would lose information.
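The concern in the PS can be shown with row l1 alone. This is only a small illustration using plain Python; the variable names are made up for the example.

```python
# Labels for row l1; note that "fr en" is a single label containing a space.
labels = ["fr en", "en"]

# Joining on a space and splitting again cannot recover the original labels,
# because the separator also occurs inside one of them.
joined = " ".join(labels)
recovered = joined.split(" ")

print(joined)     # fr en en
print(recovered)  # ['fr', 'en', 'en']
```

After the round trip, "fr en" has turned into two separate labels, so a matrix built from the joined string would have no "fr en" column at all.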
Answer 0 (score: 2)
I think you need MultiLabelBinarizer:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(df[0]),columns=mlb.classes_, index=df.index)
print (df)
    en  es  fi  fr en  he  in  it
l1   1   0   0      1   0   0   0
l2   0   0   0      1   0   1   1
l3   0   1   1      0   1   0   0
l4   0   1   0      0   0   0   0
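To make the result above easier to read, here is a dependency-free sketch of the kind of encoding it shows: one column per distinct label, in sorted order, with 1 where the row contains that label. This is not scikit-learn's implementation, and the helper name `binarize` is hypothetical.

```python
def binarize(rows):
    """Return (classes, matrix) for a list of label lists (hypothetical helper)."""
    # Collect the distinct labels and sort them so the column order is stable.
    classes = sorted({label for labels in rows for label in labels})
    matrix = []
    for labels in rows:
        present = set(labels)  # set membership avoids rescanning the list per class
        matrix.append([1 if c in present else 0 for c in classes])
    return classes, matrix


# The four rows from the question, in index order l1..l4.
rows = [["fr en", "en"], ["fr en", "in", "it"], ["he", "es", "fi"], ["es"]]
classes, matrix = binarize(rows)

print(classes)
for name, row in zip(["l1", "l2", "l3", "l4"], matrix):
    print(name, row)
```

Because every label is treated as an opaque string, "fr en" stays a single column and no separator is involved.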
Alternatively, if the separator | never occurs in the data, you can use the slower join-based solution:
df = df[0].str.join('|').str.get_dummies()
print (df)
    en  es  fi  fr en  he  in  it
l1   1   0   0      1   0   0   0
l2   0   0   0      1   0   1   1
l3   0   1   1      0   1   0   0
l4   0   1   0      0   0   0   0
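The join-based variant only works if the separator never appears inside a label. One way to guard against that is to check before joining; the sketch below is an assumption about how that could be done, and both the helper name `pick_separator` and the candidate list are hypothetical.

```python
def pick_separator(rows, candidates=("|", ";", "\t")):
    """Return the first candidate that occurs in no label (hypothetical helper)."""
    for sep in candidates:
        if not any(sep in label for labels in rows for label in labels):
            return sep
    raise ValueError("every candidate separator occurs in at least one label")


# Rows from the question: "|" is safe because no label contains it.
rows = [["fr en", "en"], ["fr en", "in", "it"], ["he", "es", "fi"], ["es"]]
sep = pick_separator(rows)
print(repr(sep))

# If a label did contain "|", the next candidate would be chosen instead.
print(repr(pick_separator([["a|b", "c"]])))
```

The chosen value would then be passed to both steps, for example `df[0].str.join(sep).str.get_dummies(sep=sep)`, so that joining and splitting use the same character.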