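I have a pandas dataframe in which one column holds a list of values in each cell. I need to turn those lists into True/False columns: one column for every unique value that appears in any row's list, set to True when that value is in the row's list and False otherwise.
Here is my dataframe: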
import pandas as pd
data = [
{"agency_id": 1,"province": ["CH", "PE"]},
{"agency_id": 3,"province": ["CH", "CS"]}
]
df = pd.DataFrame(data)
   agency_id  province
0          1  [CH, PE]
1          3  [CH, CS]
The code above creates the initial dataframe.
Then I tried:
df2 = pd.DataFrame(df['province'].values.tolist(),index=df['agency_id'])
which outputs the following:
              0     1     2     3     4     5     6     7
agency_id
1            CH    PE    AQ    TE  None  None  None  None
3            KR    CS  None  None  None  None  None  None
7            FE    FC    BO    MO    RA    RE    RN    PR
8          None  None  None  None  None  None  None  None
10           RM  None  None  None  None  None  None  None
11           RM  None  None  None  None  None  None  None
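But this is not what I want, because the columns are not "aligned": each column holds a list position, not a specific province.
I need something like this: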
agency_id  CH     PE     CS
1          true   true   false
3          true   false  true
Answer 0 (score: 3)
Use MultiLabelBinarizer from sklearn:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
pd.DataFrame(mlb.fit_transform(df['province']),columns=mlb.classes_, index=df.agency_id).astype(bool)
Out[90]:
              CH     CS     PE
agency_id
1           True  False   True
3           True   True  False
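The result above keeps agency_id as the index, whereas the layout asked for in the question shows agency_id as an ordinary column. A minimal sketch of that extra step, assuming the same sample data as the question (the names flags and result are illustrative and not part of the original answer):

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Sample data copied from the question.
data = [
    {"agency_id": 1, "province": ["CH", "PE"]},
    {"agency_id": 3, "province": ["CH", "CS"]},
]
df = pd.DataFrame(data)

mlb = MultiLabelBinarizer()
# One boolean column per unique province, indexed by agency_id.
flags = pd.DataFrame(
    mlb.fit_transform(df["province"]),
    columns=mlb.classes_,
    index=df["agency_id"],
).astype(bool)

# Turn the agency_id index back into a regular column.
result = flags.reset_index()
print(result)
```

The province columns come out in sorted order (CH, CS, PE) because that is how mlb.classes_ is ordered; if a different order matters, the columns can be reselected afterwards.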
Answer 1 (score: 2)
If you do not want to import MultiLabelBinarizer from sklearn.preprocessing just for this, you can clean up/modify your data before building the dataframe:
import pandas as pd
data = [
{"agency_id": 1,"province": ["CH", "PE"]},
{"agency_id": 3,"province": ["CH", "CS"]}
]
# get all provinces from any included dictionaries of data:
all_prov = sorted(set( (x for y in [d["province"] for d in data] for x in y) ))
# add the missing key:values to your data's dicts:
for d in data:
    for p in all_prov:
        d[p] = p in d["province"]
print(data)
df = pd.DataFrame(data)
print(df)
Output:
# data
[{'agency_id': 1, 'province': ['CH', 'PE'], 'CH': True, 'CS': False, 'PE': True},
{'agency_id': 3, 'province': ['CH', 'CS'], 'CH': True, 'CS': True, 'PE': False}]
# df
     CH     CS     PE  agency_id  province
0  True  False   True          1  [CH, PE]
1  True   True  False          3  [CH, CS]
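Note that the approach above modifies the dictionaries in data in place and leaves the original province column in the dataframe. If that is not wanted, a possible variant (a sketch only; the name rows is made up for illustration) builds fresh dictionaries and leaves the list column out:

```python
import pandas as pd

# Sample data copied from the question.
data = [
    {"agency_id": 1, "province": ["CH", "PE"]},
    {"agency_id": 3, "province": ["CH", "CS"]},
]

# Every province that occurs in any record, in sorted order.
all_prov = sorted({p for d in data for p in d["province"]})

# Build new dicts instead of mutating data; the province list itself is omitted.
rows = [
    {"agency_id": d["agency_id"], **{p: p in d["province"] for p in all_prov}}
    for d in data
]

df = pd.DataFrame(rows)
print(df)
```

With a recent pandas the columns should follow the key order of the dictionaries (agency_id first, then the provinces); older versions may sort them alphabetically, as in the output shown above.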
Answer 2 (score: 0)
Another solution, using only pandas:
import pandas as pd
data = [
{"agency_id": 1,"province": ["CH", "PE"]},
{"agency_id": 3,"province": ["CH", "CS"]}
]
df = pd.DataFrame(data)
result = df['province'].apply(lambda x: '|'.join(x)).str.get_dummies().astype(bool).set_index(df.agency_id)
print(result)
Output:
              CH     CS     PE
agency_id
1           True  False   True
3           True   True  False
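The join works because "|" is the default separator of str.get_dummies, so it would split incorrectly if a province code ever contained "|". A different pandas-only route that avoids the string round trip is explode plus crosstab. This is a swapped-in technique, not the answer's method, and the sketch assumes a pandas version that has DataFrame.explode (0.25 or later); the name exploded is illustrative:

```python
import pandas as pd

# Sample data copied from the question.
data = [
    {"agency_id": 1, "province": ["CH", "PE"]},
    {"agency_id": 3, "province": ["CH", "CS"]},
]
df = pd.DataFrame(data)

# One row per (agency_id, province) pair.
exploded = df.explode("province")

# Count the pairs, then convert the counts to booleans.
result = pd.crosstab(exploded["agency_id"], exploded["province"]).astype(bool)
print(result)
```

One thing to check on real data: a row with an empty list becomes NaN after explode, so that agency_id may not appear in the crosstab result.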