I have a Pandas dataframe with binary columns, like this:
DEM_HEALTH_PRIV  DEM_HEALTH_PRE  DEM_HEALTH_HOS  DEM_HEALTH_OUT
              0               1               0               0
              0               0               1               1
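I want to take the suffix of each variable and convert the binary variables into a single categorical variable named after the prefix. For example, merge all the DEM_HEALTH variables into one column that holds the list of suffixes ("PRE", "HOS", "OUT", etc.) whose columns are equal to 1.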
Output
DEM_HEALTH_PRIV
['PRE']
['HOS','OUT']
Any help would be greatly appreciated!
Answer 0 (score: 0)
Try this:
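import pandas as pd

# the original dataframe is called df
# split each column name at its last "_" into a (prefix, suffix) tuple
new_cols = [tuple(i.rsplit('_', 1)) for i in df.columns]
new_cols = pd.MultiIndex.from_tuples(new_cols)
df.columns = new_cols
# keep only the 1s, move the suffix level into the rows, list the suffixes per row
data = (
    df[df == 1]
    .stack()
    .dropna(how='all')  # no-op when stack() has already dropped the NaN rows
    .reset_index(-1)
    .groupby(level=0)['level_1']
    .apply(list)
)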
IIUC, your data looks like this:
print(df)
   DEM_HEALTH_PRIV  DEM_HEALTH_OUT  DEM_HEALTH_PRE  DEM_HEALTH_HOS
0                0               1               1               1
1                0               1               0               0
2                0               0               1               0
3                0               1               0               0
4                1               0               0               0
5                0               0               1               1
6                1               0               1               0
7                1               0               0               1
8                0               1               0               0
9                0               1               1               0
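The first step is to rsplit (reverse split) the column names on the last occurrence of the "_" substring. Then create a MultiIndex with DEM_HEALTH as level 0 and PRE, HOS, etc. as level 1.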
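To see why rsplit with a maxsplit of 1 is used here rather than a plain split, here is a small sketch on a single column name (the variable `name` is only for illustration):

```python
name = "DEM_HEALTH_PRIV"

# a plain split breaks at every "_", so the prefix itself is torn apart
print(name.split("_"))        # ['DEM', 'HEALTH', 'PRIV']

# rsplit with maxsplit=1 breaks only at the last "_"
prefix, suffix = name.rsplit("_", 1)
print(prefix, suffix)         # DEM_HEALTH PRIV
```

The (prefix, suffix) pair is exactly the shape of tuple that a two-level MultiIndex needs.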
new_cols = [tuple(i.rsplit('_',1)) for i in df.columns]
new_cols = pd.MultiIndex.from_tuples(new_cols)
df.columns = new_cols
print(df)
  DEM_HEALTH
        PRIV OUT PRE HOS
0          0   1   1   1
1          0   1   0   0
2          0   0   1   0
3          0   1   0   0
4          1   0   0   0
5          0   0   1   1
6          1   0   1   0
7          1   0   0   1
8          0   1   0   0
9          0   1   1   0
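The second step keeps only the cells equal to 1 (everything else becomes NaN), stacks the suffix level into the row index, and collects the remaining suffixes into a list for each original row.
data = (
    df[df == 1]
    .stack()
    .dropna(how='all')  # make sure all-NaN rows are gone, however stack() treats them
    .reset_index(-1)
    .groupby(level=0)['level_1']
    .apply(list)
)
print(data)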
0 [HOS, OUT, PRE]
1 [OUT]
2 [PRE]
3 [OUT]
4 [PRIV]
5 [HOS, PRE]
6 [PRE, PRIV]
7 [HOS, PRIV]
8 [OUT]
9 [OUT, PRE]
Name: level_1, dtype: object
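If the frame holds several prefixes at once, or you want rows without any 1 to show up as an empty list instead of disappearing, a plain-Python sketch like the one below may be easier to adapt. Note that `collapse_binary_columns` is a hypothetical helper name, not a pandas function, and it assumes every column name contains the separator and holds only 0/1 values. It works on the original flat column names, so it does not need the MultiIndex step.

```python
import pandas as pd


def collapse_binary_columns(df, sep="_"):
    """Hypothetical helper: one list-valued column per prefix.

    Assumes every column name contains `sep` and every cell is 0 or 1.
    """
    # group the columns by prefix, remembering each suffix
    groups = {}
    for col in df.columns:
        prefix, suffix = col.rsplit(sep, 1)
        groups.setdefault(prefix, []).append((col, suffix))

    collapsed = {}
    for prefix, members in groups.items():
        cols = [col for col, _ in members]
        suffixes = [suffix for _, suffix in members]
        rows = df[cols].to_numpy()
        # for each row, keep the suffixes whose flag equals 1
        collapsed[prefix] = pd.Series(
            [[s for s, flag in zip(suffixes, row) if flag == 1] for row in rows],
            index=df.index,
        )
    return pd.DataFrame(collapsed)


# the two rows from the question
df = pd.DataFrame({
    "DEM_HEALTH_PRIV": [0, 0],
    "DEM_HEALTH_PRE": [1, 0],
    "DEM_HEALTH_HOS": [0, 1],
    "DEM_HEALTH_OUT": [0, 1],
})
result = collapse_binary_columns(df)
print(result)
```

Here the suffixes inside each list follow the column order of the frame rather than being sorted, and `df.columns` is left untouched.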