Pandas: merging columns that share a similar prefix

Time: 2021-04-10 02:23:33

Tags: pandas

I have a Pandas DataFrame with binary columns, as shown below:

DEM_HEALTH_PRIV  DEM_HEALTH_PRE  DEM_HEALTH_HOS  DEM_HEALTH_OUT
              0               1               0               0
              0               0               1               1

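I want to take the suffix of each variable and convert the binary variables into a single categorical variable that corresponds to the prefix. For example, merge all the DEM_HEALTH variables into one column that holds a list of the suffixes ("PRE", "HOS", "OUT", etc.) whose columns equal 1.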

Output
DEM_HEALTH_PRIV 
['PRE']                      
['HOS','OUT']              

Any help would be greatly appreciated!

1 answer:

Answer 0 (score: 0)

Try this:

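import pandas as pd

# The original DataFrame is called df.
# Split each column name at its last underscore into (prefix, suffix).
new_cols = [tuple(i.rsplit('_', 1)) for i in df.columns]
new_cols = pd.MultiIndex.from_tuples(new_cols)
df.columns = new_cols
# Keep only the cells equal to 1, stack the suffix level into the index,
# then collect the suffixes of each original row into a list.
data = df[df==1]\
        .stack()\
        .reset_index(-1)\
        .groupby(level=0)['level_1']\
        .apply(list)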

Explanation

IIUC, your data looks like this:

print(df)

   DEM_HEALTH_PRIV  DEM_HEALTH_OUT  DEM_HEALTH_PRE  DEM_HEALTH_HOS
0                0               1               1               1
1                0               1               0               0
2                0               0               1               0
3                0               1               0               0
4                1               0               0               0
5                0               0               1               1
6                1               0               1               0
7                1               0               0               1
8                0               1               0               0
9                0               1               1               0

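1. Create a MultiIndex via rsplit

The first step is to rsplit (reverse split) each column name at the last occurrence of the "_" substring. Then create a MultiIndex with DEM_HEALTH as level 0 and PRE, HOS, etc. as level 1.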

new_cols = [tuple(i.rsplit('_',1)) for i in df.columns]
new_cols = pd.MultiIndex.from_tuples(new_cols)

df.columns = new_cols
print(df)
  DEM_HEALTH            
        PRIV OUT PRE HOS
0          0   1   1   1
1          0   1   0   0
2          0   0   1   0
3          0   1   0   0
4          1   0   0   0
5          0   0   1   1
6          1   0   1   0
7          1   0   0   1
8          0   1   0   0
9          0   1   1   0
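
To see why rsplit with a limit of 1 is used here rather than a plain split, the following minimal sketch runs both on the column names from this example. The variable names `columns`, `pairs`, and `multi` are only illustrative and do not come from the answer.

```python
import pandas as pd

columns = ["DEM_HEALTH_PRIV", "DEM_HEALTH_OUT", "DEM_HEALTH_PRE", "DEM_HEALTH_HOS"]

# rsplit('_', 1) splits at most once, starting from the right, so only the
# last underscore separates the prefix from the suffix.
pairs = [tuple(name.rsplit("_", 1)) for name in columns]
print(pairs)
# [('DEM_HEALTH', 'PRIV'), ('DEM_HEALTH', 'OUT'), ('DEM_HEALTH', 'PRE'), ('DEM_HEALTH', 'HOS')]

# A plain split('_') would also break the prefix apart at its own underscore.
print("DEM_HEALTH_PRIV".split("_"))
# ['DEM', 'HEALTH', 'PRIV']

# Each (prefix, suffix) tuple becomes one entry of a two-level column index.
multi = pd.MultiIndex.from_tuples(pairs)
print(multi.get_level_values(0).unique().tolist())  # ['DEM_HEALTH']
print(multi.get_level_values(1).tolist())           # ['PRIV', 'OUT', 'PRE', 'HOS']
```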

2. Stack and groupby over level=0

data = df[df==1]\
        .stack()\
        .reset_index(-1)\
        .groupby(level=0)['level_1']\
        .apply(list)

0    [HOS, OUT, PRE]
1              [OUT]
2              [PRE]
3              [OUT]
4             [PRIV]
5         [HOS, PRE]
6        [PRE, PRIV]
7        [HOS, PRIV]
8              [OUT]
9         [OUT, PRE]
Name: level_1, dtype: object
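
As an alternative, here is a minimal sketch that builds the lists row by row from a boolean mask instead of chaining stack() and groupby(). It is not the answer's method: the helper name `collapse_by_prefix` is hypothetical, and it assumes every column name contains at least one underscore. It does not rely on stack() dropping the NaN cells, which may behave differently across pandas versions. It also differs from the chained version in two ways: rows with no 1 are kept with an empty list rather than dropped, and suffixes follow the column order rather than being sorted.

```python
import pandas as pd


def collapse_by_prefix(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return one list-valued column per prefix."""
    # Split each column name at its last underscore (assumes one exists).
    split = [name.rsplit("_", 1) for name in df.columns]
    prefixes = [parts[0] for parts in split]
    suffixes = [parts[1] for parts in split]

    out = {}
    for prefix in dict.fromkeys(prefixes):  # unique prefixes, original order
        cols = [c for c, p in zip(df.columns, prefixes) if p == prefix]
        sufs = [s for s, p in zip(suffixes, prefixes) if p == prefix]
        # Boolean array: True where the binary flag equals 1.
        flags = df[cols].eq(1).to_numpy()
        # For each row, keep the suffixes whose flag is set.
        lists = [[s for s, f in zip(sufs, row) if f] for row in flags]
        out[prefix] = pd.Series(lists, index=df.index)
    return pd.DataFrame(out)


# The two rows from the question.
df = pd.DataFrame(
    {
        "DEM_HEALTH_PRIV": [0, 0],
        "DEM_HEALTH_PRE": [1, 0],
        "DEM_HEALTH_HOS": [0, 1],
        "DEM_HEALTH_OUT": [0, 1],
    }
)
result = collapse_by_prefix(df)
print(result["DEM_HEALTH"].tolist())
# [['PRE'], ['HOS', 'OUT']]
```

If the frame held columns with several different prefixes, this sketch would return one list column per prefix, whereas the chained version above collects suffixes under the single 'level_1' column.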