Pandas-拆分多索引数据帧

时间:2017-05-16 19:28:00

标签: python pandas multiple-columns

我有一个多索引的pandas数据框,我想用“||”分隔值字符并包含一个索引层,其中包含三个新列'Connection','Val1''Val2'。

如果有人能给我提供一些提示,那将会很有帮助。

当前示例数据框:

                              Experiment1                        Experiment2      
    Target      Analyze1_ab               Analyze2_zz            Analyze1_yy        
    XXX_1   Edge2||3.1E-07||-0.5    Edge2||2.1E-06||-0.9     Edge2||6.4E-02||-0.3   
    XXX_4   Edge1||6.4E-12||1.1     Edge1||2.4E-11||9.4      Edge1||1.4E-11||1.4    
    ABC_1   Edge1||3.9E-07||0.7     Edge1||2.9E-07||5.6      Edge1||6.8E-02||0.4    
    ABC_2   Edge2||1.1E-09||-0.5    Edge2||1.2E-09||1.2      Edge2||1.0E-03||-0.5   
    ABC_3   Edge2||4.6E-25||-0.8    Edge2||2.6E-10||1.9      Edge2||5.0E-17||-0.9   
    XXX_2   Edge2||1.7E-07||-0.5    Edge2||5.7E-08||-0.3     Edge2||4.1E-02||-0.3   
    ABC_4   Edge1||8.1E-02||0.5     Edge1||9.1E-02||1.5      Edge1||5.4E-02||0.6    
    ABC_5   Edge1||6.7E-02||0.3     Edge1||4.2E-02||1.9      Edge1||5.6E-03||0.4    
    XXX_3   Edge2||3.1E-03||-0.4    Edge1||2.4E-11||1.1      Edge2||2.4E-02||-0.3   

所需数据框:

                                   Experiment1                            Experiment2       
    Target          Analyze1_ab                  Analyze2_zz                Analyze1_yy     
           Connection   Val1    Val2    Connection  Val1    Val2    Connection  Val1    Val2
    XXX_1   Edge2   3.10E-07    -0.5    Edge2   2.10E-06    -0.9    Edge2   6.40E-02    -0.3
    XXX_4   Edge1   6.40E-12    1.1     Edge1   2.40E-11    9.4     Edge1   1.40E-11    1.4
    ABC_1   Edge1   3.90E-07    0.7     Edge1   2.90E-07    5.6     Edge1   6.80E-02    0.4
    ABC_2   Edge2   1.10E-09    -0.5    Edge2   1.20E-09    1.2     Edge2   1.00E-03    -0.5
    ABC_3   Edge2   4.60E-25    -0.8    Edge2   2.60E-10    1.9     Edge2   5.00E-17    -0.9
    XXX_2   Edge2   1.70E-07    -0.5    Edge2   5.70E-08    -0.3    Edge2   4.10E-02    -0.3
    ABC_4   Edge1   8.10E-02    0.5     Edge1   9.10E-02    1.5     Edge1   5.40E-02    0.6
    ABC_5   Edge1   6.70E-02    0.3     Edge1   4.20E-02    1.9     Edge1   5.60E-03    0.4
    XXX_3   Edge2   3.10E-03    -0.4    Edge1   2.40E-11    1.1     Edge2   2.40E-02    -0.3

3 个答案:

答案 0 :(得分:4)

<强>设置

df
Out[2319]: 
      ID           Experiment1                                 Experiment2
  Target           Analyze1_ab           Analyze2_ab           Analyze1_yy
0  XXX_1  Edge2||3.1E-07||-0.5  Edge2||2.1E-06||-0.9  Edge2||6.4E-02||-0.3
1  XXX_4   Edge1||6.4E-12||1.1   Edge1||2.4E-11||9.4   Edge1||1.4E-11||1.4
2  ABC_1   Edge1||3.9E-07||0.7   Edge1||2.9E-07||5.6   Edge1||6.8E-02||0.4
3  ABC_2  Edge2||1.1E-09||-0.5   Edge2||1.2E-09||1.2  Edge2||1.0E-03||-0.5
4  ABC_3  Edge2||4.6E-25||-0.8   Edge2||2.6E-10||1.9  Edge2||5.0E-17||-0.9
5  XXX_2  Edge2||1.7E-07||-0.5  Edge2||5.7E-08||-0.3  Edge2||4.1E-02||-0.3
6  ABC_4   Edge1||8.1E-02||0.5   Edge1||9.1E-02||1.5   Edge1||5.4E-02||0.6
7  ABC_5   Edge1||6.7E-02||0.3   Edge1||4.2E-02||1.9   Edge1||5.6E-03||0.4
8  XXX_3  Edge2||3.1E-03||-0.4   Edge1||2.4E-11||1.1  Edge2||2.4E-02||-0.3

<强>解决方案

#split columns by '||' and rebuild a Dataframe with the separated columns
df2 = pd.DataFrame(np.asarray(df.iloc[:,1:].apply(lambda x: x.str.split('\|\|')).values.tolist()).reshape(9,-1))
#set Multilevel columns
df2.columns=pd.MultiIndex.from_tuples([('Experiment1','Analyze1_ab','Connection'),
('Experiment1','Analyze1_ab','Val1'),
('Experiment1','Analyze1_ab','Val2'),
('Experiment1','Analyze2_zz','Connection'),
('Experiment1','Analyze2_zz','Val1'),
('Experiment1','Analyze2_zz','Val2'),
('Experiment2','Analyze1_yy','Connection'),
('Experiment2','Analyze1_yy','Val1'),
('Experiment2','Analyze1_yy','Val2')])
#add Target column
df2.insert(0,'Target',df.iloc[:,0])

Out[2324]: 
  Target Experiment1                                           Experiment2  
         Analyze1_ab                Analyze2_zz                Analyze1_yy   
          Connection     Val1  Val2  Connection     Val1  Val2  Connection     Val1  Val2 
0  XXX_1       Edge2  3.1E-07  -0.5       Edge2  2.1E-06  -0.9       Edge2  6.4E-02  -0.3 
1  XXX_4       Edge1  6.4E-12   1.1       Edge1  2.4E-11   9.4       Edge1  1.4E-11   1.4 
2  ABC_1       Edge1  3.9E-07   0.7       Edge1  2.9E-07   5.6       Edge1  6.8E-02   0.4 
3  ABC_2       Edge2  1.1E-09  -0.5       Edge2  1.2E-09   1.2       Edge2  1.0E-03  -0.5 
4  ABC_3       Edge2  4.6E-25  -0.8       Edge2  2.6E-10   1.9       Edge2  5.0E-17  -0.9 
5  XXX_2       Edge2  1.7E-07  -0.5       Edge2  5.7E-08  -0.3       Edge2  4.1E-02  -0.3 
6  ABC_4       Edge1  8.1E-02   0.5       Edge1  9.1E-02   1.5       Edge1  5.4E-02   0.6 
7  ABC_5       Edge1  6.7E-02   0.3       Edge1  4.2E-02   1.9       Edge1  5.6E-03   0.4 
8  XXX_3       Edge2  3.1E-03  -0.4       Edge1  2.4E-11   1.1       Edge2  2.4E-02  -0.3

答案 1 :(得分:2)

import pandas as pd

# Initialize DataFrame
# -----------------------------------------------------------------------------

df = pd.DataFrame({
    'Analyze1_ab': ['Edge2||3.1E-07||-0.5', 'Edge1||6.4E-12||1.1'],
    'Analyze2_zz': ['Edge2||2.1E-06||-0.9', 'Edge1||2.4E-11||9.4'],
    'Analyze1_yy': ['Edge2||6.4E-02||-0.3', 'Edge1||1.4E-11||1.4'],
    'Target': ['XXX_1', 'XXX_4'],})

df.columns = pd.MultiIndex.from_tuples(
    [('Experiment1', 'Analyze1_ab'),
     ('Experiment2', 'Analyze1_yy'),
     ('Experiment1', 'Analyze2_zz'),
     ('Target', '')])

# Split 'Analyses' columns by double pipes ||
# -----------------------------------------------------------------------------

# Initialize final DataFrame
final_df = pd.DataFrame()

for col_name in df.columns:
    if (col_name[1].startswith('Analyze') and
            df[col_name].str.contains('||').all()):
        # Split 'Analysis' by || into new columns
        splitted_analysis = df[col_name].str.split('\|\|', expand=True)
        # The new column names are 0, 1, 2. Let's rename them.
        splitted_analysis.columns = ['Connection', 'Val1', 'Val2']
        # Recreate MultiIndex
        splitted_analysis.columns = pd.MultiIndex.from_tuples(
            [(col_name[0], col_name[1], c) for c in splitted_analysis.columns])
        # Concatenate the new columns to the final_df
        final_df = pd.concat(objs=[final_df, splitted_analysis], axis=1)

# Add 'Target' column in the final_df.
# First, extract it.
target_col = pd.DataFrame(df[('Target', '')])
# Then, increase MultiIndex level of 'Target' from 2 to 3,
# to allow smooth concatenation with the final_df.
target_col.columns = pd.MultiIndex.from_tuples([('Target', '', '')])
final_df = pd.concat([final_df, target_col], axis=1)

验证:print(final_df)

  Experiment1                Experiment2                Experiment1                Target
  Analyze1_ab                Analyze1_yy                Analyze2_zz                      
   Connection     Val1  Val2  Connection     Val1  Val2  Connection     Val1  Val2       
0       Edge2  3.1E-07  -0.5       Edge2  6.4E-02  -0.3       Edge2  2.1E-06  -0.9  XXX_1
1       Edge1  6.4E-12   1.1       Edge1  1.4E-11   1.4       Edge1  2.4E-11   9.4  XXX_4

验证:pprint.pprint([c for c in final_df.columns])

[('Experiment1', 'Analyze1_ab', 'Connection'),
 ('Experiment1', 'Analyze1_ab', 'Val1'),
 ('Experiment1', 'Analyze1_ab', 'Val2'),
 ('Experiment2', 'Analyze1_yy', 'Connection'),
 ('Experiment2', 'Analyze1_yy', 'Val1'),
 ('Experiment2', 'Analyze1_yy', 'Val2'),
 ('Experiment1', 'Analyze2_zz', 'Connection'),
 ('Experiment1', 'Analyze2_zz', 'Val1'),
 ('Experiment1', 'Analyze2_zz', 'Val2'),
 ('Target', '', '')]

答案 2 :(得分:1)

你可以这样做:

In [319]: x = df.iloc[:, 1:]
     ...: x.columns = list('abc')
     ...: x = x.stack().str.extractall('([^\|]+)').unstack([-1, -2])
     ...: x.columns = range(9)
     ...:
     ...: x = df.iloc[:, 0].to_frame('Target').join(x)
     ...:
     ...: x.columns = \
     ...: pd.MultiIndex.from_tuples([('', 'Target', '')]) \
     ...:   .union(pd.MultiIndex.from_tuples(
     ...:             [(t[0][0], t[0][1], t[1])
     ...:              for t in product(df.columns[1:].to_series().tolist(),
     ...:              ['Connection', 'Val1', 'Val2'])
     ...:             ]))
     ...:

<强>结果:

In [320]: x
Out[320]:
         Experiment1                                           Experiment2
  Target Analyze1_ab                Analyze2_zz                Analyze1_yy
          Connection     Val1  Val2  Connection     Val1  Val2  Connection     Val1  Val2
0  XXX_1       Edge2  3.1E-07  -0.5       Edge2  2.1E-06  -0.9       Edge2  6.4E-02  -0.3
1  XXX_4       Edge1  6.4E-12   1.1       Edge1  2.4E-11   9.4       Edge1  1.4E-11   1.4
2  ABC_1       Edge1  3.9E-07   0.7       Edge1  2.9E-07   5.6       Edge1  6.8E-02   0.4
3  ABC_2       Edge2  1.1E-09  -0.5       Edge2  1.2E-09   1.2       Edge2  1.0E-03  -0.5
4  ABC_3       Edge2  4.6E-25  -0.8       Edge2  2.6E-10   1.9       Edge2  5.0E-17  -0.9
5  XXX_2       Edge2  1.7E-07  -0.5       Edge2  5.7E-08  -0.3       Edge2  4.1E-02  -0.3
6  ABC_4       Edge1  8.1E-02   0.5       Edge1  9.1E-02   1.5       Edge1  5.4E-02   0.6
7  ABC_5       Edge1  6.7E-02   0.3       Edge1  4.2E-02   1.9       Edge1  5.6E-03   0.4
8  XXX_3       Edge2  3.1E-03  -0.4       Edge1  2.4E-11   1.1       Edge2  2.4E-02  -0.3