通过pandas

时间:2016-10-29 13:26:20

标签: python excel pandas

我想从Excel文件中生成整洁的数据,看起来像这样,有三个级别的"合并"头:

enter image description here

Pandas使用多级标题读取文件很好:

# df = pandas.read_excel('test.xlsx', header=[0,1,2])

对于重复性,您可以复制粘贴:

df = pandas.DataFrame({('Unnamed: 0_level_0', 'Unnamed: 0_level_1', 'a'): {1: 'aX', 2: 'aY'}, ('Unnamed: 1_level_0', 'Unnamed: 1_level_1', 'b'): {1: 'bX', 2: 'bY'}, ('Unnamed: 2_level_0', 'Unnamed: 2_level_1', 'c'): {1: 'cX', 2: 'cY'}, ('level1_1', 'level2_1', 'level3_1'): {1: 1, 2: 10}, ('level1_1', 'level2_1', 'level3_2'): {1: 2, 2: 20}, ('level1_1', 'level2_2', 'level3_1'): {1: 3, 2: 30}, ('level1_1', 'level2_2', 'level3_2'): {1: 4, 2: 40}, ('level1_2', 'level2_1', 'level3_1'): {1: 5, 2: 50}, ('level1_2', 'level2_1', 'level3_2'): {1: 6, 2: 60}, ('level1_2', 'level2_2', 'level3_1'): {1: 7, 2: 70}, ('level1_2', 'level2_2', 'level3_2'): {1: 8, 2: 80}})

我想对其进行规范化,以便级别标题位于变量行中,但将列a,b和c保留为列:

Desired output

如果没有多级标题,我会pandas.melt(df, id_vars=['a', 'b', 'c'])来获取我想要的内容。 pandas.melt(df)为我提供了我想要的三个变量列,但显然不会保留a,b和c列。

2 个答案:

答案 0 :(得分:2)

将DF分成两部分,以便于熔化并将它们连接起来。

first_half = df.iloc[:, :3]
second_half = df.iloc[:, 3:]

融化第二个片段。

melt_second_half = pd.melt(second_half)

通过将熔化的DF中的行数除以其自身长度来计算找到的值,重复第一个片段中的值。

repeats = int(melt_second_half.shape[0]/first_half.shape[0])
first_reps = pd.concat([first_half] * repeats, ignore_index=True)
col_names = first_reps.columns.get_level_values(2)
melt_first_half = pd.DataFrame(first_reps.values, columns=col_names)

将两者连接起来并根据列对生成的DF进行排序。

df_concat = pd.concat([melt_first_half, melt_second_half], axis=1)
df_concat.sort_values('value').reset_index(drop=True)

Image

答案 1 :(得分:2)

应该如此简单:

wide_df = pandas.read_excel(xlfile, sheetname, header=[0, 1, 2], index_col=[0, 1, 2, 3])

long_df = wide_df.stack().stack().stack()

这是一个模拟CSV文件的示例(请注意第4行标记索引,第1列标记标题级别):

from io import StringIO
from textwrap import dedent

import pandas

mockcsv = StringIO(dedent("""\
    num,,,this1,this1,this1,this1,that1,that1,that1,that1
    let,,,thisA,thisA,thatA,thatA,thisB,thisB,thatB,thatB
    animal,,,cat,dog,bird,lizard,cat,dog,bird,lizard
    a,b,c,,,,,,,,
    a1,b1,c1,x1,x2,x3,x4,x5,x6,x7,x8
    a1,b1,c2,y1,y2,y3,y4,y5,y6,y7,y8
    a1,b2,c1,z1,z2,z3,z4,z5,6z,zy,z8
"""))


wide_df = pandas.read_csv(mockcsv, index_col=[0, 1, 2], header=[0, 1, 2])
long_df = wide_df.stack().stack().stack()

所以wide_df看起来像这样:

num      this1                  that1                 
let      thisA     thatA        thisB     thatB       
animal     cat dog  bird lizard   cat dog  bird lizard
a  b  c                                               
a1 b1 c1    x1  x2    x3     x4    x5  x6    x7     x8
      c2    y1  y2    y3     y4    y5  y6    y7     y8
   b2 c1    z1  z2    z3     z4    z5  6z    zy     z8

long_df

a   b   c   animal  let    num  
a1  b1  c1  bird    thatA  this1    x3
                    thatB  that1    x7
            cat     thisA  this1    x1
                    thisB  that1    x5
            dog     thisA  this1    x2
                    thisB  that1    x6
            lizard  thatA  this1    x4
                    thatB  that1    x8
        c2  bird    thatA  this1    y3
                    thatB  that1    y7
            cat     thisA  this1    y1
                    thisB  that1    y5
            dog     thisA  this1    y2
                    thisB  that1    y6
            lizard  thatA  this1    y4
                    thatB  that1    y8
    b2  c1  bird    thatA  this1    z3
                    thatB  that1    zy
            cat     thisA  this1    z1
                    thisB  that1    z5
            dog     thisA  this1    z2
                    thisB  that1    6z
            lizard  thatA  this1    z4
                    thatB  that1    z8

通过OP中显示的文字数据,您可以通过执行以下操作来解决此问题:

index_names = ['a', 'b', 'c']
col_names = ['Level1', 'Level2', 'Level3']
df = (
    pandas.read_excel('Book1.xlsx', header=[0, 1, 2], index_col=[0, 1, 2, 3])
        .reset_index(level=0, drop=True)
        .rename_axis(index_names, axis='index')
        .rename_axis(col_names, axis='columns')
        .stack()
        .stack()
        .stack()
        .to_frame()
)

我认为棘手的部分是检查每个文件以确定index_names应该是什么。