我想从Excel文件中生成整洁的数据,看起来像这样,有三个级别的"合并"头:
Pandas使用多级标题读取文件很好:
# df = pandas.read_excel('test.xlsx', header=[0,1,2])
对于重复性,您可以复制粘贴:
df = pandas.DataFrame({('Unnamed: 0_level_0', 'Unnamed: 0_level_1', 'a'): {1: 'aX', 2: 'aY'}, ('Unnamed: 1_level_0', 'Unnamed: 1_level_1', 'b'): {1: 'bX', 2: 'bY'}, ('Unnamed: 2_level_0', 'Unnamed: 2_level_1', 'c'): {1: 'cX', 2: 'cY'}, ('level1_1', 'level2_1', 'level3_1'): {1: 1, 2: 10}, ('level1_1', 'level2_1', 'level3_2'): {1: 2, 2: 20}, ('level1_1', 'level2_2', 'level3_1'): {1: 3, 2: 30}, ('level1_1', 'level2_2', 'level3_2'): {1: 4, 2: 40}, ('level1_2', 'level2_1', 'level3_1'): {1: 5, 2: 50}, ('level1_2', 'level2_1', 'level3_2'): {1: 6, 2: 60}, ('level1_2', 'level2_2', 'level3_1'): {1: 7, 2: 70}, ('level1_2', 'level2_2', 'level3_2'): {1: 8, 2: 80}})
我想对其进行规范化,以便级别标题位于变量行中,但将列a,b和c保留为列:
如果没有多级标题,我会pandas.melt(df, id_vars=['a', 'b', 'c'])
来获取我想要的内容。 pandas.melt(df)
为我提供了我想要的三个变量列,但显然不会保留a,b和c列。
答案 0 :(得分:2)
将DF分成两部分,以便于熔化并将它们连接起来。
first_half = df.iloc[:, :3]
second_half = df.iloc[:, 3:]
融化第二个片段。
melt_second_half = pd.melt(second_half)
通过将熔化的DF
中的行数除以其自身长度来计算找到的值,重复第一个片段中的值。
repeats = int(melt_second_half.shape[0]/first_half.shape[0])
first_reps = pd.concat([first_half] * repeats, ignore_index=True)
col_names = first_reps.columns.get_level_values(2)
melt_first_half = pd.DataFrame(first_reps.values, columns=col_names)
将两者连接起来并根据值列对生成的DF
进行排序。
df_concat = pd.concat([melt_first_half, melt_second_half], axis=1)
df_concat.sort_values('value').reset_index(drop=True)
答案 1 :(得分:2)
应该如此简单:
wide_df = pandas.read_excel(xlfile, sheetname, header=[0, 1, 2], index_col=[0, 1, 2, 3])
long_df = wide_df.stack().stack().stack()
这是一个模拟CSV文件的示例(请注意第4行标记索引,第1列标记标题级别):
from io import StringIO
from textwrap import dedent
import pandas
mockcsv = StringIO(dedent("""\
num,,,this1,this1,this1,this1,that1,that1,that1,that1
let,,,thisA,thisA,thatA,thatA,thisB,thisB,thatB,thatB
animal,,,cat,dog,bird,lizard,cat,dog,bird,lizard
a,b,c,,,,,,,,
a1,b1,c1,x1,x2,x3,x4,x5,x6,x7,x8
a1,b1,c2,y1,y2,y3,y4,y5,y6,y7,y8
a1,b2,c1,z1,z2,z3,z4,z5,6z,zy,z8
"""))
wide_df = pandas.read_csv(mockcsv, index_col=[0, 1, 2], header=[0, 1, 2])
long_df = wide_df.stack().stack().stack()
所以wide_df
看起来像这样:
num this1 that1
let thisA thatA thisB thatB
animal cat dog bird lizard cat dog bird lizard
a b c
a1 b1 c1 x1 x2 x3 x4 x5 x6 x7 x8
c2 y1 y2 y3 y4 y5 y6 y7 y8
b2 c1 z1 z2 z3 z4 z5 6z zy z8
long_df
a b c animal let num
a1 b1 c1 bird thatA this1 x3
thatB that1 x7
cat thisA this1 x1
thisB that1 x5
dog thisA this1 x2
thisB that1 x6
lizard thatA this1 x4
thatB that1 x8
c2 bird thatA this1 y3
thatB that1 y7
cat thisA this1 y1
thisB that1 y5
dog thisA this1 y2
thisB that1 y6
lizard thatA this1 y4
thatB that1 y8
b2 c1 bird thatA this1 z3
thatB that1 zy
cat thisA this1 z1
thisB that1 z5
dog thisA this1 z2
thisB that1 6z
lizard thatA this1 z4
thatB that1 z8
通过OP中显示的文字数据,您可以通过执行以下操作来解决此问题:
index_names = ['a', 'b', 'c']
col_names = ['Level1', 'Level2', 'Level3']
df = (
pandas.read_excel('Book1.xlsx', header=[0, 1, 2], index_col=[0, 1, 2, 3])
.reset_index(level=0, drop=True)
.rename_axis(index_names, axis='index')
.rename_axis(col_names, axis='columns')
.stack()
.stack()
.stack()
.to_frame()
)
我认为棘手的部分是检查每个文件以确定index_names
应该是什么。