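I have a large Excel file with several thousand rows and about 100 columns. The problem is that the index column mixes roughly 50 metrics (Sales, Houses, People) with 70 companies. What I really want is two index levels, one for the metric and one for the company. Take the following code as an example: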
import pandas as pd
import numpy as np
idx=['Sales','Company 1', 'Company 2', 'Company 3','Houses','Company 1',
'Company 2', 'Company 3','People','Company 1', 'Company 2', 'Company 3']
dt=['2010','2011','2012','2013']
data = np.array([np.arange(12)]*4).T
df = pd.DataFrame(data, index=idx, columns=dt)
df.iloc[4, :] = 0  # zero out the 'Houses' header row
df.iloc[8, :] = 0  # zero out the 'People' header row
df
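The result looks like the attached image.
My question is: how do I manipulate the DataFrame so that the first index level is Sales, Sales, Sales, ... and the second index level is Company 1, Company 2, Company 3 for each metric (Sales, Houses, and so on)?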
Answer 0 (score: 2)
Setup
c = 3 # number of companies
metrics = df.index[::c+1]
companies = df.index[1:c+1]
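This answer simply takes the metrics, finds the companies, builds a MultiIndex, and reassigns it. It relies on the assumption that the companies appear in the same order under every metric: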
idx = pd.MultiIndex.from_product([metrics, companies])
df.drop(df.index[::c+1]).set_index(idx)
                  2010  2011  2012  2013
Sales  Company 1     1     1     1     1
       Company 2     2     2     2     2
       Company 3     3     3     3     3
Houses Company 1     5     5     5     5
       Company 2     6     6     6     6
       Company 3     7     7     7     7
People Company 1     9     9     9     9
       Company 2    10    10    10    10
       Company 3    11    11    11    11
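Before relying on `from_product`, it may be worth confirming that the company labels really do repeat in the same order under each metric. The following is a minimal sketch and not part of the original answer; the names `labels`, `blocks`, and `same_order` are illustrative, and it assumes every metric row is followed by exactly `c` company rows.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question.
idx = ['Sales', 'Company 1', 'Company 2', 'Company 3',
       'Houses', 'Company 1', 'Company 2', 'Company 3',
       'People', 'Company 1', 'Company 2', 'Company 3']
df = pd.DataFrame(np.array([np.arange(12)] * 4).T,
                  index=idx, columns=['2010', '2011', '2012', '2013'])

c = 3  # number of companies per metric (assumed constant)
labels = np.asarray(df.index)

# Reshape the flat labels into one row per metric block:
# column 0 holds the metric name, the remaining c columns hold the companies.
blocks = labels.reshape(-1, c + 1)

# True only if every block lists the companies in the same order as the first.
same_order = bool((blocks[:, 1:] == blocks[0, 1:]).all())
```

If `same_order` comes back `False`, or the reshape fails because the row count is not a multiple of `c + 1`, the `from_product` shortcut is probably not safe for that data.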
If you cannot guarantee this constraint, it gets a bit trickier:
u = pd.Series(df.index)
idx = u.groupby(u.index // (c + 1)).transform('first') + '|' + u
f = df.drop(df.index[::c+1])
f[['metric', 'company']] = (idx.drop(idx.index[::c+1])
.str.split('|', expand=True).set_index(f.index))
f.set_index(['metric', 'company'])
                  2010  2011  2012  2013
metric company
Sales  Company 1     1     1     1     1
       Company 2     2     2     2     2
       Company 3     3     3     3     3
Houses Company 1     5     5     5     5
       Company 2     6     6     6     6
       Company 3     7     7     7     7
People Company 1     9     9     9     9
       Company 2    10    10    10    10
       Company 3    11    11    11    11
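If even the number of companies per metric can vary, one possible alternative is to mark the metric rows and forward-fill them down each block. This is a sketch rather than anything from the answer above. It assumes company rows can be recognised by a `Company` prefix, which holds for the example but may not hold for the real Excel file; `is_company`, `metric`, and `out` are illustrative names.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question.
idx = ['Sales', 'Company 1', 'Company 2', 'Company 3',
       'Houses', 'Company 1', 'Company 2', 'Company 3',
       'People', 'Company 1', 'Company 2', 'Company 3']
df = pd.DataFrame(np.array([np.arange(12)] * 4).T,
                  index=idx, columns=['2010', '2011', '2012', '2013'])

labels = pd.Series(df.index)

# Assumption: a row is a company row if its label starts with "Company".
is_company = labels.str.startswith('Company')

# Keep the label on metric rows only, then forward-fill it down each block.
metric = labels.where(~is_company).ffill()

# Drop the metric header rows positionally and attach the two-level index.
out = df[is_company.to_numpy()].copy()
out.index = pd.MultiIndex.from_arrays(
    [metric[is_company].to_numpy(), labels[is_company].to_numpy()],
    names=['metric', 'company'],
)
```

Because the header rows are dropped by position rather than by label, this variant does not depend on a fixed `c`. With real data, the `startswith` test would need to be replaced by whatever actually distinguishes metric rows, for example membership in a known list of metric names.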