I have a (400, 328) DataFrame with the following structure:
import numpy as np
import pandas as pd

row_idx = pd.MultiIndex.from_product([['EU', 'ROW'],
                                      ['p01.a', 'p01.b', 'p02.1.a', 'p02.1.b', 'p02.1.c', 'p03']],
                                     names=['Region', 'Prod_code'])
col_idx = pd.MultiIndex.from_product([['EU', 'ROW'],
                                      ['i01.a', 'i01.b', 'i02.1.a', 'i03']],
                                     names=['Region', 'Ind_code'])
df_in = pd.DataFrame(np.random.randint(1, 10, (12, 8)), index=row_idx, columns=col_idx)
print(df_in)
Region               EU                         ROW
Ind_code          i01.a  i01.b  i02.1.a  i03  i01.a  i01.b  i02.1.a  i03
Region Prod_code
EU     p01.a          1      9        7    4      2      1        6    7
       p01.b          1      5        1    7      2      4        2    2
       p02.1.a        1      1        2    8      8      4        4    7
       p02.1.b        7      7        7    5      6      7        1    3
       p02.1.c        4      2        4    4      6      4        3    8
       p03            7      2        9    8      8      8        4    3
ROW    p01.a          4      4        5    5      5      1        6    2
       p01.b          5      2        3    4      9      4        9    6
       p02.1.a        4      4        8    8      4      7        6    6
       p02.1.b        7      9        3    2      1      5        4    1
       p02.1.c        4      2        1    2      9      8        8    5
       p03            6      7        6    6      6      9        7    5
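I need to obtain a symmetric (328, 328) DataFrame by summing together all rows whose Prod_code has no corresponding Ind_code (ignoring the initial letters "i" and "p"). The "extra" rows, in this case ('..', 'p02.1.b') and ('..', 'p02.1.c'), should be added to the first row with the corresponding parent code, in this case ('..', 'p02.1.a'), as shown below.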
#Desired output
print(df_out)
Region               EU                         ROW
Ind_code          i01.a  i01.b  i02.1.a  i03  i01.a  i01.b  i02.1.a  i03
Region Prod_code
EU     p01.a          1      9        7    4      2      1        6    7
       p01.b          1      5        1    7      2      4        2    2
       p02.1.a       12     10       13   17     20     15        8   18
       p03            7      2        9    8      8      8        4    3
ROW    p01.a          4      4        5    5      5      1        6    2
       p01.b          5      2        3    4      9      4        9    6
       p02.1.a       15     15       12   12     14     20       18   12
       p03            6      7        6    6      6      9        7    5
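How can I do this in an elegant, "Pythonic" way?
Answer 0 (score: 2)
You can group on index level 0 together with index level 1 sliced to its first five characters, and then sum the DataFrame values within each group. Starting from df: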
Region EU ROW
Ind_code i01.a i01.b i02.1.a i03 i01.a i01.b i02.1.a i03
Region Prod_code
EU p01.a 6 6 7 1 7 7 8 3
p01.b 8 6 6 7 7 1 2 9
p02.1.a 3 9 1 5 9 4 1 3
p02.1.b 4 2 1 7 1 4 8 8
p02.1.c 3 1 6 9 7 8 4 1
p03 2 2 3 8 1 6 3 7
ROW p01.a 8 4 9 7 7 9 1 6
p01.b 7 8 3 3 7 9 7 3
p02.1.a 7 3 4 5 7 7 7 4
p02.1.b 5 5 6 7 7 2 9 7
p02.1.c 4 8 7 5 3 7 7 8
p03 3 3 3 9 9 6 3 8
# Region labels and original product codes from the two index levels
region = df.index.get_level_values(0)
prod_code = df.index.get_level_values(1)
# Grouping key: the first five characters of each code, e.g. 'p02.1.b' -> 'p02.1'
prefix = prod_code.str.slice(stop=5)
# Remember the first original code of every (region, prefix) group
first_code = pd.Series(prod_code.to_numpy(), index=df.index).groupby([region, prefix]).first()
# Group the DataFrame on level 0 and the sliced level 1, then sum each group
df = df.groupby([region, prefix]).sum()
# Put the original codes back on level 1
# (set_levels(..., inplace=True) is not available in recent pandas versions)
df.index = pd.MultiIndex.from_arrays([df.index.get_level_values(0), first_code.to_numpy()],
                                     names=['Region', 'Prod_code'])
Out:
Region EU ROW
Ind_code i01.a i01.b i02.1.a i03 i01.a i01.b i02.1.a i03
Region Prod_code
EU p01.a 6 6 7 1 7 7 8 3
p01.b 8 6 6 7 7 1 2 9
p02.1.a 10 12 8 21 17 16 13 12
p03 2 2 3 8 1 6 3 7
ROW p01.a 8 4 9 7 7 9 1 6
p01.b 7 8 3 3 7 9 7 3
p02.1.a 16 16 17 17 17 16 23 19
p03 3 3 3 9 9 6 3 8
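The grouping idea in this answer can be checked end to end on the sample values printed in the question. What follows is a minimal sketch rather than the answerer's exact code: the helper name `collapse_rows` and its `width` parameter are hypothetical, and it assumes that codes sharing their first five characters belong to the same parent and that the first code in sort order is the label to keep.

```python
import numpy as np
import pandas as pd

row_idx = pd.MultiIndex.from_product(
    [['EU', 'ROW'], ['p01.a', 'p01.b', 'p02.1.a', 'p02.1.b', 'p02.1.c', 'p03']],
    names=['Region', 'Prod_code'])
col_idx = pd.MultiIndex.from_product(
    [['EU', 'ROW'], ['i01.a', 'i01.b', 'i02.1.a', 'i03']],
    names=['Region', 'Ind_code'])

# Values copied from the input table printed in the question
values = np.array([
    [1, 9, 7, 4, 2, 1, 6, 7],
    [1, 5, 1, 7, 2, 4, 2, 2],
    [1, 1, 2, 8, 8, 4, 4, 7],
    [7, 7, 7, 5, 6, 7, 1, 3],
    [4, 2, 4, 4, 6, 4, 3, 8],
    [7, 2, 9, 8, 8, 8, 4, 3],
    [4, 4, 5, 5, 5, 1, 6, 2],
    [5, 2, 3, 4, 9, 4, 9, 6],
    [4, 4, 8, 8, 4, 7, 6, 6],
    [7, 9, 3, 2, 1, 5, 4, 1],
    [4, 2, 1, 2, 9, 8, 8, 5],
    [6, 7, 6, 6, 6, 9, 7, 5],
])
df_in = pd.DataFrame(values, index=row_idx, columns=col_idx)


def collapse_rows(df, width=5):
    """Sum rows whose Prod_code shares the same first `width` characters.

    Hypothetical helper; `width=5` is an assumption that fits codes
    such as 'p02.1.a', 'p02.1.b', 'p02.1.c'.
    """
    region = df.index.get_level_values('Region')
    code = df.index.get_level_values('Prod_code')
    prefix = code.str.slice(stop=width)
    # First original code of each (region, prefix) group, used as the new label
    first_code = pd.Series(code.to_numpy(), index=df.index).groupby([region, prefix]).first()
    out = df.groupby([region, prefix]).sum()
    # Both groupby results are sorted by the same keys, so the rows line up
    out.index = pd.MultiIndex.from_arrays(
        [out.index.get_level_values(0), first_code.to_numpy()],
        names=df.index.names)
    return out


df_out = collapse_rows(df_in)
print(df_out)
```

A fixed-width prefix is simple but fragile: it only works while every parent code is exactly `width` characters long, so it is worth confirming against the full 400-row index before relying on it.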
Answer 1 (score: 0)
Since the other answer does not preserve the MultiIndex, you can do the following to keep it while computing the sums:
print(df_in)
Region EU ROW
Ind_code i01.a i01.b i02.1.a i03 i01.a i01.b i02.1.a i03
Region Prod_code
EU p01.a 5 9 6 3 4 7 7 3
p01.b 9 6 4 6 9 2 6 4
p02.1.a 7 1 4 6 4 2 7 4
p02.1.b 7 3 3 8 1 6 6 8
p02.1.c 2 4 7 8 9 5 5 3
p03 7 7 6 3 5 7 8 1
ROW p01.a 3 3 3 7 5 7 4 4
p01.b 8 8 1 4 4 3 3 3
p02.1.a 8 5 3 6 6 4 4 3
p02.1.b 8 1 3 5 5 5 6 5
p02.1.c 1 7 1 4 9 3 6 3
p03 3 6 1 5 1 8 4 1
Code and output:
# Get the level 1 values of the MultiIndex
Prod_code = df_in.index.get_level_values(1)
# Assign these values to a `Prod_code` column
df_in['Prod_code'] = Prod_code
# Replace level 1 with the sliced codes, which now contain duplicates
# (set_levels(..., inplace=True) is not available in recent pandas versions)
df_in.index = pd.MultiIndex.from_arrays(
    [df_in.index.get_level_values(0), Prod_code.str.slice(start=0, stop=5, step=1)],
    names=df_in.index.names)
# Group on levels 0 and 1 of the MultiIndex and keep, for each group,
# the first original code, as the OP requires
level_0 = df_in.index.get_level_values(0)
level_1 = df_in.index.get_level_values(1)
valuestoset = df_in.groupby([level_0, level_1])['Prod_code'].first()
# Sum the numeric columns of each group; numeric_only=True leaves out the helper column
df_out = df_in.groupby([level_0, level_1]).sum(numeric_only=True)
# Finally, write the saved codes back to level 1; both groupby results share the same row order
df_out.index = pd.MultiIndex.from_arrays(
    [df_out.index.get_level_values(0), valuestoset.to_numpy()],
    names=df_out.index.names)
print(df_out)
Region EU ROW
Ind_code i01.a i01.b i02.1.a i03 i01.a i01.b i02.1.a i03
Region Prod_code
EU p01.a 5 9 6 3 4 7 7 3
p01.b 9 6 4 6 9 2 6 4
p02.1.a 16 8 14 22 14 13 18 15
p03 7 7 6 3 5 7 8 1
ROW p01.a 3 3 3 7 5 7 4 4
p01.b 8 8 1 4 4 3 3 3
p02.1.a 17 13 7 15 20 12 16 11
p03 3 6 1 5 1 8 4 1
Explanation:
print(df_in.index.get_level_values(1))
Index(['p01.a', 'p01.b', 'p02.1.a', 'p02.1.b', 'p02.1.c', 'p03', 'p01.a',
'p01.b', 'p02.1.a', 'p02.1.b', 'p02.1.c', 'p03'],
dtype='object', name='Prod_code')
Prod_code = df_in.index.get_level_values(1)
df_in['Prod_code'] = Prod_code
df_in.index = pd.MultiIndex.from_arrays(
    [df_in.index.get_level_values(0), Prod_code.str.slice(start=0, stop=5, step=1)],
    names=df_in.index.names)
print(df_in)
Region EU ROW Prod_code
Ind_code i01.a i01.b i02.1.a i03 i01.a i01.b i02.1.a i03
Region Prod_code
EU p01.a 5 9 6 3 4 7 7 3 p01.a
p01.b 9 6 4 6 9 2 6 4 p01.b
p02.1 7 1 4 6 4 2 7 4 p02.1.a
p02.1 7 3 3 8 1 6 6 8 p02.1.b
p02.1 2 4 7 8 9 5 5 3 p02.1.c
p03 7 7 6 3 5 7 8 1 p03
ROW p01.a 3 3 3 7 5 7 4 4 p01.a
p01.b 8 8 1 4 4 3 3 3 p01.b
p02.1 8 5 3 6 6 4 4 3 p02.1.a
p02.1 8 1 3 5 5 5 6 5 p02.1.b
p02.1 1 7 1 4 9 3 6 3 p02.1.c
p03 3 6 1 5 1 8 4 1 p03
df_in.groupby([level_0,level_1])['Prod_code'].first()
Region Prod_code
EU p01.a p01.a
p01.b p01.b
p02.1 p02.1.a
p03 p03
ROW p01.a p01.a
p01.b p01.b
p02.1 p02.1.a
p03 p03
Name: Prod_code, dtype: object
df_in.groupby([level_0, level_1]).sum(numeric_only=True)
Region EU ROW
Ind_code i01.a i01.b i02.1.a i03 i01.a i01.b i02.1.a i03
Region Prod_code
EU p01.a 5 9 6 3 4 7 7 3
p01.b 9 6 4 6 9 2 6 4
p02.1 16 8 14 22 14 13 18 15
p03 7 7 6 3 5 7 8 1
ROW p01.a 3 3 3 7 5 7 4 4
p01.b 8 8 1 4 4 3 3 3
p02.1 17 13 7 15 20 12 16 11
p03 3 6 1 5 1 8 4 1
df_out.index = pd.MultiIndex.from_arrays(
    [df_out.index.get_level_values(0), valuestoset.to_numpy()],
    names=df_out.index.names)
print(df_out)
Region EU ROW
Ind_code i01.a i01.b i02.1.a i03 i01.a i01.b i02.1.a i03
Region Prod_code
EU p01.a 5 9 6 3 4 7 7 3
p01.b 9 6 4 6 9 2 6 4
p02.1.a 16 8 14 22 14 13 18 15
p03 7 7 6 3 5 7 8 1
ROW p01.a 3 3 3 7 5 7 4 4
p01.b 8 8 1 4 4 3 3 3
p02.1.a 17 13 7 15 20 12 16 11
p03 3 6 1 5 1 8 4 1
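Both answers group on a fixed-width slice of Prod_code. A sketch that follows the question's wording more literally, keeping every Prod_code that has a matching Ind_code and folding each unmatched row into the nearest matched row above it, might look like the following. This is an illustration under stated assumptions, not part of either answer: the helper name `fold_unmatched_rows` is hypothetical, and it assumes that extra codes directly follow their parent row and that the first row of each region has a matching Ind_code.

```python
import numpy as np
import pandas as pd

row_idx = pd.MultiIndex.from_product(
    [['EU', 'ROW'], ['p01.a', 'p01.b', 'p02.1.a', 'p02.1.b', 'p02.1.c', 'p03']],
    names=['Region', 'Prod_code'])
col_idx = pd.MultiIndex.from_product(
    [['EU', 'ROW'], ['i01.a', 'i01.b', 'i02.1.a', 'i03']],
    names=['Region', 'Ind_code'])

# Values copied from the input table printed in the question
values = np.array([
    [1, 9, 7, 4, 2, 1, 6, 7],
    [1, 5, 1, 7, 2, 4, 2, 2],
    [1, 1, 2, 8, 8, 4, 4, 7],
    [7, 7, 7, 5, 6, 7, 1, 3],
    [4, 2, 4, 4, 6, 4, 3, 8],
    [7, 2, 9, 8, 8, 8, 4, 3],
    [4, 4, 5, 5, 5, 1, 6, 2],
    [5, 2, 3, 4, 9, 4, 9, 6],
    [4, 4, 8, 8, 4, 7, 6, 6],
    [7, 9, 3, 2, 1, 5, 4, 1],
    [4, 2, 1, 2, 9, 8, 8, 5],
    [6, 7, 6, 6, 6, 9, 7, 5],
])
df_in = pd.DataFrame(values, index=row_idx, columns=col_idx)


def fold_unmatched_rows(df):
    """Fold rows without a matching Ind_code into the matched row above (hypothetical helper)."""
    # Industry codes without their leading 'i', e.g. 'i02.1.a' -> '02.1.a'
    ind_codes = {c[1:] for c in df.columns.get_level_values('Ind_code')}
    region = df.index.get_level_values('Region')
    code = pd.Series(df.index.get_level_values('Prod_code'))
    # A product code is kept when, without its leading 'p', it is also an industry code
    matched = code.str[1:].isin(ind_codes)
    # Unmatched codes become missing and inherit the last matched code above them
    target = code.where(matched).ffill()
    # sort=False keeps the rows in their original order of appearance
    out = df.groupby([region, target.to_numpy()], sort=False).sum()
    out.index = out.index.set_names(list(df.index.names))
    return out


df_out = fold_unmatched_rows(df_in)
print(df_out)
```

Because the grouping key comes from the column labels themselves, the result has one row per Ind_code in each region regardless of how long the parent codes are. The forward fill does depend on row order, so the index should be sorted by region and code first if that is not already guaranteed.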