How to divide the values of a pandas DataFrame by the first row of each group?

Time: 2016-11-22 01:42:16

Tags: python pandas normalization

The pandas DataFrame:

>>> df
                  sales  net_pft
STK_ID RPT_Date                 
002138 20140930   3.325    0.607
       20150930   3.619    0.738
       20160930   4.779    0.948
600004 20140930  13.986    2.205
       20150930  14.226    3.080
       20160930  15.499    3.619
600660 20140930  31.773    5.286
       20150930  31.040    6.333
       20160930  40.062    7.186

I just want to know how to get the output below, in which each row's values are divided by the first row of its group:

                  sales  net_pft
STK_ID RPT_Date                 
002138 20140930   1.000    1.000
       20150930   1.088    1.216
       20160930   1.437    1.562
600004 20140930   1.000    1.000
       20150930   1.017    1.397
       20160930   1.108    1.641
600660 20140930   1.000    1.000
       20150930   0.977    1.198
       20160930   1.261    1.359

Thanks,

1 Answer:

Answer 0: (score: 1)

import pandas as pd

df = pd.DataFrame({
    'RPT_Date': ['20140930', '20150930', '20160930', '20140930', '20150930', '20160930', '20140930', '20150930', '20160930'],
    'STK_ID': ['002138', '002138', '002138', '600004', '600004', '600004', '600660', '600660', '600660'],
    'net_pft': [0.607, 0.738, 0.948, 2.205, 3.080, 3.619, 5.286, 6.333, 7.186],
    'sales': [3.325, 3.619, 4.779, 13.986, 14.226, 15.499, 31.773, 31.040, 40.062]})
df = df.set_index(['STK_ID', 'RPT_Date'])

# Same shape as df; every row holds the first value of its STK_ID group
firsts = df.groupby(level=['STK_ID']).transform('first')
result = df / firsts

yields

                  net_pft     sales
STK_ID RPT_Date                    
002138 20140930  1.000000  1.000000
       20150930  1.215815  1.088421
       20160930  1.561779  1.437293
600004 20140930  1.000000  1.000000
       20150930  1.396825  1.017160
       20160930  1.641270  1.108180
600660 20140930  1.000000  1.000000
       20150930  1.198070  0.976930
       20160930  1.359440  1.260882

The main trick above is to use groupby/transform('first') to create a DataFrame that has the same shape as df, but whose values come from the first row of each group:

firsts = df.groupby(level=['STK_ID']).transform('first')
#                  net_pft   sales
# STK_ID RPT_Date                 
# 002138 20140930    0.607   3.325
#        20150930    0.607   3.325
#        20160930    0.607   3.325
# 600004 20140930    2.205  13.986
#        20150930    2.205  13.986
#        20160930    2.205  13.986
# 600660 20140930    5.286  31.773
#        20150930    5.286  31.773
#        20160930    5.286  31.773

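Although this is a profligate use of memory (firsts is a full-size copy of df), it is probably the fastest way to get the desired result, because it avoids looping over the groups in Python.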
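A possible alternative spelling, for comparison: take one row per group with first() and let DataFrame.div broadcast it across the STK_ID level of the MultiIndex. This is only a sketch on a shortened copy of the data; the names group_firsts and result_alt are made up here, and whether it actually uses less memory than the transform version has not been measured.

```python
import pandas as pd

# Shortened copy of the question's data (two of the three stocks).
df = pd.DataFrame({
    'STK_ID': ['002138'] * 3 + ['600004'] * 3,
    'RPT_Date': ['20140930', '20150930', '20160930'] * 2,
    'sales': [3.325, 3.619, 4.779, 13.986, 14.226, 15.499],
    'net_pft': [0.607, 0.738, 0.948, 2.205, 3.080, 3.619],
}).set_index(['STK_ID', 'RPT_Date'])

# One row per group, indexed by STK_ID only.
group_firsts = df.groupby(level='STK_ID').first()

# Broadcast the per-group rows across the STK_ID level of df's MultiIndex.
result_alt = df.div(group_firsts, level='STK_ID')

# The transform-based version from the answer, for comparison.
result = df / df.groupby(level='STK_ID').transform('first')

print(result_alt.round(3))
```

On this data both result and result_alt should hold the same ratios; round(3) is only there to match the three decimals shown in the question.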

If the groupby/transform code above raises TypeError: Transform function invalid for data types in Pandas version 0.13, you could try this workaround:

result = list()
for key, grp in df.groupby(level=['STK_ID']):
    result.append(grp/grp.iloc[0])
result = pd.concat(result)
print(result)
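One difference between the two approaches that may be worth checking on your own data: as far as I know, groupby's 'first' returns the first non-null value of each column within a group, whereas grp.iloc[0] is strictly the first row. On the data in the question there are no missing values, so both give the same result. The sketch below uses a made-up toy frame (toy, via_transform and via_loop are hypothetical names) to show where they can diverge.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: group 'a' has a missing value in its first row.
toy = pd.DataFrame(
    {'x': [np.nan, 2.0, 4.0, 4.0, 8.0]},
    index=pd.Index(['a', 'a', 'a', 'b', 'b'], name='grp'),
)

# 'first' skips the NaN, so group 'a' is divided by 2.0.
via_transform = toy / toy.groupby(level='grp').transform('first')

# iloc[0] takes the literal first row, so group 'a' is divided by NaN.
via_loop = pd.concat([g / g.iloc[0] for _, g in toy.groupby(level='grp')])

print(via_transform)
print(via_loop)
```

If the first row of a group can contain missing values, decide which behaviour you want before picking one of the two versions.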