Pandas performance problem with DataFrame column "rename" and "drop"

Time: 2013-10-03 14:40:28

Tags: performance pandas

Below is the line_profiler record for the function:

Wrote profile results to FM_CORE.py.lprof
Timer unit: 2.79365e-07 s

File: F:\FM_CORE.py
Function: _rpt_join at line 1068
Total time: 1.87766 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1068                                           @profile
  1069                                           def _rpt_join(dfa, dfb, join_type='inner'):
  1070                                               ''' join two dataframe together by ('STK_ID','RPT_Date') multilevel index.
  1071                                                   'join_type' can be 'inner' or 'outer'
  1072                                               '''
  1073                                           
  1074         2           56     28.0      0.0      try:    # ('STK_ID','RPT_Date') are normal column
  1075         2      2936668 1468334.0     43.7          rst = pd.merge(dfa, dfb, how=join_type, on=['STK_ID','RPT_Date'], left_index=True, right_index=True)
  1076                                               except: # ('STK_ID','RPT_Date') are index
  1077                                                   rst = pd.merge(dfa, dfb, how=join_type, left_index=True, right_index=True)
  1078                                                   
  1079                                           
  1080         2           81     40.5      0.0      try: # handle 'STK_Name
  1081         2       426472 213236.0      6.3          name_combine = pd.concat([dfa.STK_Name, dfb.STK_Name])
  1082                                                   
  1083                                                   
  1084         2       900584 450292.0     13.4          nameseries = name_combine[-Series(name_combine.index.values, name_combine.index).duplicated()]
  1085                                                   
  1086         2      1138140 569070.0     16.9          rst.STK_Name_x = nameseries
  1087         2       596768 298384.0      8.9          rst = rst.rename(columns={'STK_Name_x': 'STK_Name'})
  1088         2       722293 361146.5     10.7          rst = rst.drop(['STK_Name_y'], axis=1)
  1089                                               except:
  1090                                                   pass
  1091                                           
  1092         2           94     47.0      0.0      return rst

These two lines surprised me:

  1087         2       596768 298384.0      8.9          rst = rst.rename(columns={'STK_Name_x': 'STK_Name'})
  1088         2       722293 361146.5     10.7          rst = rst.drop(['STK_Name_y'], axis=1)

Why do simple DataFrame column "rename" and "drop" operations take so much time (8.9% + 10.7%)? The "merge" itself only takes 43.7%, and "rename"/"drop" do not look like compute-intensive operations. How can this be improved?
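
(A note on the likely cause and a possible workaround, for reference only: rename and drop each return a new DataFrame, so each of those two lines copies the whole result frame once, which is not cheap on a wide frame. Passing suffixes to pd.merge so that dfa's STK_Name keeps its original name removes the need for the rename, and deleting the right-hand column with del avoids the copy made by drop(..., axis=1). The sketch below is an assumption, not the asker's code: the _rpt_join_fast name is made up, it skips the STK_Name-combining step from the question, and it only handles the case where STK_ID and RPT_Date are regular columns. Whether it is actually faster would have to be profiled on the real data.)

    import pandas as pd

    def _rpt_join_fast(dfa, dfb, join_type='inner'):
        # suffixes=('', '_y') keeps dfa's STK_Name column name unchanged,
        # so no rename (and no extra full-frame copy) is needed afterwards.
        rst = pd.merge(dfa, dfb, how=join_type, on=['STK_ID', 'RPT_Date'],
                       suffixes=('', '_y'))
        if 'STK_Name_y' in rst.columns:
            # del removes the column in place instead of building a new
            # DataFrame the way drop([...], axis=1) does.
            del rst['STK_Name_y']
        return rst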

0 Answers:

There are no answers yet.