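First off, I realize there are a lot of questions about efficiency, so I apologize if this is a duplicate, but I'm here because I couldn't find what I was looking for. I'll ask the question with an example:
I have some time series data that I import from Excel into a pandas DataFrame. It has the columns id | name | date | total_monthly_rtn,
and I'm performing some simple operations on it for reporting purposes (month, 3M, YTD, ITD, etc.). I then split this DataFrame by id / name and store each piece in an Account object: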
import pandas as ps
from pandas.tseries.offsets import YearEnd, DateOffset

class Account(object):
    def __init__(self, df):
        self.name = df['name'][0]
        self.id = df['id'][0]
        self.rtn = ps.TimeSeries(df['total_monthly_rtn'].values, index=ps.DatetimeIndex(df['date'], freq='M'))
        self.m = DateOffset(months=1)

# dataframe mentioned above with 4 columns (id;name;date;total_monthly_rtn)
data = parsexl(src_path)  # parsexl not needed for this question
valdate = max(data['date'])
y = YearEnd()

# an example of one slice of 'data'
foo = Account(data[data['id'] == '123456'])
# where 'data[data['id'] == '123456']' looks like:
id name date total_monthly_rtn
0 123456 Bank of Foo 2011-07-31 00:00:00 -2.75
1 123456 Bank of Foo 2011-08-31 00:00:00 -7.63
2 123456 Bank of Foo 2011-09-30 00:00:00 -4.03
3 123456 Bank of Foo 2011-10-31 00:00:00 5.68
4 123456 Bank of Foo 2011-11-30 00:00:00 -1.79
5 123456 Bank of Foo 2011-12-31 00:00:00 0.93
6 123456 Bank of Foo 2012-01-31 00:00:00 3.0773
7 123456 Bank of Foo 2012-02-29 00:00:00 5.4896
8 123456 Bank of Foo 2012-03-31 00:00:00 0.5089
9 123456 Bank of Foo 2012-04-30 00:00:00 -2.0739
10 123456 Bank of Foo 2012-05-31 00:00:00 -6.0472
11 123456 Bank of Foo 2012-06-30 00:00:00 4.7578
12 123456 Bank of Foo 2012-07-31 00:00:00 2.1529
13 123456 Bank of Foo 2012-08-31 00:00:00 1.0867
14 123456 Bank of Foo 2012-09-30 00:00:00 0.3791
15 123456 Bank of Foo 2012-10-31 00:00:00 1.143
16 123456 Bank of Foo 2012-11-30 00:00:00 3.3823
17 123456 Bank of Foo 2012-12-31 00:00:00 0.6535
18 123456 Bank of Foo 2013-01-31 00:00:00 7.3905
19 123456 Bank of Foo 2013-02-28 00:00:00 3.5779
20 123456 Bank of Foo 2013-03-31 00:00:00 2.3466
21 123456 Bank of Foo 2013-04-30 00:00:00 1.6874
22 123456 Bank of Foo 2013-05-31 00:00:00 0.6536
23 123456 Bank of Foo 2013-06-30 00:00:00 -2.7618
24 123456 Bank of Foo 2013-07-31 00:00:00 3.854
25 123456 Bank of Foo 2013-08-31 00:00:00 -3.6812
26 123456 Bank of Foo 2013-09-30 00:00:00 1.9478
27 123456 Bank of Foo 2013-10-31 00:00:00 3.9654
I originally wrote these two methods for Account:
    def ytd(self, ye, vd):
        return self.rtn.truncate(before=ye.rollback(vd), after=vd)[1:].sum()

    def year(self, vd):
        return self.rtn.truncate(before=vd-(11*self.m), after=vd).sum()

# called like:
foo.ytd(y, valdate)  # returns 18.9802
foo.year(valdate)    # returns 23.016
But then I started wondering: would it be better to store valdate and YearEnd as class attributes instead? That would change the two methods to:
    def ytd(self):
        return self.rtn.truncate(before=self.ye.rollback(self.vd), after=self.vd)[1:].sum()

    def year(self):
        return self.rtn.truncate(before=self.vd-(11*self.m), after=self.vd).sum()
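In my application, data holds about 8,000 rows representing 100 accounts, so it probably won't make a huge difference one way or the other, but what about in general? My gut tells me the first way is better, but if someone who knows their stuff could put my mind at ease, I'd appreciate it. Thanks.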
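== EDIT ==
I've only included two methods here, but in case it makes a difference, there are actually 10 methods that take valdate and YearEnd as variables.
== EDIT 2 ==
Sorry if my example confused anyone. In case you don't know: rtn = return; ytd = year to date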
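Answer 0 (score: 0)
If the question is only about performance (or, more precisely, speed): looking up a local name inside a function is faster than an attribute lookup, so your current solution is fine.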
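If you want to check the local-versus-attribute claim on your own machine, here is a minimal sketch using only the standard library. The Account class below is a hypothetical stand-in that holds a plain number rather than a pandas Series, and the method names with_argument and with_attribute are made up for illustration; only the lookup style differs between the two methods.

```python
import timeit


class Account:
    """Hypothetical stand-in for the asker's class (a plain number, no pandas)."""

    def __init__(self, vd):
        self.vd = vd

    def with_argument(self, vd):
        # vd is a local name inside the method
        return vd + vd + vd

    def with_attribute(self):
        # self.vd is an attribute lookup on every access
        return self.vd + self.vd + self.vd


acct = Account(10)

# Time many calls of each style; absolute numbers depend on machine and Python version
t_arg = timeit.timeit(lambda: acct.with_argument(10), number=200_000)
t_attr = timeit.timeit(lambda: acct.with_attribute(), number=200_000)

print(f"argument:  {t_arg:.4f} s")
print(f"attribute: {t_attr:.4f} s")
```

Whatever gap you measure is per name access, so for roughly 100 accounts it is likely to be dwarfed by the cost of truncate() and sum() on the Series itself. Readability is probably the better reason to choose one style over the other.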