Speeding up DataFrame iteration

Time: 2014-12-15 14:17:30

Tags: python-2.7 pandas

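I am trying to run the mibian.BS function over a DataFrame named df1 by iterating over it and assigning the result to a new column named 'Implied_Vola'. How can I speed up the whole procedure? Processing the original DataFrame with 3 million rows would take about 9000 minutes on my machine, which is far too long. Unfortunately, mibian.BS does not accept vector input, so it has to be applied to each row of the DataFrame one at a time.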

import mibian
import numpy
import time

start = time.time()
mask = (df1['ask'] > 0) & (df1['bid'] > 0) & (df1['call put'] == 'C') & (df1['Restlaufzeit'] > 0)

for index, row in df1.loc[mask].iterrows():
    # Build the row selector first so the except branch refers to the current row.
    mask2 = ((df1.index == index) & (df1['unadjusted stock price'] == row['unadjusted stock price']) & (df1['strike'] == row['strike']) & (df1['Zins'] == row['Zins']) & (df1['Restlaufzeit'] == row['Restlaufzeit']) & (df1['mean'] == row['mean']))
    try:
        c = mibian.BS([row['unadjusted stock price'], row['strike'], row['Zins'], row['Restlaufzeit']], callPrice=row['mean'])
        df1.loc[mask2, 'Implied_Vola'] = c.impliedVolatility
    except ZeroDivisionError:
        df1.loc[mask2, 'Implied_Vola'] = numpy.nan

end = time.time()
time = (end - start) / 60
print time, 'minutes'

df1.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2 entries, 2002-05-16 00:00:00 to 2002-05-16 00:00:00
Data columns (total 13 columns):
adjusted stock close price    2 non-null float64
expiration                    2 non-null datetime64[ns]
strike                        2 non-null int64
call put                      2 non-null object
ask                           2 non-null float64
bid                           2 non-null float64
volume                        2 non-null int64
open interest                 2 non-null int64
unadjusted stock price        2 non-null float64
Restlaufzeit                  2 non-null int32
Zins                          2 non-null float64
mean                          2 non-null float64
Implied_Vola                  2 non-null float64
dtypes: datetime64[ns](1), float64(7), int32(1), int64(3), object(1)
memory usage: 216.0+ bytes

I rewrote the loop without DataFrame.iterrows():

import mibian
import numpy
import time
df2=df1.copy()
start = time.time()
mask=(df2['ask'] > 0) & (df2['bid'] > 0) & (df2['call put'] == 'C') & (df2['Restlaufzeit']>0)
vola=[]
for row in df2.loc[mask].values:
    try:
        c = mibian.BS([row[8],row[2], row[10], row[9]], callPrice=row[11])
        vola.append(c.impliedVolatility)
    except  ZeroDivisionError, e:
        vola.append(numpy.nan)
df2.loc[mask,'vola'] = vola
end=time.time()
time=(end-start)/60
print time, 'minutes'

However, this gave no speedup. Should this be done differently somehow?

1 Answer:

Answer 0 (score: 1)

Looping over an ndarray is much faster than using df.iterrows().

Instead of

for index, row in df1.loc[mask].iterrows() :
    # DO STUFF with row Series

try

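for index, row in enumerate(df1.loc[mask].values):
    # DO STUFF with row, a NumPy array indexed by column position
    # (index is the position within the filtered rows, not the DataFrame label)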

You have to fall back to integer (positional) indexing, but it is much faster.
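
To make the positional indexing less fragile, here is a minimal sketch of the same loop that looks up the column positions by name once, before iterating. The function slow_scalar_func and the small example frame are hypothetical stand-ins (not mibian.BS and not the asker's data); only the column names are taken from the question. Treat it as an illustration of the pattern rather than a drop-in solution.

```python
import numpy as np
import pandas as pd


def slow_scalar_func(spot, strike, rate, days, price):
    # Hypothetical stand-in for a scalar-only call such as mibian.BS.
    # It raises ZeroDivisionError when spot equals strike.
    return float(price) / (float(spot) - float(strike)) * float(rate) / float(days)


# Small made-up frame that reuses the column names from the question.
df = pd.DataFrame({
    'strike': [90, 100, 110],
    'call put': ['C', 'C', 'P'],
    'unadjusted stock price': [100.0, 100.0, 100.0],
    'Restlaufzeit': [30, 30, 30],
    'Zins': [3.0, 3.0, 3.0],
    'mean': [10.0, 2.0, 1.0],
})

mask = (df['call put'] == 'C') & (df['Restlaufzeit'] > 0)

# Resolve column names to integer positions once, outside the loop.
cols = ['unadjusted stock price', 'strike', 'Zins', 'Restlaufzeit', 'mean']
pos = [df.columns.get_loc(c) for c in cols]

results = []
for row in df.loc[mask].values:  # each row is a plain NumPy array
    try:
        results.append(slow_scalar_func(*[row[p] for p in pos]))
    except ZeroDivisionError:
        results.append(np.nan)

# Write everything back in one step; rows outside the mask stay NaN.
# Assigning by position also works when the index has duplicate labels.
out = np.full(len(df), np.nan)
out[mask.values] = results
df['result'] = out
```

Collecting the results in a list and assigning them once avoids building a boolean mask over the whole frame for every row, which is likely a large part of the cost in the first loop of the question. If the scalar function itself dominates the runtime, this change alone may not help much.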