Pandas DataFrame.apply() assigns values to the wrong DataFrame columns

Time: 2015-10-15 01:01:35

Tags: python pandas apply

My code uses DataFrame.apply() to call a function. The function returns multiple values as a pandas.Series. However, DataFrame.apply() assigns the values to the wrong columns.

The code below tries to return dte, mark, and iv. The values are printed just before the return statement to verify them.

import pandas as pd
from pandas import Timestamp
from pandas.tseries.holiday import get_calendar, HolidayCalendarFactory, GoodFriday
from datetime import datetime
from math import sqrt, pi, log, exp, isnan
from scipy.stats import norm


# dff = Daily Fed Funds Rate https://research.stlouisfed.org/fred2/data/DFF.csv
dff = pd.read_csv('https://research.stlouisfed.org/fred2/data/DFF.csv', parse_dates=[0], index_col='DATE')
rf = float('%.4f' % (dff['VALUE'][-1:][0] / 100))
tradingMinutesDay = 450                                 # 7.5 hours per day * 60 minutes per hour
tradingMinutesAnnum = 113400                            # trading minutes per day * 252 trading days per year
USFedCal = get_calendar('USFederalHolidayCalendar')     # Load US Federal holiday calendar
USFedCal.rules.pop(7)                                   # Remove Veteran's Day
USFedCal.rules.pop(6)                                   # Remove Columbus Day
tradingCal = HolidayCalendarFactory('TradingCalendar', USFedCal, GoodFriday)    # Add Good Friday
cal = tradingCal()


def newtonRap(row):
    # Initialize variables
    dte, mark, iv = 0.0, 0.0, 0.0
    if row['Bid'] == 0.0 or row['Ask'] == 0.0 or row['RootPrice'] == 0.0 or row['Strike'] == 0.0 or \
       row['TimeStamp'] == row['Expiry']:
        iv, vega = 0.0, 0.0         # Set iv and vega to zero if option contract is invalid or expired
    else:
        # dte (Days to expiration) uses pandas bdate_range method to determine the number of business days to expiration
        #   minus USFederalHolidays minus constant of 1 for the TimeStamp date
        dte = float(len(pd.bdate_range(row['TimeStamp'], row['Expiry'])) -
                    len(cal.holidays(row['TimeStamp'], row['Expiry']).to_pydatetime()) - 1)
        mark = (row['Bid'] + row['Ask']) / 2
        cp = 1 if row['OptType'] == 'C' else -1
        S = row['RootPrice']
        K = row['Strike']
        T = (dte * tradingMinutesDay) / tradingMinutesAnnum
        iv = sqrt(2 * pi / T) * mark / S        # Initialize IV (Brenner and Subrahmanyam 1988)
        vega = 0.0                              # Initialize vega
        for i in range(1, 100):
            d1 = (log(S / K) + T * (rf + iv ** 2 / 2)) / (iv * sqrt(T))
            d2 = d1 - iv * sqrt(T)
            vega = S * norm.pdf(d1) * sqrt(T)
            model = cp * S * norm.cdf(cp * d1) - cp * K * exp(-rf * T) * norm.cdf(cp * d2)
            iv -= (model - mark) / vega
            if abs(model - mark) < 1.0e-5:
                break
        if isnan(iv) or isnan(vega):
            iv, vega = 0.0, 0.0
    print 'DTE', dte, 'Mark', mark, 'newtRaphIV', iv
    return pd.Series({'DTE': dte, 'Mark': mark, 'IV': iv})


if __name__ == "__main__":
    # sample  data
    col_order = ['TimeStamp', 'OpraSymbol', 'RootSymbol', 'Expiry', 'Strike', 'OptType', 'RootPrice', 'Last', 'Bid', 'Ask', 'Volume', 'OpenInt', 'IV']
    df = pd.DataFrame({'Ask': {0: 3.7000000000000002, 1: 2.4199999999999999, 2: 3.0, 3: 2.7999999999999998, 4: 2.4500000000000002, 5: 3.25, 6: 5.9500000000000002, 7: 6.2999999999999998},
                       'Bid': {0: 3.6000000000000001, 1: 2.3399999999999999, 2: 2.8599999999999999, 3: 2.7400000000000002, 4: 2.4399999999999999, 5: 3.1000000000000001, 6: 5.7000000000000002, 7: 6.0999999999999996},
                       'Expiry': {0: Timestamp('2015-10-16 16:00:00'), 1: Timestamp('2015-10-16 16:00:00'), 2: Timestamp('2015-10-16 16:00:00'), 3: Timestamp('2015-10-16 16:00:00'), 4: Timestamp('2015-10-16 16:00:00'), 5: Timestamp('2015-10-16 16:00:00'), 6: Timestamp('2015-11-20 16:00:00'), 7: Timestamp('2015-11-20 16:00:00')},
                       'IV': {0: 0.3497, 1: 0.3146, 2: 0.3288, 3: 0.3029, 4: 0.3187, 5: 0.2926, 6: 0.3635, 7: 0.3842},
                       'Last': {0: 3.46, 1: 2.34, 2: 3.0, 3: 2.81, 4: 2.35, 5: 3.20, 6: 5.90, 7: 6.15},
                       'OpenInt': {0: 1290.0, 1: 3087.0, 2: 28850.0, 3: 44427.0, 4: 2318.0, 5: 3773.0, 6: 17112.0, 7: 15704.0},
                       'OpraSymbol': {0: 'AAPL151016C00109000', 1: 'AAPL151016P00109000', 2: 'AAPL151016C00110000', 3: 'AAPL151016P00110000', 4: 'AAPL151016C00111000', 5: 'AAPL151016P00111000', 6: 'AAPL151120C00110000', 7: 'AAPL151120P00110000'},
                       'OptType': {0: 'C', 1: 'P', 2: 'C', 3: 'P', 4: 'C', 5: 'P', 6: 'C', 7: 'P'},
                       'RootPrice': {0: 109.95, 1: 109.95, 2: 109.95, 3: 109.95, 4: 109.95, 5: 109.95, 6: 109.95, 7: 109.95},
                       'RootSymbol': {0: 'AAPL', 1: 'AAPL', 2: 'AAPL', 3: 'AAPL', 4: 'AAPL', 5: 'AAPL', 6: 'AAPL', 7: 'AAPL'},
                       'Strike': {0: 109.0, 1: 109.0, 2: 110.0, 3: 110.0, 4: 111.0, 5: 111.0, 6: 110.0, 7: 110.0},
                       'TimeStamp': {0: Timestamp('2015-09-30 16:00:00'), 1: Timestamp('2015-09-30 16:00:00'), 2: Timestamp('2015-09-30 16:00:00'), 3: Timestamp('2015-09-30 16:00:00'), 4: Timestamp('2015-09-30 16:00:00'), 5: Timestamp('2015-09-30 16:00:00'), 6: Timestamp('2015-09-30 16:00:00'), 7: Timestamp('2015-09-30 16:00:00')},
                       'Volume': {0: 1565.0, 1: 3790.0, 2: 10217.0, 3: 12113.0, 4: 6674.0, 5: 2031.0, 6: 5330.0, 7: 3724.0}})
    df = df[col_order]


    df[['DTE', 'Mark', 'newtRaphIV']] = df.apply(newtonRap, axis=1)
    print df[['DTE', 'Mark', 'newtRaphIV']]

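When I print the DataFrame columns for dte, mark, and iv, the iv values end up in the Mark column and the mark values end up in the newtRaphIV column.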

See the output below:

DTE 12.0 Mark 3.65 newtRaphIV 0.330446529117
DTE 12.0 Mark 2.38 newtRaphIV 0.297287843836
DTE 12.0 Mark 2.93 newtRaphIV 0.308354580411
DTE 12.0 Mark 2.77 newtRaphIV 0.287119199001
DTE 12.0 Mark 2.445 newtRaphIV 0.305461340472
DTE 12.0 Mark 3.175 newtRaphIV 0.272517270403
DTE 37.0 Mark 5.825 newtRaphIV 0.347642501561
DTE 37.0 Mark 6.2 newtRaphIV 0.368273860485
   DTE      Mark  newtRaphIV
0   12  0.330447       3.650
1   12  0.297288       2.380
2   12  0.308355       2.930
3   12  0.287119       2.770
4   12  0.305461       2.445
5   12  0.272517       3.175
6   37  0.347643       5.825
7   37  0.368274       6.200

This is not the behavior I expected. What is going on?

1 Answer:

Answer 0: (score: 4)

df.apply(newtonRap, axis=1)

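is a DataFrame with the columns ['DTE', 'Mark', 'IV'], but the order of those columns is not guaranteed (see below for why). To fix the column order, you can either fix the order of the index of the Series returned by newtonRap: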

return pd.Series((dte, mark, iv), index=['DTE','Mark','IV'])

or fix the order of the columns after df.apply returns:

df[['DTE', 'Mark', 'newtRaphIV']] = df.apply(newtonRap, axis=1)[['DTE', 'Mark', 'IV']]

The first option is better because

df.apply(newtonRap, axis=1)[['DTE', 'Mark', 'IV']]

creates two intermediate DataFrames, df.apply(newtonRap, axis=1) and df.apply(newtonRap, axis=1)[['DTE', 'Mark', 'IV']], whereas the first option creates the correct DataFrame from the start.
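The two options above can be sketched on toy data as follows. This is only a minimal illustration: the quotes frame, the constant DTE and IV values, and the helper names with_explicit_index and with_dict are made up for the example and are not part of the original code.

```python
import pandas as pd

# Toy stand-in for the question's option data (illustrative values only).
quotes = pd.DataFrame({'Bid': [3.6, 2.34], 'Ask': [3.7, 2.42]})


def with_explicit_index(row):
    # Option 1: the Series index spells out the column order.
    mark = (row['Bid'] + row['Ask']) / 2
    return pd.Series((12.0, mark, 0.3), index=['DTE', 'Mark', 'IV'])


def with_dict(row):
    # Hypothetical variant that builds the Series from a dict, with the keys
    # deliberately listed in a different order than the target columns.
    mark = (row['Bid'] + row['Ask']) / 2
    return pd.Series({'IV': 0.3, 'DTE': 12.0, 'Mark': mark})


# Option 1: the result of apply already has the right column order.
option1 = quotes.copy()
option1[['DTE', 'Mark', 'newtRaphIV']] = quotes.apply(with_explicit_index, axis=1)

# Option 2: reorder the columns by label after apply returns.
option2 = quotes.copy()
option2[['DTE', 'Mark', 'newtRaphIV']] = quotes.apply(with_dict, axis=1)[['DTE', 'Mark', 'IV']]

print(option1)
print(option2)
```

Both frames should end up with the mark values under Mark and the iv values under newtRaphIV; option 2 simply pays for one extra intermediate DataFrame.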

DataFrame assignment aligns on the index but not on the columns:

Note that assignments of the form

df[['C','E','D']] = other_df

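align based on the index, not on column names; the columns of other_df are paired with the target columns purely by position. So the column names of df.apply(newtonRap, axis=1) do not matter. For example, it does not help to change

return pd.Series({'DTE': dte, 'Mark': mark, 'IV': iv})

to

return pd.Series({'DTE': dte, 'Mark': mark, 'newtRaphIV': iv})

so that the column names of df.apply(newtonRap, axis=1) match those of df[['DTE', 'Mark', 'newtRaphIV']]. If the result does come out right, it is only by sheer luck that the order of the columns returned by df.apply(newtonRap, axis=1) happens to match the desired order. To substantiate this claim, consider the example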

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(10, size=(3,2)), columns=list('AB'))
new = pd.DataFrame(np.arange(9).reshape((3,3)), columns=list('CDE'), index=[2,1,0])
#    C  D  E
# 2  0  1  2
# 1  3  4  5
# 0  6  7  8

df[['C','E','D']] = new
#    A  B  C  E  D
# 0  7  9  6  7  8
# 1  4  9  3  4  5
# 2  8  2  0  1  2

Notice that the indexes of new and df were aligned, but there was no alignment based on column labels.
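If the labels are meant to line up, one way is to reorder the right-hand side by label before assigning, so that the positional pairing coincides with the names. The snippet below is a small sketch of that idea; the fixed values in df and the targets list are assumptions chosen for the illustration.

```python
import numpy as np
import pandas as pd

# Fixed values instead of random ones so the result is reproducible.
df = pd.DataFrame({'A': [7, 4, 8], 'B': [9, 9, 2]})
new = pd.DataFrame(np.arange(9).reshape((3, 3)), columns=list('CDE'), index=[2, 1, 0])

targets = ['C', 'E', 'D']

# Selecting new[targets] puts the source columns in the same order as the
# target labels, so position-based pairing now matches the names.
# Rows are still aligned on the index, as before.
df[targets] = new[targets]

print(df)
```

With this change, column E of df receives column E of new rather than column D.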

Fixing the order of the columns of the DataFrame returned by apply:

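Note that dict keys are unordered in the Python versions this answer was written for (Python 2 and Python 3 before 3.6). In other words, when iterated over, dict keys can appear in any order. In fact, in those Python 3 versions, dict.keys() may return the same keys in a different order each time the same code is run. Since Python 3.7, dicts preserve insertion order, but passing an explicit index still makes the intended order unambiguous.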

Because dict keys have an indeterminate order,

pd.Series({'DTE': dte, 'Mark': mark, 'IV': iv})

is a Series whose index has an indeterminate order, and therefore df.apply(newtonRap, axis=1) is a DataFrame whose columns appear in an indeterminate order.

However, if you use

return pd.Series((dte, mark, iv), index=['DTE','Mark','IV'])

then the order of the Series index is fixed. df.apply(newtonRap, axis=1) therefore has a fixed column order, and

df[['DTE', 'Mark', 'newtRaphIV']] = df.apply(newtonRap, axis=1)

will work as expected.
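As a side note, a label-based alternative to the positional assignment is to rename the computed column and join the result back onto the original frame. This is not the approach used in the answer, only a sketch of a different technique; the toy data and the helper name toy_row_func are hypothetical.

```python
import pandas as pd

# Toy frame that already has an 'IV' column, like the question's data.
df = pd.DataFrame({'Bid': [3.6, 2.34], 'Ask': [3.7, 2.42], 'IV': [0.3497, 0.3146]})


def toy_row_func(row):
    # Hypothetical stand-in for newtonRap with constant DTE and IV values.
    mark = (row['Bid'] + row['Ask']) / 2
    return pd.Series((12.0, mark, 0.33), index=['DTE', 'Mark', 'IV'])


# Rename the computed IV so it does not collide with the existing 'IV' column,
# then join on the index; join attaches columns by label, not by position.
computed = df.apply(toy_row_func, axis=1).rename(columns={'IV': 'newtRaphIV'})
out = df.join(computed)

print(out)
```

Because join carries the column labels along, the order of the columns in computed no longer affects which values land under which name.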