Pandas groupby to SQL GROUP BY

Time: 2020-02-04 20:44:38

Tags: python sql pandas postgresql

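I have a dataframe and currently do most of the calculations in Python, but with hundreds of millions of rows, SQL should be faster. My code below groups put and call options by expiration and strike. For each group we look at the highest mid price and take the difference between the call and the put. Afterwards, we look for the smallest difference per expiration. The Python code is as follows: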

df['Price'] = (df['Bid'].values + df['Ask'].values) / 2
df['Maturity'] = (df['Expiration'] - df['DataDate']).dt.days / 365

Output of df:

       UnderlyingSymbol  UnderlyingPrice  Type Expiration   DataDate  Strike  \
686098              SPY       289.839996  call 2018-09-04 2018-09-04   150.0   
686100              SPY       289.839996  call 2018-09-04 2018-09-04   155.0   
686102              SPY       289.839996  call 2018-09-04 2018-09-04   160.0   
686104              SPY       289.839996  call 2018-09-04 2018-09-04   165.0   
686106              SPY       289.839996  call 2018-09-04 2018-09-04   170.0   
                ...              ...   ...        ...        ...     ...   
691381              SPY       289.839996   put 2020-12-18 2018-09-04   400.0   
691382              SPY       289.839996  call 2020-12-18 2018-09-04   405.0   
691383              SPY       289.839996   put 2020-12-18 2018-09-04   405.0   
691384              SPY       289.839996  call 2020-12-18 2018-09-04   410.0   
691385              SPY       289.839996   put 2020-12-18 2018-09-04   410.0   

              Last         Bid         Ask       Price  Maturity  
686098  136.710007  139.860001  140.119995  139.989990  0.000000  
686100  132.520004  134.850006  135.119995  134.985001  0.000000  
686102  127.519997  129.860001  130.119995  129.989990  0.000000  
686104  120.349998  124.779999  125.220001  125.000000  0.000000  
686106  115.389999  119.779999  120.220001  120.000000  0.000000  
           ...         ...         ...         ...       ...  
691381  128.729996  110.260002  111.660004  110.960007  2.290411  
691382    0.850000    0.740000    0.900000    0.820000  2.290411  
691383  134.089996  115.239998  116.190002  115.714996  2.290411  
691384    0.690000    0.640000    0.800000    0.720000  2.290411  
691385  128.550003  120.230003  121.639999  120.934998  2.290411  

After this, we group by expiration and strike and look at the highest mid price of each group.

c = df[df.Type == 'call'].groupby(['Expiration','Strike'])['Price'].first()
p = df[df.Type == 'put'].groupby(['Expiration','Strike'])['Price'].first()
df = df.join((c - p).rename('CP_diff'), on=['Expiration','Strike'])

df = df[~df.CP_diff.isna()]
df['Forward'] = df['CP_diff'].values + df['Strike']
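To make the grouping step concrete, here is a minimal sketch on a hypothetical toy frame (the rows and the name `toy` are made up for illustration, not taken from the real data). It shows that `c - p` aligns on the (Expiration, Strike) index, so a strike that has only a call or only a put ends up as NaN and is dropped. Note that `.first()` takes the first row of each group, which equals the highest mid price only when there is a single quote per (Expiration, Strike, Type).

```python
import pandas as pd

# Hypothetical toy data: two complete call/put pairs and one call without a put.
toy = pd.DataFrame({
    "Expiration": pd.to_datetime(["2018-09-05"] * 5),
    "Strike": [285.0, 285.0, 290.0, 290.0, 295.0],
    "Type": ["call", "put", "call", "put", "call"],
    "Bid": [5.0, 0.5, 1.0, 1.5, 0.1],
    "Ask": [6.0, 1.5, 2.0, 2.5, 0.3],
})
toy["Price"] = (toy["Bid"] + toy["Ask"]) / 2

keys = ["Expiration", "Strike"]
c = toy[toy.Type == "call"].groupby(keys)["Price"].first()
p = toy[toy.Type == "put"].groupby(keys)["Price"].first()

# The subtraction aligns on the (Expiration, Strike) index;
# strike 295 has no put, so its difference is NaN.
cp_diff = (c - p).rename("CP_diff")

toy = toy.join(cp_diff, on=keys)
toy = toy[toy.CP_diff.notna()].copy()
toy["Forward"] = toy["CP_diff"] + toy["Strike"]
print(toy[["Strike", "Type", "Price", "CP_diff", "Forward"]])
```

Both remaining strikes give a forward of 289.5 here, and the unmatched 295 call is gone.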

Output of c:

 Expiration  Strike
2018-09-04  150.0     139.989990
            155.0     134.985001
            160.0     129.989990
            165.0     125.000000
            170.0     120.000000

2020-12-18  390.0       1.290000
            395.0       1.095000
            400.0       0.965000
            405.0       0.820000
            410.0       0.720000

After this, we take the smallest price difference within each expiration and update the dataframe accordingly:

minimum_difference = df.loc[df.groupby("Expiration")["CP_diff"].idxmin().values]
minimum_difference = minimum_difference[['Forward', 'Expiration']].set_index("Expiration")
df = df.set_index("Expiration")
df.update(minimum_difference)
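The same per-expiration step can be sketched on hypothetical toy values as follows. This variant broadcasts the selected `Forward` back with `Series.map` instead of `set_index` plus `update`; that is a swapped-in technique chosen to keep the example short, not the code from the question. One thing worth checking: `idxmin` on `CP_diff` picks the smallest signed value (the most negative one), not the value closest to zero. If the closest call/put pair is what is intended, which is an assumption about intent, taking `.abs()` before `idxmin` would be needed.

```python
import pandas as pd

# Hypothetical toy values; the column names follow the question.
toy = pd.DataFrame({
    "Expiration": pd.to_datetime(["2018-09-05"] * 4 + ["2018-09-07"] * 2),
    "Strike": [285.0, 285.0, 290.0, 290.0, 290.0, 290.0],
    "CP_diff": [4.5, 4.5, -0.02, -0.02, 0.01, 0.01],
    "Forward": [289.5, 289.5, 289.98, 289.98, 290.01, 290.01],
})

# Row label of the smallest (signed) CP_diff within each expiration.
best_rows = toy.groupby("Expiration")["CP_diff"].idxmin()

# One Forward value per expiration, indexed by Expiration.
forward_by_expiration = toy.loc[best_rows].set_index("Expiration")["Forward"]

# Overwrite Forward on every row with the value chosen for its expiration.
toy["Forward"] = toy["Expiration"].map(forward_by_expiration)
print(toy)
```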

Output of minimum_difference:

               Forward
Expiration            
2018-09-04  289.975006
2018-09-05  289.980011
2018-09-07  289.989990
2018-09-10  289.984985
2018-09-12  289.984985
2018-09-14  289.984985
2018-09-17  289.984985

And finally the output of df:

           UnderlyingSymbol  UnderlyingPrice  Type   DataDate  Strike  \
Expiration                                                              
2018-09-04              SPY       289.839996  call 2018-09-04   290.0   
2018-09-04              SPY       289.839996   put 2018-09-04   290.0   
2018-09-05              SPY       289.839996  call 2018-09-04   270.0   
2018-09-05              SPY       289.839996   put 2018-09-04   270.0   
2018-09-05              SPY       289.839996  call 2018-09-04   270.5   
                    ...              ...   ...        ...     ...   
2020-12-18              SPY       289.839996   put 2018-09-04   400.0   
2020-12-18              SPY       289.839996  call 2018-09-04   405.0   
2020-12-18              SPY       289.839996   put 2018-09-04   405.0   
2020-12-18              SPY       289.839996  call 2018-09-04   410.0   
2020-12-18              SPY       289.839996   put 2018-09-04   410.0   

                  Last         Bid         Ask       Price  Maturity  \
Expiration                                                             
2018-09-04    0.040000    0.030000    0.040000    0.035000  0.000000   
2018-09-04    0.050000    0.050000    0.070000    0.060000  0.000000   
2018-09-05    0.000000   19.910000   20.080000   19.994999  0.002740   
2018-09-05    0.010000    0.010000    0.020000    0.015000  0.002740   
2018-09-05   19.090000   19.410000   19.580000   19.494999  0.002740   
               ...         ...         ...         ...       ...   
2020-12-18  128.729996  110.260002  111.660004  110.960007  2.290411   
2020-12-18    0.850000    0.740000    0.900000    0.820000  2.290411   
2020-12-18  134.089996  115.239998  116.190002  115.714996  2.290411   
2020-12-18    0.690000    0.640000    0.800000    0.720000  2.290411   
2020-12-18  128.550003  120.230003  121.639999  120.934998  2.290411   

               Forward  
Expiration              
2018-09-04  289.975006  
2018-09-04  289.975006  
2018-09-05  289.980011  
2018-09-05  289.980011  
2018-09-05  289.980011

How can I achieve the same thing with SQL? My attempt so far:

WITH summary AS (
    SELECT od.datadate, od.expiration, od.type, (od.ask + od.bid) / 2 AS mid,
           ROW_NUMBER() OVER (PARTITION BY od.expiration, od.type, od.datadate
                              ORDER BY (od.ask + od.bid) / 2 DESC) AS rk
      FROM option_data od
     WHERE od.underlyingsymbol = 'SPY')

SELECT s.*
FROM summary s
WHERE s.rk = 1

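Am I right that the Python variables p and c can be computed with the query above? This would be my first more complex SQL query, and I am looking for an example I can build on.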
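One observation that may help: the query partitions by expiration, type and datadate but not by strike, so `rk = 1` yields one row per (expiration, type, datadate), whereas `c` and `p` have one row per (Expiration, Strike). A possible alternative, sketched below, groups by expiration and strike and pivots calls and puts with conditional aggregation. This is only a sketch under assumptions: it runs against an in-memory SQLite table with hypothetical rows as a stand-in for the PostgreSQL `option_data` table, it uses `MAX` (the "highest mid price" from the description) where the pandas code uses `.first()`, and it ignores `datadate` just as the pandas groupby does. With more than one `datadate` in the table, that column would probably need to be added to the `GROUP BY`.

```python
import sqlite3

# In-memory stand-in for the real option_data table (hypothetical rows).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE option_data (
        underlyingsymbol TEXT, type TEXT, expiration TEXT,
        datadate TEXT, strike REAL, bid REAL, ask REAL
    )
""")
rows = [
    ("SPY", "call", "2018-09-05", "2018-09-04", 285.0, 5.0, 6.0),
    ("SPY", "put",  "2018-09-05", "2018-09-04", 285.0, 0.5, 1.5),
    ("SPY", "call", "2018-09-05", "2018-09-04", 290.0, 1.0, 2.0),
    ("SPY", "put",  "2018-09-05", "2018-09-04", 290.0, 1.5, 2.5),
    ("SPY", "call", "2018-09-05", "2018-09-04", 295.0, 0.1, 0.3),  # no matching put
]
conn.executemany("INSERT INTO option_data VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# One row per (expiration, strike): c and p side by side, then their difference.
# HAVING keeps only strikes that have both a call and a put,
# mirroring the NaN filter in the pandas code.
query = """
    SELECT expiration, strike, c - p AS cp_diff, strike + c - p AS forward
    FROM (
        SELECT expiration,
               strike,
               MAX(CASE WHEN type = 'call' THEN (bid + ask) / 2 END) AS c,
               MAX(CASE WHEN type = 'put'  THEN (bid + ask) / 2 END) AS p
        FROM option_data
        WHERE underlyingsymbol = 'SPY'
        GROUP BY expiration, strike
        HAVING COUNT(DISTINCT type) = 2
    ) AS pairs
    ORDER BY expiration, strike
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

The inner query plays the role of `c` and `p` together, and the outer query corresponds to `CP_diff` and `Forward`. The minimum per expiration could then be taken from this result, for example with `ROW_NUMBER() OVER (PARTITION BY expiration ORDER BY cp_diff)`.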

0 answers:

No answers yet