Python pandas groupby not working as expected

Time: 2016-06-23 04:21:33

Tags: python pandas group-by

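I want to do a pandas groupby per ticker in the code below, in order to calculate different KPIs for each ticker in a list of stocks. Here I am only showing the 'Difference' col, which is the change in Adj_Close from the previous day. Obviously I do not want the difference between different tickers, since that makes no sense, hence the groupby. But it is not working as expected.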

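The problem shows up in the output. As you can see below, the groupby does not do what I expected: the 'Difference' col runs across the boundary between the different groups (tickers). It computes the difference between the last row of the first ticker group and the first row of the second ticker group. That is not intended. That row should be NaN, just like the very first row.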

This is the resulting 'Difference' col in df (columns: Date, Difference):

2015-04-09 NaN
2015-04-10 1.180000
2015-04-13 3.150000
2015-04-14 -0.980000
2015-04-15 1.280000
2015-04-16 -8.280000
2015-04-17 -8.770000
2015-04-09 -139.859995 This is not correct. The groupby does not separate the tickers as it should. This should be a NaN... not the diff between 2 different tickers!

2015-04-10 0.899994
2015-04-13 -1.130005
2015-04-14 -0.589996
2015-04-15 1.000000
2015-04-16 0.350006
2015-04-09 -139.859995

Any ideas why the 'Difference' col in my code is not kept separate per group, as it should be?

import pandas as pd
import time
from io import StringIO

text = """Date   Ticker        Open        High         Low   Adj_Close   Volume
    2015-04-09  vws.co  315.000000  316.100000  312.500000  311.520000  1686800
    2015-04-10  vws.co  317.000000  319.700000  316.400000  312.700000  1396500
    2015-04-13  vws.co  317.900000  321.500000  315.200000  315.850000  1564500
    2015-04-14  vws.co  320.000000  322.400000  318.700000  314.870000  1370600
    2015-04-15  vws.co  320.000000  321.500000  319.200000  316.150000   945000
    2015-04-16  vws.co  319.000000  320.200000  310.400000  307.870000  2236100
    2015-04-17  vws.co  309.900000  310.000000  302.500000  299.100000  2711900
    2015-04-20  vws.co  303.000000  312.000000  303.000000  306.490000  1629700
    2015-04-09     mmm  166.750000  167.500000  166.500000  166.630005  1762800
    2015-04-10     mmm  165.630005  167.740005  164.789993  167.529999  1993700
    2015-04-13     mmm  167.110001  167.490005  165.919998  166.399994  2022800
    2015-04-14     mmm  165.179993  166.550003  164.649994  165.809998  1610300
    2015-04-15     mmm  165.339996  167.080002  164.839996  166.809998  2092200
    2015-04-16     mmm  165.880005  167.229996  165.250000  167.160004  2721900"""

# sep=r'\s+' splits on whitespace; delim_whitespace is deprecated in recent pandas versions
df = pd.read_csv(StringIO(text), sep=r'\s+', parse_dates=[0], index_col=0)

def Screener(group):

    def diff_calc(group):

        df['Difference'] = df['Adj_Close'].diff()
        return df['Difference']

    df['Difference'] = diff_calc(group)
    return df

if __name__ == '__main__':

    ### groupby screener (filtering to only the relevant ticker group)
    grouped = df.groupby('Ticker', as_index=False) # Now doing the groupby outside the iteration...

    for name, group in grouped:
        # Testing/showing the groups...
        print ('(group)\n',name,'\n')
        print ('(group (ticker) in df)\n',group.head(10),'\n')
        df = Screener(group)
        print(60 * '=')

    # Test the first 3 rows of each group for 'Difference' col transgress groups...
    df_test = df.groupby('Ticker').head(3).reset_index().set_index('Date')
    print ('df_test (summary from df) (Output)\n',df_test,'\n')

Apparently my groupby itself works as expected, but the 'Difference' col does not behave correctly in my test output:

(group)
 mmm 

(group (ticker) in df)
            Ticker        Open        High         Low   Adj_Close   Volume
Date                                                                      
2015-04-09    mmm  166.750000  167.500000  166.500000  166.630005  1762800
2015-04-10    mmm  165.630005  167.740005  164.789993  167.529999  1993700
2015-04-13    mmm  167.110001  167.490005  165.919998  166.399994  2022800
2015-04-14    mmm  165.179993  166.550003  164.649994  165.809998  1610300
2015-04-15    mmm  165.339996  167.080002  164.839996  166.809998  2092200
2015-04-16    mmm  165.880005  167.229996  165.250000  167.160004  2721900 

============================================================
(group)
 vws.co 

(group (ticker) in df)
             Ticker   Open   High    Low  Adj_Close   Volume
Date                                                       
2015-04-09  vws.co  315.0  316.1  312.5     311.52  1686800
2015-04-10  vws.co  317.0  319.7  316.4     312.70  1396500
2015-04-13  vws.co  317.9  321.5  315.2     315.85  1564500
2015-04-14  vws.co  320.0  322.4  318.7     314.87  1370600
2015-04-15  vws.co  320.0  321.5  319.2     316.15   945000
2015-04-16  vws.co  319.0  320.2  310.4     307.87  2236100
2015-04-17  vws.co  309.9  310.0  302.5     299.10  2711900
2015-04-20  vws.co  303.0  312.0  303.0     306.49  1629700 

============================================================
df_test (summary from df) (Output)
             Ticker        Open        High         Low   Adj_Close   Volume  \
Date                                                                          
2015-04-09  vws.co  315.000000  316.100000  312.500000  311.520000  1686800   
2015-04-10  vws.co  317.000000  319.700000  316.400000  312.700000  1396500   
2015-04-13  vws.co  317.900000  321.500000  315.200000  315.850000  1564500   
2015-04-09     mmm  166.750000  167.500000  166.500000  166.630005  1762800   
2015-04-10     mmm  165.630005  167.740005  164.789993  167.529999  1993700   
2015-04-13     mmm  167.110001  167.490005  165.919998  166.399994  2022800   

            Difference  
Date                    
2015-04-09         NaN  
2015-04-10    1.180000  
2015-04-13    3.150000  
2015-04-09 -139.859995  This is not correct!!! This should be NaN...
2015-04-10    0.899994  
2015-04-13   -1.130005 

1 answer:

Answer 0 (score: 0)

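After reviewing your code more carefully: the scoping in the Screener function is wrong. You reference df inside that function without passing df in as a parameter. This means the function operates on the df variable defined in the enclosing scope, i.e. the main df. As a result, you are assigning the diff() of the entire df, not of the grouped df.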
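The scoping fix described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the poster's final code: the function name `screener`, the shortened sample data (three rows per ticker, Adj_Close only) and the use of `pd.concat` to reassemble the groups are choices made for this example.

```python
import pandas as pd
from io import StringIO

# Shortened sample of the question's data (assumption: 3 rows per ticker).
text = """Date Ticker Adj_Close
2015-04-09 vws.co 311.520000
2015-04-10 vws.co 312.700000
2015-04-13 vws.co 315.850000
2015-04-09 mmm 166.630005
2015-04-10 mmm 167.529999
2015-04-13 mmm 166.399994"""

df = pd.read_csv(StringIO(text), sep=r"\s+", parse_dates=[0], index_col=0)


def screener(group):
    # Operate on the group that was passed in, not on the global df.
    group = group.copy()
    group["Difference"] = group["Adj_Close"].diff()
    return group


# sort=False keeps the tickers in their order of first appearance.
parts = [screener(group) for _, group in df.groupby("Ticker", sort=False)]
result = pd.concat(parts)
print(result)
```

Because each call only sees the rows of one ticker, the first row of every group has no predecessor and its 'Difference' is NaN.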

I find this easier:

df
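A minimal sketch of one simpler approach, assuming the intent is to let pandas compute the difference per group directly with `groupby` and `diff`. This is not necessarily the exact code from the answer, and the shortened sample data is an assumption made for the example.

```python
import pandas as pd
from io import StringIO

# Shortened sample of the question's data (assumption: 3 rows per ticker).
text = """Date Ticker Adj_Close
2015-04-09 vws.co 311.520000
2015-04-10 vws.co 312.700000
2015-04-13 vws.co 315.850000
2015-04-09 mmm 166.630005
2015-04-10 mmm 167.529999
2015-04-13 mmm 166.399994"""

df = pd.read_csv(StringIO(text), sep=r"\s+", parse_dates=[0], index_col=0)

# diff() is evaluated within each ticker group; the result keeps the
# original row order and index, so it can be assigned back directly.
df["Difference"] = df.groupby("Ticker")["Adj_Close"].diff()
print(df)
```

With this form no explicit loop over the groups is needed, and the first row of each ticker gets NaN instead of a difference against the previous ticker.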
