在python中滚动数据透视表

时间:2018-03-23 13:57:27

标签: python pandas pandas-groupby

我有一个df(样本最后粘贴在这里)。 我希望找到哪个tradePrice每过去10分钟最多tradeVolume,或任何其他滚动期。

这是在xls中完成的数据透视表,它基于附加的示例数据。

                                      minute                
tradePrice  Data                   0    1   10  Total Result    
12548       Sum - tradeVolume               3   3   
            Count - tradePrice              2   2   
12548.5     Sum - tradeVolume               1   1   
            Count - tradePrice              1   1   
12549       Sum - tradeVolume               1   1   
            Count - tradePrice              1   1   
12549.5     Sum - tradeVolume               95  95  
            Count - tradePrice              5   5   
12550       Sum - tradeVolume               6   6   
            Count - tradePrice              4   4   
12559       Sum - tradeVolume       93          93  
            Count - tradePrice      1           1   
12559.5     Sum - tradeVolume           1       1   
            Count - tradePrice          1       1   
12560       Sum - tradeVolume           5       5   
            Count - tradePrice          4       4   
12560.5     Sum - tradeVolume       3   5       8   
            Count - tradePrice     3    2       5   
12561       Sum - tradeVolume      4    5       9   
            Count - tradePrice     2    3       5   
12561.5     Sum - tradeVolume      3    2       5   
            Count - tradePrice     3    1       4   
12562       Sum - tradeVolume      9    7       16  
            Count - tradePrice     8    1       9   
12562.5     Sum - tradeVolume       6   2       8   
            Count - tradePrice     2    2       4   
12563       Sum - tradeVolume      2            2   
            Count - tradePrice     1            1   
Total Sum - tradeVolume     120 27  106 253 
Total Count - tradePrice    20  14  13  47  

结果需要像这样搜索交易量最大的价格:

    Price   Volume
02:00:00 AM 12559   93
02:01:00 AM 12562   7
02:10:00 AM 12549.5 95

为了获得1分钟。结果我小组和&应用了以下功能

def f(x):  # function to find the POC price and volume
    a = x['tradePrice'].value_counts().index[0]
    b = x.loc[x['tradePrice'] == a, 'tradeVolume'].sum()
    return pd.Series([a, b], ['POC_Price', 'POC_Volume'])

groupbytime = (str(Time)+"min")#ther is a column name by this
groups = df.groupby(groupbytime,as_index=True)
df_POC = groups.apply(f)  #applys the function of the POC on the grouped data                                                                  

我的问题是: 如何获得相同的解决方案但是每<滚动时间段(不能少于1分钟),所以过去10分钟的预期结果(以最大交易量交易的价格)是:

    Price   Volume
02:10:00 AM 12549.5 95

提前感谢!

样本数据:

    dateTime    tradePrice  tradeVolume 1min    time_of_day_10  time_of_day_30  date    hour    minute
0   2017-09-19 02:00:04 12559   93  2017-09-19 02:00:00 02:00:00    02:00:00    2017-09-19  2   0
49  2017-09-19 02:00:11 12562   1   2017-09-19 02:00:00 02:00:00    02:00:00    2017-09-19  2   0
50  2017-09-19 02:00:12 12563   2   2017-09-19 02:00:00 02:00:00    02:00:00    2017-09-19  2   0
51  2017-09-19 02:00:12 12562   1   2017-09-19 02:00:00 02:00:00    02:00:00    2017-09-19  2   0
122 2017-09-19 02:00:34 12561.5 1   2017-09-19 02:00:00 02:00:00    02:00:00    2017-09-19  2   0
123 2017-09-19 02:00:34 12562   1   2017-09-19 02:00:00 02:00:00    02:00:00    2017-09-19  2   0
127 2017-09-19 02:00:34 12562   1   2017-09-19 02:00:00 02:00:00    02:00:00    2017-09-19  2   0
129 2017-09-19 02:00:35 12561   2   2017-09-19 02:00:00 02:00:00    02:00:00    2017-09-19  2   0
130 2017-09-19 02:00:35 12560.5 1   2017-09-19 02:00:00 02:00:00    02:00:00    2017-09-19  2   0
131 2017-09-19 02:00:35 12561.5 1   2017-09-19 02:00:00 02:00:00    02:00:00    2017-09-19  2   0
135 2017-09-19 02:00:39 12562   1   2017-09-19 02:00:00 02:00:00    02:00:00    2017-09-19  2   0
136 2017-09-19 02:00:39 12562   1   2017-09-19 02:00:00 02:00:00    02:00:00    2017-09-19  2   0
137 2017-09-19 02:00:43 12561.5 1   2017-09-19 02:00:00 02:00:00    02:00:00    2017-09-19  2   0
138 2017-09-19 02:00:43 12561   2   2017-09-19 02:00:00 02:00:00    02:00:00    2017-09-19  2   0
139 2017-09-19 02:00:43 12560.5 1   2017-09-19 02:00:00 02:00:00    02:00:00    2017-09-19  2   0
140 2017-09-19 02:00:43 12560.5 1   2017-09-19 02:00:00 02:00:00    02:00:00    2017-09-19  2   0
152 2017-09-19 02:00:45 12562   2   2017-09-19 02:00:00 02:00:00    02:00:00    2017-09-19  2   0
153 2017-09-19 02:00:46 12562.5 4   2017-09-19 02:00:00 02:00:00    02:00:00    2017-09-19  2   0
166 2017-09-19 02:00:58 12562   1   2017-09-19 02:00:00 02:00:00    02:00:00    2017-09-19  2   0
167 2017-09-19 02:00:58 12562.5 2   2017-09-19 02:00:00 02:00:00    02:00:00    2017-09-19  2   0
168 2017-09-19 02:01:00 12562   7   2017-09-19 02:01:00 02:00:00    02:00:00    2017-09-19  2   1
169 2017-09-19 02:01:00 12562.5 1   2017-09-19 02:01:00 02:00:00    02:00:00    2017-09-19  2   1
170 2017-09-19 02:01:00 12562.5 1   2017-09-19 02:01:00 02:00:00    02:00:00    2017-09-19  2   1
171 2017-09-19 02:01:00 12561.5 2   2017-09-19 02:01:00 02:00:00    02:00:00    2017-09-19  2   1
175 2017-09-19 02:01:03 12561   1   2017-09-19 02:01:00 02:00:00    02:00:00    2017-09-19  2   1
176 2017-09-19 02:01:03 12561   3   2017-09-19 02:01:00 02:00:00    02:00:00    2017-09-19  2   1
187 2017-09-19 02:01:07 12560.5 2   2017-09-19 02:01:00 02:00:00    02:00:00    2017-09-19  2   1
188 2017-09-19 02:01:08 12561   1   2017-09-19 02:01:00 02:00:00    02:00:00    2017-09-19  2   1
189 2017-09-19 02:01:10 12560   1   2017-09-19 02:01:00 02:00:00    02:00:00    2017-09-19  2   1
190 2017-09-19 02:01:10 12560   1   2017-09-19 02:01:00 02:00:00    02:00:00    2017-09-19  2   1
191 2017-09-19 02:01:10 12559.5 1   2017-09-19 02:01:00 02:00:00    02:00:00    2017-09-19  2   1
192 2017-09-19 02:01:11 12560   1   2017-09-19 02:01:00 02:00:00    02:00:00    2017-09-19  2   1
193 2017-09-19 02:01:12 12560   2   2017-09-19 02:01:00 02:00:00    02:00:00    2017-09-19  2   1
194 2017-09-19 02:01:12 12560.5 3   2017-09-19 02:01:00 02:00:00    02:00:00    2017-09-19  2   1
593 2017-09-19 02:10:00 12550   1   2017-09-19 02:10:00 02:10:00    02:00:00    2017-09-19  2   10
594 2017-09-19 02:10:00 12549.5 12  2017-09-19 02:10:00 02:10:00    02:00:00    2017-09-19  2   10
604 2017-09-19 02:10:12 12548.5 1   2017-09-19 02:10:00 02:10:00    02:00:00    2017-09-19  2   10
605 2017-09-19 02:10:15 12549.5 22  2017-09-19 02:10:00 02:10:00    02:00:00    2017-09-19  2   10
606 2017-09-19 02:10:16 12549.5 21  2017-09-19 02:10:00 02:10:00    02:00:00    2017-09-19  2   10
636 2017-09-19 02:10:45 12548   1   2017-09-19 02:10:00 02:10:00    02:00:00    2017-09-19  2   10
637 2017-09-19 02:10:47 12548   2   2017-09-19 02:10:00 02:10:00    02:00:00    2017-09-19  2   10
638 2017-09-19 02:10:47 12549.5 23  2017-09-19 02:10:00 02:10:00    02:00:00    2017-09-19  2   10
639 2017-09-19 02:10:48 12549.5 17  2017-09-19 02:10:00 02:10:00    02:00:00    2017-09-19  2   10
640 2017-09-19 02:10:49 12549   1   2017-09-19 02:10:00 02:10:00    02:00:00    2017-09-19  2   10
665 2017-09-19 02:10:58 12550   1   2017-09-19 02:10:00 02:10:00    02:00:00    2017-09-19  2   10
666 2017-09-19 02:10:58 12550   1   2017-09-19 02:10:00 02:10:00    02:00:00    2017-09-19  2   10
667 2017-09-19 02:10:58 12550   3   2017-09-19 02:10:00 02:10:00    02:00:00    2017-09-19  2   10

1 个答案:

答案 0 :(得分:3)

如果我理解你的问题,你需要选择时间粒度和时间窗口。然后你可以结合使用groupby + unstack + rolling。

首先是groupby:

time_grain = '1min'
df = df.groupby([pd.Grouper(key='dateTime', freq=time_grain),'tradePrice']).tradeVolume.sum()
dateTime             tradePrice
2017-09-19 02:00:00  12559.0       93
                     12560.5        3
                     12561.0        4
                     12561.5        3
                     12562.0        9
                     12562.5        6
                     12563.0        2
2017-09-19 02:01:00  12559.5        1
                     12560.0        5
                     12560.5        5
                     12561.0        5
                     12561.5        2
                     12562.0        7
                     12562.5        2
2017-09-19 02:10:00  12548.0        3
                     12548.5        1
                     12549.0        1
                     12549.5       95
                     12550.0        6
Name: tradeVolume, dtype: int64

然后取消堆叠+滚动:

window_size = '10min'
df = df.unstack('tradePrice').rolling(window_size).sum()
tradePrice           12548.0  12548.5  12549.0  12549.5  12550.0  12559.0  \
dateTime                                                                    
2017-09-19 02:00:00      NaN      NaN      NaN      NaN      NaN     93.0   
2017-09-19 02:01:00      NaN      NaN      NaN      NaN      NaN     93.0   
2017-09-19 02:10:00      3.0      1.0      1.0     95.0      6.0      NaN   

tradePrice           12559.5  12560.0  12560.5  12561.0  12561.5  12562.0  \
dateTime                                                                    
2017-09-19 02:00:00      NaN      NaN      3.0      4.0      3.0      9.0   
2017-09-19 02:01:00      1.0      5.0      8.0      9.0      5.0     16.0   
2017-09-19 02:10:00      1.0      5.0      5.0      5.0      2.0      7.0   

tradePrice           12562.5  12563.0  
dateTime                               
2017-09-19 02:00:00      6.0      2.0  
2017-09-19 02:01:00      8.0      2.0  
2017-09-19 02:10:00      2.0      NaN

最后将tradePrice堆叠回索引并找到每个时间段内具有最高值的索引:

df = df.stack('tradePrice')
idx_list = df.groupby('dateTime').idxmax()
result = df.loc[idx_list]
dateTime             tradePrice
2017-09-19 02:00:00  12559.0       93.0
2017-09-19 02:01:00  12559.0       93.0
2017-09-19 02:10:00  12549.5       95.0
dtype: float64

请注意,如果您使用时间偏移量,则滚动默认值为最小观测值1。这就是为什么你得到3个结果行。

我认为这种方法的最大缺点是,对于具有大量价格点的大型数据帧,这将占用大量内存(因为每个价格点都会生成一个新列)。