I am trying to resample / groupby a tick list into OHLC bars that are not time-based (range bars, volume bars, etc.).
Raw data DF:
id symbol utime time price vol cc cv ttype
1 DOLX16 1.476961e+09 2016-10-20 09:00:37 3179.0 5 120 120 R
2 DOLX16 1.476961e+09 2016-10-20 09:00:37 3179.0 10 735 120 R
3 DOLX16 1.476961e+09 2016-10-20 09:00:37 3179.0 20 735 120 R
4 DOLX16 1.476961e+09 2016-10-20 09:00:37 3179.0 30 735 3 R
5 DOLX16 1.476961e+09 2016-10-20 09:00:37 3179.0 5 735 147 R
I need to "groupby" rows until the sum of the vol column is <= [constant], for example 500; when 500 is reached, start summing again...
Pseudo:
vol_amount = 500
adict = {'open': 'first', 'high':'max', 'low':'min', 'close' : 'last' }
ohlc_vol = data.groupby(data['vol'].cumsum() <= vol_amount)['price'].agg(adict)
ohlc_vol['ticks_count'] = data.groupby(data['vol'].cumsum() <= vol_amount)['vol'].count()
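Thanks for your help!
Answer 0 (score: 1)
Consider using the integer division operator, the double forward slash //, to divide the cumulative sum of volume by multiples of vol_amount. Then use that grouping in the price aggregation: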
vol_amount = 100
data['volcum'] = data['vol'].cumsum()
data['volcumgrp'] = data['volcum'] - ((data['volcum'] // vol_amount) * vol_amount)
# passed as keywords (named aggregation, pandas >= 0.25); key order sets the column order
adict = {'open': 'first', 'low': 'min', 'high': 'max', 'close': 'last'}
ohlc_vol = data.groupby('volcumgrp')['price'].agg(**adict)
ohlc_vol['ticks_count'] = data.groupby('volcumgrp')['vol'].count()
To demonstrate, using the posted dataframe stacked repeatedly as data:
from io import StringIO
import pandas as pd
text = '''
id symbol utime time price vol cc cv ttype
1 DOLX16 1.476961e+09 2016-10-20 09:00:37 3179.0 5 120 120 R
2 DOLX16 1.476961e+09 2016-10-20 09:00:37 3179.0 10 735 120 R
3 DOLX16 1.476961e+09 2016-10-20 09:00:37 3179.0 20 735 120 R
4 DOLX16 1.476961e+09 2016-10-20 09:00:37 3179.0 30 735 3 R
5 DOLX16 1.476961e+09 2016-10-20 09:00:37 3179.0 5 735 147 R
'''
# stack the five sample rows five times (25 ticks); columns are whitespace-separated
data = pd.concat([pd.read_table(StringIO(text), sep=r"\s+") for _ in range(5)])
# RANDOMIZE PRICE FOR DEMO
from random import randint, seed
seed(a=48)
data['price'] = [float(randint(3175,3199)) for i in range(25)]
# VOLUME CUMULATIVE GROUP
vol_amount = 100
data['volcum'] = data['vol'].cumsum()
data['volcumgrp'] = data['volcum'] - ((data['volcum'] // vol_amount) * vol_amount)
print(data)
# PRICE AGGREGATION
# passed as keywords (named aggregation, pandas >= 0.25); key order sets the column order
adict = {'open': 'first', 'low': 'min', 'high': 'max', 'close': 'last'}
ohlc_vol = data.groupby('volcumgrp')['price'].agg(**adict)
ohlc_vol['ticks_count'] = data.groupby('volcumgrp')['vol'].count()
print(ohlc_vol)
Output
data df (in groups of 100)
id symbol utime time price vol cc cv ttype volcum volcumgrp
1 DOLX16 1.476961e+09 2016-10-20 09:00:37 3192.0 5 120 120 R 5 5
2 DOLX16 1.476961e+09 2016-10-20 09:00:37 3185.0 10 735 120 R 15 15
3 DOLX16 1.476961e+09 2016-10-20 09:00:37 3179.0 20 735 120 R 35 35
4 DOLX16 1.476961e+09 2016-10-20 09:00:37 3192.0 30 735 3 R 65 65
5 DOLX16 1.476961e+09 2016-10-20 09:00:37 3197.0 5 735 147 R 70 70
1 DOLX16 1.476961e+09 2016-10-20 09:00:37 3192.0 5 120 120 R 75 75
2 DOLX16 1.476961e+09 2016-10-20 09:00:37 3184.0 10 735 120 R 85 85
3 DOLX16 1.476961e+09 2016-10-20 09:00:37 3191.0 20 735 120 R 105 5
4 DOLX16 1.476961e+09 2016-10-20 09:00:37 3181.0 30 735 3 R 135 35
5 DOLX16 1.476961e+09 2016-10-20 09:00:37 3197.0 5 735 147 R 140 40
1 DOLX16 1.476961e+09 2016-10-20 09:00:37 3199.0 5 120 120 R 145 45
2 DOLX16 1.476961e+09 2016-10-20 09:00:37 3188.0 10 735 120 R 155 55
3 DOLX16 1.476961e+09 2016-10-20 09:00:37 3180.0 20 735 120 R 175 75
4 DOLX16 1.476961e+09 2016-10-20 09:00:37 3179.0 30 735 3 R 205 5
5 DOLX16 1.476961e+09 2016-10-20 09:00:37 3196.0 5 735 147 R 210 10
1 DOLX16 1.476961e+09 2016-10-20 09:00:37 3178.0 5 120 120 R 215 15
2 DOLX16 1.476961e+09 2016-10-20 09:00:37 3190.0 10 735 120 R 225 25
3 DOLX16 1.476961e+09 2016-10-20 09:00:37 3195.0 20 735 120 R 245 45
4 DOLX16 1.476961e+09 2016-10-20 09:00:37 3182.0 30 735 3 R 275 75
5 DOLX16 1.476961e+09 2016-10-20 09:00:37 3181.0 5 735 147 R 280 80
1 DOLX16 1.476961e+09 2016-10-20 09:00:37 3199.0 5 120 120 R 285 85
2 DOLX16 1.476961e+09 2016-10-20 09:00:37 3191.0 10 735 120 R 295 95
3 DOLX16 1.476961e+09 2016-10-20 09:00:37 3192.0 20 735 120 R 315 15
4 DOLX16 1.476961e+09 2016-10-20 09:00:37 3191.0 30 735 3 R 345 45
5 DOLX16 1.476961e+09 2016-10-20 09:00:37 3179.0 5 735 147 R 350 50
ohlc_vol df
open low high close ticks_count
volcumgrp
5 3192.0 3179.0 3192.0 3179.0 3
10 3196.0 3196.0 3196.0 3196.0 1
15 3185.0 3178.0 3192.0 3192.0 3
25 3190.0 3190.0 3190.0 3190.0 1
35 3179.0 3179.0 3181.0 3181.0 2
40 3197.0 3197.0 3197.0 3197.0 1
45 3199.0 3191.0 3199.0 3191.0 3
50 3179.0 3179.0 3179.0 3179.0 1
55 3188.0 3188.0 3188.0 3188.0 1
65 3192.0 3192.0 3192.0 3192.0 1
70 3197.0 3197.0 3197.0 3197.0 1
75 3192.0 3180.0 3192.0 3182.0 3
80 3181.0 3181.0 3181.0 3181.0 1
85 3184.0 3184.0 3199.0 3199.0 2
95 3191.0 3191.0 3191.0 3191.0 1
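One thing to watch in the output above: `volcum - (volcum // vol_amount) * vol_amount` is the remainder `volcum % vol_amount`, so ticks from different 100-lot windows that share a remainder fall into the same group (group 5 collects the ticks at cumulative volume 5, 105 and 205). If consecutive bars are what is wanted, a minimal sketch, assuming the quotient `volcum // vol_amount` is an acceptable bar label, could look like this. The helper name `volume_bars` and the `bar` column are made up for illustration, and the input is just the first ten ticks of the demo data.

```python
import pandas as pd

def volume_bars(ticks, vol_amount):
    """Group ticks by the vol_amount window their cumulative volume falls in."""
    # Quotient of the running volume: 0 for the first window, 1 for the next, ...
    labeled = ticks.assign(bar=ticks["vol"].cumsum() // vol_amount)
    grouped = labeled.groupby("bar")
    # Named aggregation: each keyword becomes an output column
    bars = grouped["price"].agg(open="first", high="max", low="min", close="last")
    bars["ticks_count"] = grouped["vol"].count()
    return bars

# First ten ticks of the demo data (price and vol columns only)
ticks = pd.DataFrame({
    "price": [3192.0, 3185.0, 3179.0, 3192.0, 3197.0,
              3192.0, 3184.0, 3191.0, 3181.0, 3197.0],
    "vol": [5, 10, 20, 30, 5, 5, 10, 20, 30, 5],
})
bars = volume_bars(ticks, vol_amount=100)
print(bars)
```

With this labelling the tick that pushes the running volume to or past a multiple of vol_amount opens the next bar, and the running total is never reset, so a bar is not guaranteed to hold exactly vol_amount lots.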
Answer 1 (score: 0)
The ol' "cumsum" trick:
import pandas as pd
def grouper(df, threshold=500):
    df['new_bin'] = 0
    cum_vol = 0
    for i in df.index:
        # start a new bin once the running volume has reached the threshold
        if cum_vol >= threshold:
            df.loc[i, 'new_bin'] = 1
            cum_vol = 0
        cum_vol += df.loc[i, "vol"]
    # the cumulative sum of the bin-start flags numbers the groups 0, 1, 2, ...
    df['group'] = df['new_bin'].cumsum()
    return df.drop("new_bin", axis=1)
df = pd.DataFrame({"price" : [1, 2, 3, 2, 4, 5, 6, 7, 6],
                   "vol": [100, 300, 101, 100, 402, 103, 300, 100, 30]})
df = grouper(df)
print(df)
price vol group
0 1 100 0
1 2 300 0
2 3 101 0
3 2 100 1
4 4 402 1
5 5 103 2
6 6 300 2
7 7 100 2
8 6 30 3
adict = {'open': 'first', 'high': 'max', 'low': 'min', 'close': 'last'}
# passed as keywords (named aggregation, pandas >= 0.25)
print(df.groupby("group")['price'].agg(**adict))
       open  high  low  close
group
0         1     3    1      3
1         2     4    2      4
2         5     7    5      7
3         6     6    6      6
print(df.groupby("group")['vol'].sum())
group
0 501
1 502
2 503
3 30
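On long tick lists the per-row `df.loc` writes in `grouper` can get slow. A sketch of the same reset-on-threshold rule, run over a plain NumPy array and followed by a single named aggregation, might look like this. The names `bar_labels` and `bars` are made up here, and the tuple form of `agg` assumes pandas 0.25 or later.

```python
import numpy as np
import pandas as pd

def bar_labels(vol, threshold=500):
    """Return one integer bar label per tick, resetting the running volume at each new bar."""
    labels = np.empty(len(vol), dtype=np.int64)
    cum_vol = 0
    label = 0
    for i, v in enumerate(vol):
        # The previous ticks already reached the threshold: open a new bar
        if cum_vol >= threshold:
            label += 1
            cum_vol = 0
        cum_vol += v
        labels[i] = label
    return labels

df = pd.DataFrame({"price": [1, 2, 3, 2, 4, 5, 6, 7, 6],
                   "vol": [100, 300, 101, 100, 402, 103, 300, 100, 30]})

labels = bar_labels(df["vol"].to_numpy())

# One pass over the groups: OHLC from price, total volume and tick count from vol
bars = df.groupby(labels).agg(
    open=("price", "first"),
    high=("price", "max"),
    low=("price", "min"),
    close=("price", "last"),
    vol=("vol", "sum"),
    ticks_count=("vol", "count"),
)
print(bars)
```

The labels come out the same as the `group` column above; the input frame is left unmodified because the labels are passed to `groupby` directly instead of being stored as a column.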