Pandas: resampling OHLC data from 1 minute to 1H

Time: 2017-08-01 03:09:06

Tags: pandas dataframe resampling

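I am resampling 1-minute time series data into OHLC bars with Pandas. Resampling to 15 minutes works perfectly, for example on the following dataframe: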

ohlc_dict = {'Open':'first', 'High':'max', 'Low':'min', 'Close': 'last'}
df.resample('15Min').apply(ohlc_dict).dropna(how='any').loc['2011-02-01']

Date Time            Open        High        Low         Close
------------------------------------------------------------------
2011-02-01 09:30:00  3081.940    3086.860    3077.832    3081.214
2011-02-01 09:45:00  3082.422    3083.730    3071.922    3073.801
2011-02-01 10:00:00  3073.303    3078.345    3069.130    3078.345
2011-02-01 10:15:00  3078.563    3078.563    3071.522    3072.279
2011-02-01 10:30:00  3071.873    3071.873    3063.497    3067.364
2011-02-01 10:45:00  3066.735    3070.523    3063.402    3069.974
2011-02-01 11:00:00  3069.561    3069.981    3066.286    3069.981
2011-02-01 11:15:00  3070.602    3074.088    3070.373    3073.919
2011-02-01 13:00:00  3074.778    3074.823    3069.925    3069.925
2011-02-01 13:15:00  3070.096    3070.903    3063.457    3063.457
2011-02-01 13:30:00  3063.929    3067.358    3063.929    3067.358
2011-02-01 13:45:00  3067.570    3072.455    3067.570    3072.247
2011-02-01 14:00:00  3072.927    3081.357    3072.767    3080.175
2011-02-01 14:15:00  3078.843    3079.435    3076.733    3076.782
2011-02-01 14:30:00  3076.721    3081.980    3076.721    3081.912
2011-02-01 14:45:00  3082.822    3083.381    3076.722    3077.283

However, the problem appears when I resample from 1 minute to 1H. With the default settings, the bars start at 9:00 am, but the market opens at 9:30 am.

df.resample('1H').apply(ohlc_dict).dropna(how='any').loc['2011-02-01']

[Image: 1HourOHLC Wrong in Morning]
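A small sketch of why the default labels start at 09:00: for fixed frequencies, `resample` anchors its bins to midnight of the first day, so hourly bin edges fall on the hour no matter when the data begins. The series below is synthetic and only illustrates the labels. The `offset` argument assumes pandas 1.1 or later, where it superseded `base`.

```python
import pandas as pd

# Synthetic 1-minute series covering a 09:30-11:29 morning session (illustrative data).
idx = pd.date_range('2011-02-01 09:30', periods=120, freq='min')
s = pd.Series(range(120), index=idx)

# Default: hourly bin edges fall on the hour, so the first label is 09:00.
default_labels = s.resample('60min').first().index.strftime('%H:%M').tolist()

# Shifting the bin edges by 30 minutes aligns them with the 09:30 open.
shifted_labels = (s.resample('60min', offset='30min')
                   .first().index.strftime('%H:%M').tolist())

print(default_labels)
print(shifted_labels)
```

With the default anchoring, the first bar holds only the 09:30-09:59 rows, which is why the morning bars look wrong.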

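Then I tried changing the base setting, but that fails for the afternoon session. The market should open at 13:00 and close at 15:00, so there should be bars at 13:00, 14:00 and 15:00, 3 bars in total.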

df.resample('60MIN',base=30).apply(ohlc_dict).dropna(how='any').loc['2011-02-01']

[Image: 1HourOHLC Wrong in afternoon]

In summary, I want the bars to fit the market sessions, giving 6 bars (9:30, 10:30, 11:30, 1:00, 2:00, 3:00), but pandas resample only gives me 5 bars (9:30, 10:30, 11:30, 1:30, 2:30).
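One possible approach, sketched under assumptions rather than taken from the thread: resample each trading session separately, so that the morning bins start on the half hour and the afternoon bins start on the hour. It assumes the minute bars are labelled by their start time (09:30-11:29 and 13:00-14:59, as the 15-minute output suggests) and pandas 1.1 or later for `offset`. `session_hourly_ohlc` is a hypothetical helper name. Note that it yields two start-labelled hourly bars per session (09:30, 10:30, 13:00, 14:00), not the six labels listed above.

```python
import pandas as pd

ohlc_dict = {'Open': 'first', 'High': 'max', 'Low': 'min', 'Close': 'last'}


def session_hourly_ohlc(df):
    """Resample 1-minute OHLC rows to 1-hour bars, one trading session at a time.

    Assumes a DatetimeIndex and sessions of 09:30-11:29 and 13:00-14:59.
    """
    # Morning session: shift the bin edges by 30 minutes (09:30, 10:30).
    morning = (df.between_time('09:30', '11:29')
                 .resample('60min', offset='30min')
                 .agg(ohlc_dict))
    # Afternoon session: the default on-the-hour edges already fit (13:00, 14:00).
    afternoon = (df.between_time('13:00', '14:59')
                   .resample('60min')
                   .agg(ohlc_dict))
    # Bins without any rows (lunch break, overnight) are all-NaN; drop them.
    return pd.concat([morning, afternoon]).sort_index().dropna(how='any')
```

Calling `session_hourly_ohlc(df).loc['2011-02-01']` would then show the bars for a single day, in the same way as the 15-minute example.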

I have searched online for a long time without success. Please help, or suggest some ideas on how to achieve this. Thanks.

2 answers:

Answer 0 (score: 0)

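Below is a partial answer that is correct only for the Close column of the dataframe. As 耶利 said, resample in pandas may not be able to do what I originally intended, so I tried extracting the rows I need with iterrows.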

from datetime import timedelta

import pandas as pd

def extract(df):
    # Collect the selected rows in a list (DataFrame.append was removed in pandas 2.0)
    rows = []
    for index, row in df.iterrows():
        if index.minute == 30 and index.hour < 12:
            # morning rows on the half hour: 09:30, 10:30
            rows.append(row)
        elif index.minute == 0 and index.hour > 12:
            # afternoon rows on the hour: 13:00, 14:00
            rows.append(row)
        elif (index.hour, index.minute) in ((11, 29), (14, 59)):
            # last minute of each session, relabelled as 11:30 and 15:00
            rows.append(row.rename(index + timedelta(minutes=1)))
    return pd.DataFrame(rows)

data = extract(df.loc['2011-02-01'])
data

However, apart from Close, the other columns are incorrect. The result looks like this:

                    Close       High        Low         Open        Volume      turnover
2011-02-01 09:30:00 3081.940    3081.940    3081.940    3081.940    74767100.0  996328900.0
2011-02-01 10:30:00 3071.873    3071.873    3071.873    3071.873    18754100.0  250694100.0
2011-02-01 11:30:00 3073.919    3073.919    3073.919    3073.919    13762700.0  179169200.0
2011-02-01 13:00:00 3074.778    3074.778    3074.778    3074.778    25992700.0  321678500.0
2011-02-01 14:00:00 3072.927    3072.927    3072.927    3072.927    11682300.0  161534600.0
2011-02-01 15:00:00 3077.283    3077.283    3077.283    3077.283    68184500.0  930561900.0

Answer 1 (score: 0)

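I ran into the same problem and could not find help online, so I wrote this script to convert 1-minute OHLC data to 1 hour.

It assumes market hours of 9:15 am to 3:30 pm. If your market's hours differ, just edit start_time and end_time to suit your needs.

I have not added any checks for the case where trading is halted during market hours.

Hope the code helps someone. :)

Sample CSV format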

Date,O,H,L,C,V
2020-03-12 09:15:00,3860,3867.8,3763.35,3830,58630
2020-03-12 09:16:00,3840.05,3859.4,3809.65,3834.6,67155
2020-03-12 09:17:00,3832.55,3855.4,3823.75,3852,51891
2020-03-12 09:18:00,3851.65,3860.95,3846.35,3859,42205
2020-03-12 09:19:00,3859.45,3860,3848.1,3851.55,33194

Code

from pandas import read_csv, to_datetime, DataFrame
from datetime import time

file_path = 'BAJFINANCE-EQ.csv'


def add(data, b):
    # utility function
    # appends the value in dictionary 'b'
    # to corresponding key in dictionary 'data'
    for (key, value) in b.items():
        data[key].append(value)


df = read_csv(file_path, na_filter=False)

# parse the 'Date' column explicitly; infer_datetime_format is deprecated in pandas 2.x
df['Date'] = to_datetime(df['Date'], format='%Y-%m-%d %H:%M:%S')

# stores hourly data to convert to dataframe
data = {
    'Date': [],
    'O': [],
    'H': [],
    'L': [],
    'C': [],
    'V': []
}

start_time = [time(9, 15), time(10, 15), time(11, 15), time(
    12, 15), time(13, 15), time(14, 15), time(15, 15)]

end_time = [time(10, 14), time(11, 14), time(12, 14), time(
    13, 14), time(14, 14), time(15, 14), time(15, 29)]


# Market timings 9:15am to 3:30pm (6 hours 15 mins)
# We create 6 hourly bars and one 15 min bar
# as usually depicted in candlestick charts
i = 0
start_idx = None  # row position where the current bar begins
no_bars = df.shape[0]
while i < no_bars:
    t = df.loc[i, 'Date'].time()

    if t in end_time and start_idx is not None:
        end_idx = i + 1

        hour_df = df[start_idx:end_idx]

        add(data, {
            'Date': df.loc[start_idx, 'Date'],
            'O':    hour_df['O'].iloc[0],
            'H':    hour_df['H'].max(),
            'L':    hour_df['L'].min(),
            'C':    hour_df['C'].iloc[-1],
            'V':    hour_df['V'].sum()
        })
        start_idx = None

    if t in start_time:
        start_idx = i

        # optional optimisation for large datasets
        # skip ahead to loop faster (assumes no missing minutes);
        # not applied to the final 15:15 bar, which has only 15 rows
        if t != start_time[-1]:
            i += 55

    i += 1


df = DataFrame(data=data).set_index(keys=['Date'])
# df.to_csv('out.csv')
print(df)
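For comparison, a minimal sketch of how the same 09:15-anchored bars might be produced with `resample` directly. This is not part of the answer above. It assumes pandas 1.1 or later (for the `offset` argument) and the same `Date,O,H,L,C,V` column layout, with `Date` already parsed to datetimes. `hourly_from_0915` is a hypothetical helper name.

```python
import pandas as pd


def hourly_from_0915(df):
    """Aggregate 1-minute O/H/L/C/V rows into hourly bars starting at 09:15.

    The last bar of each day (15:15) covers only 15 minutes.
    """
    agg = {'O': 'first', 'H': 'max', 'L': 'min', 'C': 'last', 'V': 'sum'}
    out = (df.set_index('Date')
             .between_time('09:15', '15:29')       # keep market hours only
             .resample('60min', offset='15min')    # bin edges at hh:15
             .agg(agg))
    # Bins with no rows (overnight, weekends) have NaN prices; drop them.
    return out.dropna(subset=['O'])
```

Unlike the loop, this does not rely on every minute being present, because rows are assigned to bars by timestamp instead of by position.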