Pandas: resample OHLC intraday data without including times outside regular trading hours

Time: 2017-05-01 21:18:58

Tags: pandas

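I am resampling 30-minute data to hourly data, but the result contains NaN rows for every period across the full 24 hours. I want the resample to produce rows only where the 30-minute records have data. The original df has no extra rows; its data runs from 9:30 to 4:00 over 20 days. The new df_RSHourly also includes weekends.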

df_RSHourly = df.resample('1H', base=0.5).agg(
    {'close': 'last', 'high': 'max', 'low': 'min', 'open': 'first', 'volume': 'sum'}
)

print(df_RSHourly)

2017-04-25 09:30:00-04:00  238.75  238.52  237.91  237.81  151998.0
2017-04-25 10:30:00-04:00  238.62  238.44  238.53  238.33   64281.0
2017-04-25 11:30:00-04:00  238.66  238.56  238.44  238.36   58319.0
2017-04-25 12:30:00-04:00  238.71  238.59  238.56  238.29   47994.0
2017-04-25 13:30:00-04:00  238.82  238.69  238.59  238.52   58266.0
2017-04-25 14:30:00-04:00  238.95  238.84  238.69  238.57   73089.0
2017-04-25 15:30:00-04:00  238.83  238.53  238.83  238.53  103572.0
2017-04-25 16:30:00-04:00     NaN     NaN     NaN     NaN       NaN
2017-04-25 17:30:00-04:00     NaN     NaN     NaN     NaN       NaN
2017-04-25 18:30:00-04:00     NaN     NaN     NaN     NaN       NaN
2017-04-25 19:30:00-04:00     NaN     NaN     NaN     NaN       NaN
2017-04-25 20:30:00-04:00     NaN     NaN     NaN     NaN       NaN
2017-04-25 21:30:00-04:00     NaN     NaN     NaN     NaN       NaN
2017-04-25 22:30:00-04:00     NaN     NaN     NaN     NaN       NaN
2017-04-25 23:30:00-04:00     NaN     NaN     NaN     NaN       NaN
2017-04-26 00:30:00-04:00     NaN     NaN     NaN     NaN       NaN
2017-04-26 01:30:00-04:00     NaN     NaN     NaN     NaN       NaN
2017-04-26 02:30:00-04:00     NaN     NaN     NaN     NaN       NaN
2017-04-26 03:30:00-04:00     NaN     NaN     NaN     NaN       NaN
2017-04-26 04:30:00-04:00     NaN     NaN     NaN     NaN       NaN
2017-04-26 05:30:00-04:00     NaN     NaN     NaN     NaN       NaN
2017-04-26 06:30:00-04:00     NaN     NaN     NaN     NaN       NaN
2017-04-26 07:30:00-04:00     NaN     NaN     NaN     NaN       NaN
2017-04-26 08:30:00-04:00     NaN     NaN     NaN     NaN       NaN
2017-04-26 09:30:00-04:00  238.91  238.87  238.53  238.50   91978.0
2017-04-26 10:30:00-04:00  239.53  239.47  238.88  238.85   75444.0
2017-04-26 11:30:00-04:00  239.48  239.02  239.48  238.70   88402.0
2017-04-26 12:30:00-04:00  239.42  239.20  239.02  238.98   45661.0

2 answers:

Answer 0: (score: 1)

The simplest solution I found is between_time:

df_RSHourly.between_time('09:30', '16:00')

In my code, this is how I applied it:

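y = data['prices'].resample('60S').ohlc()  # 60-second OHLC bars
y = y.ffill()  # forward-fill empty bars (same as fillna(method='ffill'))
y = y.between_time('09:30', '16:00')  # keep regular trading hours only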
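One caveat worth illustrating: between_time filters only on the time of day, so weekend bins that fall inside 09:30-16:00 survive it. The sketch below uses synthetic 30-minute bars (the values and dates are made up) and adds a dropna step on the price columns, which is an addition not present in the answer above. It also assumes pandas 1.1 or later, where `offset='30min'` replaces the older `base` argument.

```python
import pandas as pd

# Synthetic 30-minute bars for Fri 2017-04-28 and Mon 2017-05-01 (illustrative data only).
days = ["2017-04-28", "2017-05-01"]
idx = pd.DatetimeIndex(
    [ts for d in days for ts in pd.date_range(d + " 09:30", d + " 15:30", freq="30min")]
)
df = pd.DataFrame(
    {"open": 100.0, "high": 101.0, "low": 99.0, "close": 100.5, "volume": 1000},
    index=idx,
)

# Hourly bins that start on the half hour (09:30, 10:30, ...).
hourly = df.resample("60min", offset="30min").agg(
    {"open": "first", "high": "max", "low": "min", "close": "last", "volume": "sum"}
)

# between_time removes the overnight bins but keeps Sat/Sun bins inside the session window.
in_session = hourly.between_time("09:30", "16:00")

# Dropping bins that have no prices at all removes the weekend rows as well.
cleaned = in_session.dropna(subset=["open", "high", "low", "close"], how="all")
print(cleaned)
```

The dropna is restricted to the price columns because summing an empty bin gives a volume of 0 rather than NaN.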

Reference:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.between_time.html

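Answer 1: (score: 0)

I had a similar problem with minute resampling, and I found 2 ways to solve it.

The simple but inefficient way

I initially solved it by adding a utility column that checks whether the date/time should be included, and then taking the slice where the condition is True.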

def in_hours(row):
    # row.name is the row's index value, i.e. its timestamp
    if (row.name.hour >= 22
            or row.name.hour < 9
            or (row.name.hour == 9 and row.name.minute < 30)):
        return False
    return True

df['keep'] = df.apply(in_hours, axis=1)
df2 = df[df['keep']]
del df['keep']

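I don't find this particularly elegant or efficient, since the resampling can produce a lot of useless data that is only discarded afterwards, but I couldn't find a smarter way. Also note that "in_hours" needs extra logic if the market closes early!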
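As a possible alternative to the row-by-row apply, the same rule can be evaluated on the whole index at once. This is only a sketch on synthetic minute data; the 09:30 and 22:00 boundaries are taken from in_hours above, and the column name "price" is made up for the example.

```python
import pandas as pd

# Illustrative minute-level index spanning one full day (synthetic data).
idx = pd.date_range("2017-04-25 00:00", "2017-04-25 23:59", freq="1min")
df = pd.DataFrame({"price": range(len(idx))}, index=idx)

# Same rule as in_hours: keep 09:30 <= time < 22:00, computed for all rows at once.
minutes = df.index.hour * 60 + df.index.minute
mask = (minutes >= 9 * 60 + 30) & (minutes < 22 * 60)
df2 = df[mask]

print(df2.index[0], df2.index[-1], len(df2))
```

This avoids both the helper column and the Python-level loop that apply with axis=1 performs, but it still needs extra logic for early closes.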

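The more reliable way

I slice the data at each day's boundaries, resample each day separately, and then concatenate the daily DataFrames. This is more memory- and compute-intensive, but more reliable.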

import pandas as pd

# create a column with the timestamp, used for grouping by day
df['day'] = df.index
# group by day and get the max time, i.e. the time of the last data point of the day
df2 = df['day'].groupby(pd.Grouper(freq='D')).max()
resampled_df_list = []

# resample each day separately
for max_time in df2:
    if pd.notna(max_time):  # NaT on weekends and other days without data
        end_time = max_time
        start_time = max_time.normalize()  # midnight of the same day
        df1d = df.loc[start_time:end_time].resample('1min').mean()
        resampled_df_list.append(df1d)

# put it back together
new_df = pd.concat(resampled_df_list)
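The same per-day idea can probably be written more compactly by grouping on the calendar date of the index, so that only days containing data produce a group. The sketch below uses a few synthetic ticks; the timestamps, prices and the name `ticks` are illustrative assumptions, not data from this thread.

```python
import pandas as pd

# Synthetic irregular ticks on a Friday and the following Monday (illustrative data only).
idx = pd.to_datetime([
    "2017-04-28 09:30:10", "2017-04-28 09:31:40", "2017-04-28 09:33:05",
    "2017-05-01 09:30:20", "2017-05-01 09:31:50",
])
ticks = pd.DataFrame({"price": [10.0, 11.0, 12.0, 20.0, 21.0]}, index=idx)

# Group by calendar day; days without any data never form a group.
daily_parts = [
    day_df.resample("1min").mean()
    for _, day_df in ticks.groupby(ticks.index.normalize())
]

# Each part spans only that day's first-to-last observation, so no overnight rows appear.
new_df = pd.concat(daily_parts)
print(new_df)
```

Because each day is resampled between its own first and last observation, an early close needs no special handling here. Gaps inside a day still show up as NaN rows.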