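I am resampling 30-minute data to hourly data, but the result contains NaN rows for every hour of the 24-hour day, including hours with no 30-minute records. I would like to resample only where the 30-minute records actually contain data. The original df has no extra rows; it holds data from 9:30 to 4:00 for more than 20 days. The new df_RSHourly also includes weekends.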
# base=0.5 shifts the hourly bins so that they start on the half hour
df_RSHourly = df.resample('1H', base=0.5).agg(
    {'close': 'last', 'high': 'max', 'low': 'min', 'open': 'first', 'volume': 'sum'}
)
print(df_RSHourly)
2017-04-25 09:30:00-04:00 238.75 238.52 237.91 237.81 151998.0
2017-04-25 10:30:00-04:00 238.62 238.44 238.53 238.33 64281.0
2017-04-25 11:30:00-04:00 238.66 238.56 238.44 238.36 58319.0
2017-04-25 12:30:00-04:00 238.71 238.59 238.56 238.29 47994.0
2017-04-25 13:30:00-04:00 238.82 238.69 238.59 238.52 58266.0
2017-04-25 14:30:00-04:00 238.95 238.84 238.69 238.57 73089.0
2017-04-25 15:30:00-04:00 238.83 238.53 238.83 238.53 103572.0
2017-04-25 16:30:00-04:00 NaN NaN NaN NaN NaN
2017-04-25 17:30:00-04:00 NaN NaN NaN NaN NaN
2017-04-25 18:30:00-04:00 NaN NaN NaN NaN NaN
2017-04-25 19:30:00-04:00 NaN NaN NaN NaN NaN
2017-04-25 20:30:00-04:00 NaN NaN NaN NaN NaN
2017-04-25 21:30:00-04:00 NaN NaN NaN NaN NaN
2017-04-25 22:30:00-04:00 NaN NaN NaN NaN NaN
2017-04-25 23:30:00-04:00 NaN NaN NaN NaN NaN
2017-04-26 00:30:00-04:00 NaN NaN NaN NaN NaN
2017-04-26 01:30:00-04:00 NaN NaN NaN NaN NaN
2017-04-26 02:30:00-04:00 NaN NaN NaN NaN NaN
2017-04-26 03:30:00-04:00 NaN NaN NaN NaN NaN
2017-04-26 04:30:00-04:00 NaN NaN NaN NaN NaN
2017-04-26 05:30:00-04:00 NaN NaN NaN NaN NaN
2017-04-26 06:30:00-04:00 NaN NaN NaN NaN NaN
2017-04-26 07:30:00-04:00 NaN NaN NaN NaN NaN
2017-04-26 08:30:00-04:00 NaN NaN NaN NaN NaN
2017-04-26 09:30:00-04:00 238.91 238.87 238.53 238.50 91978.0
2017-04-26 10:30:00-04:00 239.53 239.47 238.88 238.85 75444.0
2017-04-26 11:30:00-04:00 239.48 239.02 239.48 238.70 88402.0
2017-04-26 12:30:00-04:00 239.42 239.20 239.02 238.98 45661.0
Answer 0: (score: 1)
The simplest solution I have found is between_time:
df_RSHourly.between_time('09:30', '16:00')
In my code, this is how I apply it:
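# build one-minute OHLC bars from the raw prices
y = data['prices'].resample('60S').ohlc()
# carry the last known bar forward into empty minutes
y = y.ffill()
# keep only rows that fall inside the trading session
y = y.between_time('09:30', '16:00')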
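To see the effect end to end, here is a minimal sketch on synthetic data. Everything in it is an assumption made for illustration: the two trading days, the prices, and the use of `offset="30min"` in place of `base=0.5` (the `base` argument is no longer available in recent pandas releases). It also shows that between_time alone leaves the weekend rows in place, so it is combined with a dropna on the price columns, which is a separate technique from the one in this answer.

```python
import numpy as np
import pandas as pd

# Synthetic 30-minute bars for a Friday and the following Monday (assumed data).
days = ["2017-04-28", "2017-05-01"]
parts = [pd.date_range(f"{d} 09:30", f"{d} 15:30", freq="30min") for d in days]
idx = parts[0].append(parts[1])
price = 100.0 + np.arange(len(idx))
df = pd.DataFrame(
    {"open": price, "high": price + 1, "low": price - 1,
     "close": price + 0.5, "volume": 1000},
    index=idx,
)

# Hourly bins that start on the half hour; empty bins are still emitted.
hourly = df.resample("60min", offset="30min").agg(
    {"open": "first", "high": "max", "low": "min",
     "close": "last", "volume": "sum"}
)

# between_time removes the overnight rows but keeps the weekend rows.
session = hourly.between_time("09:30", "16:00")

# Drop bins with no price at all. The subset matters because the summed
# volume of an empty bin may be 0 rather than NaN, depending on the version.
clean = session.dropna(subset=["open", "high", "low", "close"], how="all")
```

On this sample, `session` still contains Saturday and Sunday rows filled with NaN prices, while `clean` holds only the hours that had data.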
Reference:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.between_time.html
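Answer 1: (score: 0)
I ran into a similar problem when resampling to minutes, and I found two ways to solve it.
I first solved it by adding a utility column that flags whether each date/time should be kept, and then slicing on the rows where the flag is True: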
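# dft is the resampled DataFrame, indexed by timestamp
def in_hours(row):
    # row.name is the row's index label, i.e. its timestamp
    if (row.name.hour >= 22
            or row.name.hour < 9
            or (row.name.hour == 9 and row.name.minute < 30)):
        return False
    return True

dft['keep'] = dft.apply(in_hours, axis=1)
df2 = dft[dft['keep']]
del dft['keep']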
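I do not find this particularly elegant or efficient, because the resample can generate a large amount of useless data that is only discarded afterwards, but I could not find a smarter way. Also note that in_hours needs extra logic if the market closes early!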
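The row-by-row apply can be avoided by building the same condition from the index as a whole. The following is only a sketch: the frame `dft` and its `price` column are hypothetical stand-ins for the resampled data, and the cut-off times are the ones used in in_hours.

```python
import pandas as pd

# Hypothetical stand-in for the resampled frame: one row per minute, two days.
idx = pd.date_range("2017-04-25 00:00", "2017-04-26 23:59", freq="1min")
dft = pd.DataFrame({"price": range(len(idx))}, index=idx)

# Minutes since midnight for every row, computed on the whole index at once.
minutes = dft.index.hour * 60 + dft.index.minute

# Same rule as in_hours: keep 09:30 (inclusive) up to 22:00 (exclusive).
keep = (minutes >= 9 * 60 + 30) & (minutes < 22 * 60)
df2 = dft[keep]
```

This applies the same filter without a helper column, but it still discards rows after the resample has produced them, so the early-close caveat applies here too.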
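The second approach: I slice the data at each day's boundaries, resample one day at a time, and then concatenate the daily DataFrames. This uses more memory and computation, but it is more reliable: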
import pandas as pd

# group the timestamps by day and take the max, i.e. the time of the last
# data point of each day (using the index directly avoids a helper column)
df2 = df.index.to_series().groupby(pd.Grouper(freq='D')).max()
resampled_df_list = []
# resample each day separately
for max_time in df2:
    if pd.notna(max_time):  # NaT on weekends and other days without data
        end_time = max_time
        start_time = max_time.normalize()  # midnight of the same day
        df1d = df.loc[start_time:end_time].resample('1min').mean()
        resampled_df_list.append(df1d)
# put it back together
new_df = pd.concat(resampled_df_list)
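The same per-day idea can be written more compactly by grouping on the calendar day of the index. This is a sketch under assumptions, not the code from this answer: the sample ticks and the `price` column are made up. Because groupby only forms groups for days that contain rows, weekends are skipped without an explicit check.

```python
import pandas as pd

# Hypothetical irregular ticks on a Friday and the following Monday.
idx = pd.to_datetime([
    "2017-04-28 09:30:10", "2017-04-28 09:31:40", "2017-04-28 09:33:05",
    "2017-05-01 09:30:20", "2017-05-01 09:32:50",
])
df = pd.DataFrame({"price": [10.0, 11.0, 12.0, 20.0, 22.0]}, index=idx)

# Group rows by calendar day; days without rows never form a group.
daily = df.groupby(df.index.normalize())

# Resample each day on its own, then stitch the pieces back together.
pieces = [group.resample("1min").mean() for _, group in daily]
new_df = pd.concat(pieces)
```

Each day is resampled only between its own first and last record, so no rows appear for the weekend or for the hours between sessions.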