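There are plenty of posts about slow Pandas groupby operations, but they all seem to differ in some way, and it isn't clear to me how to translate them to my problem.
Let's start with a simple version of the problem that I can solve, and build up from there.
(1) Bin the time series data and create OHLC bars for col1 from every 5 timestamps: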
import pandas as pd
import random
# set seed in case reproducibility becomes useful in the future
random.seed(13)
# create a week's worth of time points
# NOTE: this is evenly spaced but in real life it is not (I can make this more realistic if someone thinks it's important)
periods = 7 * 24 * 60
time_range = pd.date_range('2016-07-01', periods=periods, freq='min')  # 'min' is the non-deprecated alias for the older 'T'
df = pd.DataFrame({'col1': [random.random() for _ in range(len(time_range))], 'col2': [random.randint(1, 10) * random.random() for _ in range(len(time_range))]}, index = time_range)
# pandas has some great methods that do things really fast. For example grouping every 5 time stamps and putting into ohlc bars can be done with
df.reset_index(inplace = True)
print(df.head())
df['col1'].groupby(df.index // 5).ohlc()
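(2) What if I want to add two columns so that we know the start and end timestamp of each bar?
(3) Furthermore, what if we want to group by a more complex function? For example, is there a fast way to create OHLC bars for col1 such that each bar contains the minimum number of timestamps for which the sum of col1 * col2 is >= 10? We would also want to know the open and close stamps.
Here is my working (but slow) attempt: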
# We start by looking for the smallest range of indexes that meets the condition
base_idx = df.index[0] # start the range at the beginning of the DF
group_counter = 1 # all the ranges need to be given group numbers so that they can be grouped at the end
group_column = [0 for idx in df.index] # this column will be added to the DF at the end indicating which row belongs to which group
group_count_to_start_and_end_date_dict = {} # this takes a group number as a key and returns the open and close time stamp for that group
for idx in df.index: # loop through all indexes
    if idx == df.index[-1]: # if idx made it to the end of the DF then just put it all together into the final group even if it doesn't meet the condition to make a group
        group_column[base_idx:idx + 1] = [group_counter] * len(group_column[base_idx:idx + 1])
        group_count_to_start_and_end_date_dict[group_counter] = [df.loc[base_idx, 'index'], df.loc[idx, 'index']]
    elif (df.loc[base_idx:idx, 'col2'] * df.loc[base_idx:idx, 'col1']).sum() >= 10: # if the grouping condition is met then add the new group
        group_column[base_idx:idx] = [group_counter] * len(group_column[base_idx:idx])
        group_count_to_start_and_end_date_dict[group_counter] = [df.loc[base_idx, 'index'], df.loc[idx, 'index']]
        base_idx = idx # start a new range
        group_counter += 1 # start a new group
df['groupings'] = group_column # add groupings column to the df
# perform group by and create ohlc bars
grouped1 = df.groupby('groupings')
grouped = grouped1.col1.ohlc()
# add the open and close time stamps for each bar
grouped['open_stamp'] = grouped.index.map(lambda x: group_count_to_start_and_end_date_dict[x][0])
grouped['close_stamp'] = grouped.index.map(lambda x: group_count_to_start_and_end_date_dict[x][1])
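Can anyone help me improve the performance?
Answer 0 (score: 1)
You can use cumsum on the product of the two columns to create the groupings column, then use array operations to subtract the accumulated amount once the running sum reaches 10 and restart the cumulative sum from there, for example: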
import numpy as np
# need these 2 arrays for the calculation
arr_mult = (df.col1*df.col2).values
arr = arr_mult.cumsum().copy()
gr = np.zeros_like(arr)
for i in range(len(arr)-1):
    if arr[i] >= 10:
        # recalculate the rest of the array once the running sum reaches 10
        arr[i:] -= arr[i] - arr_mult[i]
        # put a one where a new group should start
        gr[i] = 1
df['groupings'] = gr.cumsum() + 1
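As a side note, the loop above rewrites the tail of the array every time a group closes, so it can still do a fair amount of work when there are many groups. Below is a minimal sketch of the same labelling done in a single pass with a running total. The function name `threshold_groups` and its `threshold` parameter are my own, not part of pandas or NumPy, and because it accumulates directly instead of subtracting from a cumsum, it should give the same labels only up to floating-point rounding right at the threshold.

```python
import numpy as np

def threshold_groups(values, threshold=10.0):
    """Return 1-based group labels for `values`.

    A new group starts at the row where the running sum (restarted at
    each group start) reaches `threshold`. The last row never opens a
    new group, mirroring range(len(arr) - 1) in the loop above.
    """
    labels = np.empty(len(values), dtype=np.int64)
    group = 1
    running = 0.0
    last = len(values) - 1
    for i, v in enumerate(values):
        running += v
        if running >= threshold and i < last:
            group += 1      # row i becomes the first row of a new group
            running = v     # restart the running sum at this row
        labels[i] = group
    return labels

# Tiny hand-checkable demo (made-up numbers, not the question's data)
demo = np.array([4, 4, 4, 1, 9, 1, 20, 1], dtype=float)
print(threshold_groups(demo))
```

With the question's frame this would be used as `df['groupings'] = threshold_groups((df.col1 * df.col2).to_numpy())`. The labels come out as integers rather than floats, so the group index would show 1, 2, ... instead of 1.0, 2.0, ...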
Then, to get the result, you can concatenate the ohlc of col1 with first and last applied to the index column:
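# select the 'index' column with brackets so it cannot be confused with an attribute of the groupby object
grouped = pd.concat([df.groupby('groupings').col1.ohlc(),
                     df.groupby('groupings')['index'].agg(['first', 'last'])], axis=1)\
            .rename(columns={'first': 'open_stamp', 'last': 'close_stamp'})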
print (grouped.head())
open high low close open_stamp \
groupings
1.0 0.259008 0.685258 0.259008 0.684082 2016-07-01 00:00:00
2.0 0.849336 0.849336 0.147160 0.225163 2016-07-01 00:03:00
3.0 0.734024 0.837657 0.014432 0.014432 2016-07-01 00:08:00
4.0 0.275837 0.949323 0.146710 0.256708 2016-07-01 00:17:00
5.0 0.849939 0.849939 0.486785 0.486785 2016-07-01 00:27:00
close_stamp
groupings
1.0 2016-07-01 00:02:00
2.0 2016-07-01 00:07:00
3.0 2016-07-01 00:16:00
4.0 2016-07-01 00:26:00
5.0 2016-07-01 00:28:00
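Note that in your code the so-called close_stamp is actually the open_stamp of the next group, whereas I assumed you want the last stamp of the current group, which is what this code returns. I think it should be more efficient than your code.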
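For the simpler case in part (2) of the question (fixed bars of 5 rows), the same first/last idea should carry over directly. Here is a minimal sketch on a small stand-in frame: the 12-row `df` and the names `buckets` and `bars` are illustrative assumptions, with the timestamps sitting in an `index` column as they do after `reset_index` in the question.

```python
import numpy as np
import pandas as pd

# Small stand-in for the question's frame: minute stamps in an 'index' column
rng = np.random.default_rng(13)
stamps = pd.date_range('2016-07-01', periods=12, freq='min')
df = pd.DataFrame({'index': stamps, 'col1': rng.random(len(stamps))})

# Positional bucket id: rows 0-4 -> 0, rows 5-9 -> 1, rows 10-11 -> 2
buckets = np.arange(len(df)) // 5
g = df.groupby(buckets)

bars = g['col1'].ohlc()                   # open/high/low/close per bucket
bars['open_stamp'] = g['index'].first()   # first timestamp in each bucket
bars['close_stamp'] = g['index'].last()   # last timestamp in each bucket
print(bars)
```

Reusing one groupby object for both the ohlc and the stamps avoids grouping twice, and the final bucket is simply shorter when the row count is not a multiple of 5.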