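I am trying to group a set of rows that fall within a specific time window, where the window crosses a date boundary. For example, I want to select the rows from 1-1-2019 23:00:00 to 1-2-2019 4:00:00 and create a new column holding a label for that window, where the label increments each time the window repeats. In addition, I want to take the maximum of the "High" column within each window and store it in a new column "Max_high", and likewise take the minimum of the "Low" column and store it in a new column "Min_low" (see Code 1).
The data is a time series sampled at 5-minute intervals.
I have managed to extract the max_high and min_low values from "High" and "Low" by using iloc and indexing the DataFrame manually, but I would like to insert them as columns into the same DataFrame (see Code 2).
I have tried the between_time() function, as well as where and query in pandas and numpy, but I don't seem to get what I want.
What am I missing?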
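For context on between_time(): as far as I understand it, it selects rows by time of day only, requires a DatetimeIndex, and wraps around midnight when the start time is later than the end time. It returns the filtered rows rather than a label or a boolean column. A minimal sketch on a hypothetical 5-minute series (the frame `demo` and the column `in_window` are made up for illustration, not taken from the data in this question):

```python
import pandas as pd

# Hypothetical 5-minute series spanning one night (not the data from the question).
idx = pd.date_range("2019-01-01 22:00", "2019-01-02 05:00", freq="5min")
demo = pd.DataFrame({"High": range(len(idx))}, index=idx)

# start_time later than end_time: the window wraps around midnight.
# Both endpoints are included by default.
overnight = demo.between_time("23:00", "04:00")

# Turn the filtered rows into a boolean column on the original frame.
demo["in_window"] = demo.index.isin(overnight.index)
```

The boolean column is what makes it possible to write something back into the original frame; the filtered result alone has fewer rows than the source.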
Sample data
Date Time Open High Low Close Up Down
0 05/18/2004 18:05 1090.75 1091.00 1090.75 1091.00 39 9
1 05/18/2004 18:10 1091.00 1091.00 1090.75 1091.00 23 2
2 05/18/2004 18:15 1091.00 1091.00 1090.75 1090.75 55 24
3 05/18/2004 18:20 1091.00 1091.00 1090.75 1090.75 61 458
4 05/18/2004 18:25 1090.75 1091.00 1090.50 1090.50 1 93
5 05/19/2004 00:00 1096.50 1096.50 1096.25 1096.25 11 10
6 05/19/2004 00:05 1096.25 1096.75 1096.25 1096.75 44 10
7 05/19/2004 00:10 1096.75 1096.75 1096.25 1096.50 15 133
8 05/19/2004 00:15 1096.50 1096.50 1096.25 1096.50 16 4
Expected output
Date Time Open High Low Close Up Down /
0 05/18/2004 18:05 1090.75 1091.00 1090.75 1091.00 39 9
1 05/18/2004 18:10 1091.00 1091.00 1090.75 1091.00 23 2
2 05/18/2004 18:15 1091.00 1091.00 1090.75 1090.75 55 24
3 05/18/2004 18:20 1091.00 1091.00 1090.75 1090.75 61 458
4 05/18/2004 18:25 1090.75 1091.00 1090.50 1090.50 1 93
5 05/19/2004 00:00 1096.50 1096.50 1096.25 1096.25 11 10
6 05/19/2004 00:05 1096.25 1096.75 1096.25 1096.75 44 10
7 05/19/2004 00:10 1096.75 1096.75 1096.25 1096.50 15 133
8 05/19/2004 00:15 1096.50 1096.50 1096.25 1096.50 16 4
/ Max_high Min_low
1096.75 1090.75
1096.75 1090.75
1096.75 1090.75
1096.75 1090.75
1096.75 1090.75
1096.75 1090.75
1096.75 1090.75
1096.75 1090.75
Code 1:
import pandas as pd
import numpy as np

data_df = pd.read_csv("data.csv")
# Build a DatetimeIndex from the separate Date and Time columns.
data_df['Datetime'] = pd.to_datetime(data_df['Date'] + ' ' + data_df['Time'])
data_df = data_df.set_index('Datetime')
data_df_adv = data_df.drop(['Up', 'Down'], axis=1)
# Label each row by time of day: 'time1' for 09:00-16:00, 'time2' otherwise.
data_df_adv['label'] = data_df_adv['Time'].apply(lambda x: 'time1' if '09:00:00' <= x < '16:00:00' else 'time2')
# Broadcast the per-(Date, label) maximum of High back onto every row.
data_df_adv['max_high'] = data_df_adv.groupby(['Date', 'label'])['High'].transform('max')
Output of Code 1:
Date Time Open High Low Close \
Datetime
2004-05-18 18:05:00 05/18/2004 18:05 1090.75 1091.00 1090.75 1091.00
2004-05-18 18:10:00 05/18/2004 18:10 1091.00 1091.00 1090.75 1091.00
label max_high label_low
Datetime
2004-05-18 18:05:00 time2 1097.25 1090.25
2004-05-18 18:10:00 time2 1097.25 1090.25
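A note on Code 1: grouping by ['Date', 'label'] splits any window that runs past midnight, because the rows after midnight carry the next calendar date. One possible workaround is to shift each timestamp back by the window's start time before taking the date, so that the whole window maps to the day on which it began. The sketch below assumes a window starting at 23:00 (the example from the description) and uses hypothetical rows; `WINDOW_START` and `session_day` are names introduced here for illustration.

```python
import pandas as pd

# Hypothetical rows: two evenings, each continuing past midnight.
df = pd.DataFrame(
    {
        "High": [10.0, 12.0, 11.0, 20.0, 19.0, 21.0],
        "Low": [9.0, 8.0, 9.5, 18.0, 17.0, 18.5],
    },
    index=pd.to_datetime(
        [
            "2019-01-01 23:00", "2019-01-01 23:55", "2019-01-02 00:05",
            "2019-01-02 23:00", "2019-01-03 00:00", "2019-01-03 03:55",
        ]
    ),
)

WINDOW_START = pd.Timedelta(hours=23)  # assumed window start (23:00)

# Subtracting the start time moves 23:00 to 00:00 of the same day, so rows
# after midnight keep the calendar date on which their window began.
session_day = (df.index - WINDOW_START).normalize()

# transform() returns one value per original row, so the result fits as a column.
df["Max_high"] = df.groupby(session_day)["High"].transform("max")
df["Min_low"] = df.groupby(session_day)["Low"].transform("min")
```

This only handles the grouping key; rows outside the window would still need to be filtered or labelled separately.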
Code 2:
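import math
from datetime import datetime

import pandas as pd

def dateparse(d, t):
    # Parse separate "MM/DD/YYYY" and "HH:MM" strings into one datetime (not used below).
    dt = d + " " + t
    return datetime.strptime(dt, '%m/%d/%Y %H:%M')

final_data = []
# Combine the Date and Time columns into a single Date_Time column.
dt = pd.read_csv('data.csv')
dt['Date_Time'] = pd.to_datetime(dt['Date'] + ' ' + dt['Time'])
start_date = dt['Date_Time'].iloc[0]
start_year = dt['Date_Time'].iloc[0].year
start_month = dt['Date_Time'].iloc[0].month
end_date = dt['Date_Time'].iloc[-1]
end_year = dt['Date_Time'].iloc[-1].year
end_month = dt['Date_Time'].iloc[-1].month
count = 1
count1 = 1
for y in range(int(start_year), int(end_year) + 1):
    for m in range(int(start_month), int(end_month) + 1):
        for d in range(1, 31):
            st_d = format(y, '04d') + "-" + format(m, '02d') + "-" + format(d, '02d')
            # Each period runs from 16:00 on day d to 09:30 on day d+1.
            st_dt = format(y, '04d') + "-" + format(m, '02d') + "-" + format(d, '02d') + " 16:00:00"
            ed_dt = format(y, '04d') + "-" + format(m, '02d') + "-" + format(int(d + 1), '02d') + " 09:30:00"
            df = dt[(dt['Date_Time'] >= st_dt) & (dt['Date_Time'] < ed_dt)]
            if not math.isnan(df['High'].max()):
                final_data.append(["period" + str(count), str(df['High'].max()), str(df['Low'].min())])
                # Labelling attempt, left commented out because it does not work:
                # between_time() needs a DatetimeIndex and returns a DataFrame,
                # which cannot be used as an if-condition.
                # if(dt.between_time('16:00','9:00')):
                #     dt['period'] = "period"+str(count)
                # else:
                #     dt['period'] = "period0"
            # Counter advances once per day; inferred from the period numbers in the output.
            count += 1
df1 = pd.DataFrame(final_data)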
Output of Code 2:
0 period18 1100.25 1090.25
1 period19 1090.0 1084.25
2 period20 1099.75 1088.75
3 period21 1094.0 1092.25
4 period23 1101.25 1092.75
5 period24 1098.75 1090.25
6 period25 1113.75 1108.75
7 period26 1119.5 1113.75
8 period27 1123.75 1118.75
9 period28 1123.25 1120.0
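For reference, here is a rough sketch of how the three pieces described at the top (a window mask, an incrementing label, and per-window max/min written back as columns) might fit together without looping over dates. It assumes the 23:00 to 04:00 window from the description and runs on a hypothetical 5-minute series, not on data.csv. The names `in_window`, `starts`, `inside` and `label` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical 5-minute series over two days (not the data from the question).
idx = pd.date_range("2019-01-01 12:00", "2019-01-03 12:00", freq="5min")
df = pd.DataFrame({"High": np.arange(len(idx), dtype=float)}, index=idx)
df["Low"] = df["High"] - 1.0

# Boolean mask: True where the time of day is within 23:00-04:00 (wraps midnight).
mask = np.zeros(len(df), dtype=bool)
mask[df.index.indexer_between_time("23:00", "04:00")] = True
in_window = pd.Series(mask, index=df.index)

# A window starts where a row is inside and the previous row is not.
starts = in_window & ~in_window.shift(fill_value=False)

# Running count of window starts gives 1, 2, 3, ...; rows outside become NaN.
df["label"] = starts.cumsum().where(in_window)

# Compute per-window extremes on the in-window rows only; assigning back
# aligns on the index, so rows outside the window stay NaN.
inside = df[in_window]
df["Max_high"] = inside.groupby("label")["High"].transform("max")
df["Min_low"] = inside.groupby("label")["Low"].transform("min")
```

The counting step relies on the rows being sorted by time and on there being at least one row outside the window between consecutive windows; with gaps in the data that assumption may need checking.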