How do I group rows by a time-of-day window and create a new column with a label that increments each time the window repeats?

Asked: 2019-05-25 12:21:18

Tags: python python-3.x pandas dataframe pandas-groupby

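I am trying to group a set of rows that fall within a specific time window when that window crosses a date boundary. For example, I want to select the rows from 1-1-2019 23:00:00 to 1-2-2019 4:00:00 and create a new column holding a label for that window, with the label incrementing each time the window repeats. I also want to take the maximum of the "High" column within each window and store it in a new column "Max_high", and likewise take the minimum of the "Low" column and store it in a new column "Min_low" (see Code 1).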
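One way to read the requirement above: mark the rows whose time of day falls inside the window, then bump a counter every time a new run of in-window rows begins. The sketch below does that on made-up hourly timestamps. `label_sessions`, its parameters, the `period` prefix, and the choice to treat the end of the window as exclusive are all assumptions for illustration, not something from the post. It also assumes the rows are sorted by time and that at least one out-of-window row separates two consecutive windows.

```python
import pandas as pd

def label_sessions(times, start_hour=23, end_hour=4, prefix="period"):
    """Hypothetical helper: label rows inside a daily window that may wrap midnight."""
    hours = times.dt.hour
    if start_hour > end_hour:
        # Window wraps past midnight, e.g. 23:00 -> 04:00.
        in_window = (hours >= start_hour) | (hours < end_hour)
    else:
        in_window = (hours >= start_hour) & (hours < end_hour)
    # A window starts where a row is inside it and the previous row was not.
    starts = in_window & ~in_window.shift(fill_value=False)
    # Running count of window starts; rows outside any window get 0.
    number = starts.cumsum().where(in_window, 0).astype(int)
    return prefix + number.astype(str)

# Made-up hourly timestamps covering two nights.
times = pd.Series(pd.date_range("2019-01-01 22:00", "2019-01-03 05:00", freq="60min"))
labels = label_sessions(times)
print(pd.DataFrame({"time": times, "label": labels}).to_string())
```

Rows outside the window come out as `period0`, the first night as `period1`, and the second night as `period2`.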

The data is a time series sampled at 5-minute intervals.

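I have managed to extract the max_high and min_low values from "High" and "Low" by using iloc and indexing the DataFrame manually, but I would like to insert them as columns in the same DataFrame (see Code 2).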
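For getting the values back into the same DataFrame as columns, `groupby(...).transform(...)` may be what is needed, since it returns one value per original row. The sketch below rebuilds the sample rows by hand and assumes the window runs from 16:00 to 09:00 the next morning. The `session` column and the trick of shifting timestamps back by 16 hours so a whole night shares one calendar date are illustrative assumptions, not part of the original code.

```python
import pandas as pd

# The nine sample rows from the question (only the columns used here).
df = pd.DataFrame({
    "Date": ["05/18/2004"] * 5 + ["05/19/2004"] * 4,
    "Time": ["18:05", "18:10", "18:15", "18:20", "18:25",
             "00:00", "00:05", "00:10", "00:15"],
    "High": [1091.00, 1091.00, 1091.00, 1091.00, 1091.00,
             1096.50, 1096.75, 1096.75, 1096.50],
    "Low": [1090.75, 1090.75, 1090.75, 1090.75, 1090.50,
            1096.25, 1096.25, 1096.25, 1096.25],
})
df["Datetime"] = pd.to_datetime(df["Date"] + " " + df["Time"],
                                format="%m/%d/%Y %H:%M")

# Assumed window: 16:00 -> 09:00 next day. After shifting back 16 hours,
# the window becomes 00:00 -> 17:00 on a single calendar date.
shifted = df["Datetime"] - pd.Timedelta(hours=16)
overnight = shifted.dt.hour < 17
# Rows outside the window get NaT and therefore belong to no session.
df["session"] = shifted.dt.normalize().where(overnight)

# transform() broadcasts each session's aggregate back onto its rows.
df["Max_high"] = df.groupby("session")["High"].transform("max")
df["Min_low"] = df.groupby("session")["Low"].transform("min")
print(df[["Datetime", "High", "Low", "Max_high", "Min_low"]])
```

On these nine rows everything falls into one session, so `Max_high` is 1096.75 and `Min_low` is 1090.50 on every row.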

I have tried the between_time() function, as well as where and query in pandas and NumPy, but I can't seem to get what I want.
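For reference, `between_time` does accept a start time later than the end time and wraps past midnight, but it needs a `DatetimeIndex` and returns the filtered rows rather than a label. A possible way to turn it into a row-wise mask is `DatetimeIndex.indexer_between_time`, sketched below on made-up hourly data. The `label` column and the `overnight`/`day` values are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up hourly rows: 22:00, 23:00, 00:00, ..., 06:00.
idx = pd.date_range("2019-01-01 22:00", periods=9, freq="60min")
df = pd.DataFrame({"High": range(9)}, index=idx)

# Filtered view: rows from 23:00 through 04:00 (both ends included by default).
overnight_rows = df.between_time("23:00", "04:00")

# Same selection as integer positions, turned into a boolean mask.
pos = df.index.indexer_between_time("23:00", "04:00")
mask = pd.Series(False, index=df.index)
mask.iloc[pos] = True

# The mask can then drive np.where (or a groupby key) on the full frame.
df["label"] = np.where(mask, "overnight", "day")
print(df)
```

Note that the result is a plain `if`-unfriendly object: a DataFrame or mask cannot be used directly as the condition of an `if` statement, which is why the mask is applied with `np.where` here.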

What am I missing?

Sample data



              Date   Time     Open     High      Low    Close    Up  Down
0        05/18/2004  18:05  1090.75  1091.00  1090.75  1091.00    39     9
1        05/18/2004  18:10  1091.00  1091.00  1090.75  1091.00    23     2
2        05/18/2004  18:15  1091.00  1091.00  1090.75  1090.75    55    24
3        05/18/2004  18:20  1091.00  1091.00  1090.75  1090.75    61   458
4        05/18/2004  18:25  1090.75  1091.00  1090.50  1090.50     1    93
5        05/19/2004  00:00  1096.50  1096.50  1096.25  1096.25    11    10
6        05/19/2004  00:05  1096.25  1096.75  1096.25  1096.75    44    10
7        05/19/2004  00:10  1096.75  1096.75  1096.25  1096.50    15    133
8        05/19/2004  00:15  1096.50  1096.50  1096.25  1096.50    16    4

Expected output

         Date        Time   Open     High      Low     Close      Up  Down  /
0        05/18/2004  18:05  1090.75  1091.00  1090.75  1091.00    39     9
1        05/18/2004  18:10  1091.00  1091.00  1090.75  1091.00    23     2
2        05/18/2004  18:15  1091.00  1091.00  1090.75  1090.75    55    24
3        05/18/2004  18:20  1091.00  1091.00  1090.75  1090.75    61   458
4        05/18/2004  18:25  1090.75  1091.00  1090.50  1090.50     1    93
5        05/19/2004  00:00  1096.50  1096.50  1096.25  1096.25    11    10
6        05/19/2004  00:05  1096.25  1096.75  1096.25  1096.75    44    10
7        05/19/2004  00:10  1096.75  1096.75  1096.25  1096.50    15    133
8        05/19/2004  00:15  1096.50  1096.50  1096.25  1096.50    16    4

/  Max_high   Min_low
    1096.75    1090.75
    1096.75    1090.75
    1096.75    1090.75
    1096.75    1090.75
    1096.75    1090.75
    1096.75    1090.75
    1096.75    1090.75
    1096.75    1090.75

Code 1:

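import pandas as pd
import numpy as np

data_df = pd.read_csv("data.csv")
# Build a DatetimeIndex from the separate Date and Time columns.
data_df['Datetime'] = pd.to_datetime(data_df['Date'] + ' ' + data_df['Time'])
data_df = data_df.set_index('Datetime')
data_df_adv = data_df.drop(['Up', 'Down'], axis=1)
# Label rows by time of day: 'time1' for 09:00-16:00, 'time2' otherwise.
# The string comparison works because Time is zero-padded HH:MM.
data_df_adv['label'] = data_df_adv['Time'].apply(lambda x: 'time1' if '09:00:00' <= x < '16:00:00' else 'time2')
# Broadcast the maximum High of each (Date, label) group onto its rows.
data_df_adv['max_high'] = data_df_adv.groupby(['Date', 'label'])['High'].transform('max')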

Output of Code 1:

                           Date   Time     Open     High      Low    Close  \
Datetime                                                                     
2004-05-18 18:05:00  05/18/2004  18:05  1090.75  1091.00  1090.75  1091.00   
2004-05-18 18:10:00  05/18/2004  18:10  1091.00  1091.00  1090.75  1091.00   

                     label  max_high  label_low  
Datetime                                         
2004-05-18 18:05:00  time2   1097.25    1090.25  
2004-05-18 18:10:00  time2   1097.25    1090.25  

Code 2:

import pandas as pd
import math

def dateparse(d,t):
    dt = d + " " + t
    return pd.datetime.strptime(dt, '%m/%d/%Y %H:%M')

final_data=[]
dt= pd.read_csv('data.csv',parse_dates=[['Date','Time']])

start_date  = dt['Date_Time'].iloc[0]
start_year  = dt['Date_Time'].iloc[0].year
start_month = dt['Date_Time'].iloc[0].month
end_date    = dt['Date_Time'].iloc[-1]
end_year    = dt['Date_Time'].iloc[-1].year
end_month   = dt['Date_Time'].iloc[-1].month

count = 1
count1 = 1
for y in range(int(start_year),int(end_year)+1):
    for m in range(int(start_month),int(end_month)+1):
        for d in range(1,31):
            st_d = format(y,'04d')+"-"+format(m,'02d')+"-"+format(d,'02d')
            st_dt = format(y,'04d')+"-"+format(m,'02d')+"-"+format(d,'02d')+" 16:00:00"
            ed_dt = format(y,'04d')+"-"+format(m,'02d')+"-"+format(int(d+1),'02d')+" 09:30:00"
            df = dt[(dt['Date_Time'] >= st_dt) & (dt['Date_Time'] < ed_dt)]
            if(not math.isnan(df['High'].max())):
                final_data.append(["period"+str(count), str(df['High'].max()), str(df['Low'].min())])
            if(dt.between_time('16:00','9:00')):
                dt['period'] = "period"+str(count)
            else:
                dt['period'] = "period0"

df1=pd.DataFrame(final_data)

Output of Code 2:

0   period18    1100.25 1090.25
1   period19    1090.0  1084.25
2   period20    1099.75 1088.75
3   period21    1094.0  1092.25
4   period23    1101.25 1092.75
5   period24    1098.75 1090.25
6   period25    1113.75 1108.75
7   period26    1119.5  1113.75
8   period27    1123.75 1118.75
9   period28    1123.25 1120.0
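A per-period summary like the one above can probably be produced without the nested year/month/day loops. The sketch below is one possible vectorized version under these assumptions: the window is 16:00 to 09:30 the next day (as in Code 2), the rows are made up to stand in for data.csv, and periods are numbered consecutively over the sessions that actually contain data, which differs from the numbering in the output above. The names `night`, `session`, and `summary` are hypothetical.

```python
import pandas as pd

# Made-up rows standing in for data.csv.
rows = [
    ("2004-05-18 18:05", 1091.00, 1090.75),
    ("2004-05-18 23:55", 1093.50, 1090.25),
    ("2004-05-19 00:10", 1096.75, 1096.25),
    ("2004-05-19 12:00", 1099.00, 1098.00),  # daytime, outside the window
    ("2004-05-19 16:30", 1095.00, 1094.50),
    ("2004-05-20 08:45", 1097.25, 1093.00),
]
dt = pd.DataFrame(rows, columns=["Date_Time", "High", "Low"])
dt["Date_Time"] = pd.to_datetime(dt["Date_Time"])

# Shift back 16 hours so the 16:00 -> 09:30 window becomes 00:00 -> 17:30
# on one calendar date; that date identifies the session.
shifted = dt["Date_Time"] - pd.Timedelta(hours=16)
overnight = shifted.dt.hour * 60 + shifted.dt.minute < 17 * 60 + 30

night = dt[overnight].copy()
night["session"] = shifted[overnight].dt.normalize()
# ngroup() numbers the sessions 0, 1, 2, ... in date order.
night["period"] = "period" + (night.groupby("session").ngroup() + 1).astype(str)

# One summary row per session.
summary = (night.groupby(["session", "period"], as_index=False)
                .agg(max_high=("High", "max"), min_low=("Low", "min")))

# Attach the labels to the full frame; rows outside every window get period0.
dt["period"] = night["period"].reindex(dt.index, fill_value="period0")
print(summary)
print(dt)
```

Using timestamp arithmetic instead of formatted date strings also avoids building invalid dates such as day 31 of a 30-day month.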

0 answers:

No answers yet