I have a large data frame of the following form:
Date Time station num_bikes num_racks
01/10/18 3.02 Girwood 5 6
01/10/18 3.03 Girwood 6 5
01/10/18 3.04 Girwood 2 9
01/10/18 3.05 Girwood 9 2
12/08/18 4.10 Fraser 0 14
12/08/18 4.11 Fraser 1 13
12/08/18 4.12 Fraser 0 14
03/09/18 10.10 Carslile 2 8
03/09/18 10.11 Carslile 4 6
03/09/18 10.12 Carslile 0 10
24/09/18 10.10 Girwood 9 3
24/09/18 10.11 Girwood 10 2
24/09/18 10.12 Girwood 4 8
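The variable "num_bikes" is the number of bikes present at the station at the given date and time, and "num_racks" is the number of empty racks available at the station at that date and time. I have data for every second over 3 months, and I would like to determine the number of bike arrivals and bike departures for each station in each month.
I would like the output to look like this: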
Station Month Arrivals Departures
Girwood August 5 2
Fraser August 1 1 ie
Girwood September 1 6 ie
Girwood October 3 4 ie
Carslile September 2 4 ie
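Answer 0 (score: 0)
Your expected output still does not match the data in your sample dataset and, as far as I can tell, is incorrect. I think this is what you are trying to achieve, but I am not 100% sure.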
import pandas as pd

# convert date to datetime (the dates are day-first, e.g. 24/09/18)
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
# create month col
df['month'] = df['Date'].dt.month_name()
# create a copy so you are not updating the original df
df2 = df.copy()
# group by month and station, then take the row-to-row change in num_bikes
# (the first row of each group has no previous row, so its diff is NaN)
df2['diff'] = df2.groupby(['month', 'station'])['num_bikes'].diff()
# sum all positive changes as arrivals, because every increase in num_bikes is an arrival
# sum all negative changes as departures, because every decrease in num_bikes is a departure
df2.groupby(['month', 'station'])['diff'].agg([
    ('arrivals', lambda x: x[x > 0].sum()),
    ('departures', lambda x: abs(x[x < 0].sum())),
])
                    arrivals  departures
month     station
August    Fraser         1.0         1.0
October   Girwood        8.0         4.0
September Carslile       2.0         4.0
          Girwood        1.0         6.0
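For anyone who wants to reproduce this end to end, here is a minimal self-contained sketch of the same diff-and-sum idea, run on the sample rows from the question. It makes a few assumptions that are not stated in the original post: the rows are already in chronological order within each station (real data should probably be sorted on a full timestamp first), the explicit format string "%d/%m/%y" matches the dates shown, and the names `rows`, `change` and `summary` are only illustrative. Note also that a bike arriving and another leaving between two consecutive readings would cancel out and not be counted.

```python
import pandas as pd

# Sample rows copied from the question.
rows = [
    ("01/10/18", 3.02, "Girwood", 5, 6),
    ("01/10/18", 3.03, "Girwood", 6, 5),
    ("01/10/18", 3.04, "Girwood", 2, 9),
    ("01/10/18", 3.05, "Girwood", 9, 2),
    ("12/08/18", 4.10, "Fraser", 0, 14),
    ("12/08/18", 4.11, "Fraser", 1, 13),
    ("12/08/18", 4.12, "Fraser", 0, 14),
    ("03/09/18", 10.10, "Carslile", 2, 8),
    ("03/09/18", 10.11, "Carslile", 4, 6),
    ("03/09/18", 10.12, "Carslile", 0, 10),
    ("24/09/18", 10.10, "Girwood", 9, 3),
    ("24/09/18", 10.11, "Girwood", 10, 2),
    ("24/09/18", 10.12, "Girwood", 4, 8),
]
df = pd.DataFrame(
    rows, columns=["Date", "Time", "station", "num_bikes", "num_racks"]
)

# Dates are day-first; the explicit format is an assumption based on the sample.
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%y")
df["month"] = df["Date"].dt.month_name()

# Row-to-row change in bike count within each (month, station) group.
# The first row of each group has no predecessor, so its change is NaN.
change = df.groupby(["month", "station"])["num_bikes"].diff()

# Positive changes count as arrivals, negative changes as departures;
# everything else (including the NaN first rows) contributes 0.
summary = (
    df.assign(
        arrivals=change.where(change > 0, 0),
        departures=change.where(change < 0, 0).abs(),
    )
    .groupby(["month", "station"])[["arrivals", "departures"]]
    .sum()
)
print(summary)
```

On the sample data this gives the same counts as the table above, for example 8 arrivals and 4 departures for Girwood in October.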