Given the following df:
datetimeindex store sale category weekday
2018-10-13 09:27:01 gbn01 59.99 sporting 1
2018-10-13 09:27:01 gbn02 19.99 sporting 1
2018-10-13 09:27:02 gbn03 15.99 hygine 1
2018-10-13 09:27:03 gbn05 39.99 camping 1
....
2018-10-16 11:59:01 gbn01 19.99 other 0
2018-10-16 11:59:01 gbn02 49.99 sporting 0
2018-10-16 11:59:02 gbn03 10.00 food 0
2018-10-16 11:59:03 gbn05 89.99 electro 0
2018-10-16 12:30:03 gbn01 52.99
....
2018-10-16 21:05:03 gbn03 25.00 alcohol 0
2018-10-16 22:43:03 gbn01 10.05 health 0
Update
After re-reading the requirements: mean_sales should be computed per timestamp, for that store, within the given window (08:00 to 18:00, or 12:00 to 13:00). My current idea is the pseudo-function below, but it only works when the data is sorted by datetimeindex, store:
# Lunch_Time_Mean
count = 0
Lunch_Sum_Previous = 0
for r in df:
    if LunchHours & WeekDay:
        count += 1
        if count == 1:
            r.Lunch_Mean = r.sale
            Lunch_Sum_Previous = r.sale
        elif count > 1:
            r.Lunch_Mean = (Lunch_Sum_Previous + r.sale) / count
            Lunch_Sum_Previous += r.sale
    else:
        r.Lunch_Mean = 1
        count = 0
        Lunch_Sum_Previous = 0
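The running lunch-window mean that this pseudo-function describes can be sketched in pandas with groupby plus expanding. This is a sketch under assumptions: the sample data is hypothetical, the window is taken as 12:00-13:30, and rows outside the window get the placeholder value 1, matching the pseudocode.

```python
import pandas as pd
from datetime import time

# Hypothetical sample: one store, one weekday (weekday=0 means Mon-Fri,
# as in the question), timestamps inside and outside the lunch window.
df = pd.DataFrame({
    'datetimeindex': pd.to_datetime([
        '2018-10-16 11:30:00', '2018-10-16 12:10:00',
        '2018-10-16 12:40:00', '2018-10-16 13:10:00',
        '2018-10-16 14:00:00']),
    'store': ['gbn01'] * 5,
    'sale': [10.0, 20.0, 30.0, 40.0, 50.0],
    'weekday': [0, 0, 0, 0, 0],
})

t = df['datetimeindex'].dt.time
in_lunch = (df['weekday'] == 0) & (t >= time(12, 0)) & (t <= time(13, 30))
day = df['datetimeindex'].dt.date.rename('day')

# Expanding mean of lunch-window sales per store and day; rows outside
# the window fall back to the placeholder 1, as in the pseudocode.
lunch_mean = (df[in_lunch]
              .groupby(['store', day[in_lunch]])['sale']
              .expanding().mean()
              .reset_index(level=[0, 1], drop=True))
df['Lunch_Mean'] = lunch_mean.reindex(df.index).fillna(1)
```

Because the expanding mean is computed per (store, day) group, this does not depend on the frame being pre-sorted the way the pseudo-function does.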
The logic above, mapped onto a table:
datetimeindex store IsWorkingHour count sales working_hour_sum working_hour_cumsum working_hour_mean_sales
13/10/2018 07:27 gbn01 0 0 39.18 0 0 1
13/10/2018 08:27 gbn01 1 1 31.69 31.69 31.69 1
13/10/2018 09:27 gbn01 1 2 99.19 99.19 130.88 1
13/10/2018 10:27 gbn01 1 3 25.89 25.89 156.77 1
13/10/2018 11:27 gbn01 1 4 19.10 19.10 175.87 1
13/10/2018 12:27 gbn01 1 5 82.51 82.51 258.38 1
13/10/2018 13:27 gbn01 1 6 10.82 10.82 269.2 1
13/10/2018 14:27 gbn01 1 7 10.43 10.43 279.63 1
13/10/2018 15:27 gbn01 1 8 15.83 15.83 295.46 1
13/10/2018 16:27 gbn01 1 9 12.53 12.53 307.99 1
13/10/2018 17:27 gbn01 1 10 10.03 10.03 318.02 1
13/10/2018 18:27 gbn01 0 0 54.14 0 0 1
13/10/2018 19:27 gbn01 0 0 20.04 0 0 1
#Above entries have working_hour_mean_sales of 1 because 13/10/2018 falls on a weekend.
16/10/2018 07:27 gbn01 0 0 13.34 0 0 1
16/10/2018 08:27 gbn01 1 1 15.84 15.84 15.84 15.84
16/10/2018 09:27 gbn01 1 2 19.14 19.14 34.98 17.49
16/10/2018 10:27 gbn01 1 3 11.64 11.64 46.62 15.54
16/10/2018 11:27 gbn01 1 4 17.54 17.54 64.16 16.04
16/10/2018 12:27 gbn01 1 5 20.84 20.84 85 17
16/10/2018 13:27 gbn01 1 6 50.05 50.05 135.05 22.51
16/10/2018 14:27 gbn01 1 7 10.05 10.05 145.1 20.73
16/10/2018 15:27 gbn01 1 8 13.35 13.35 158.45 19.81
16/10/2018 16:27 gbn01 1 9 32.55 32.55 191 21.22
16/10/2018 17:27 gbn01 1 10 13.36 13.36 204.36 20.44
16/10/2018 18:27 gbn01 0 0 10.86 0 0 1
16/10/2018 19:27 gbn01 0 0 20.06 0 0 1
I'm trying to use the above to produce a new df like this:
#I've simplified it to a single condition and store
datetimeindex store working_hour_mean_sales
13/10/2018 07:27 gbn01 1
13/10/2018 08:27 gbn01 1
13/10/2018 09:27 gbn01 1
13/10/2018 10:27 gbn01 1
13/10/2018 11:27 gbn01 1
13/10/2018 12:27 gbn01 1
13/10/2018 13:27 gbn01 1
13/10/2018 14:27 gbn01 1
13/10/2018 15:27 gbn01 1
13/10/2018 16:27 gbn01 1
13/10/2018 17:27 gbn01 1
13/10/2018 18:27 gbn01 1
13/10/2018 19:27 gbn01 1
#Above working_hour_mean_sales=1 because 13/10/2018 was a weekend
16/10/2018 07:27 gbn01 1
16/10/2018 08:27 gbn01 15.84
16/10/2018 09:27 gbn01 17.49
16/10/2018 10:27 gbn01 15.54
16/10/2018 11:27 gbn01 16.04
16/10/2018 12:27 gbn01 17
16/10/2018 13:27 gbn01 22.51
16/10/2018 14:27 gbn01 20.73
16/10/2018 15:27 gbn01 19.81
16/10/2018 16:27 gbn01 21.22
16/10/2018 17:27 gbn01 20.44
16/10/2018 18:27 gbn01 1
16/10/2018 19:27 gbn01 1
"Working hours" are Mon-Fri 08:00-18:00, and the "weekday lunch peak" is 12:00-13:30.
(N.B. I did make the, to me at least, counter-intuitive decision that weekday=0 denotes Mon-Fri.)
Any suggestions on implementing this in pandas would be greatly appreciated!
Answer 0 (score: 1)
You can use groupby(), agg(), and between().
This aggregates the results for the Mon-Fri weekday lunch peak:
df[(df['datetimeindex'].dt.strftime('%H:%M:%S').between('12:00:00','13:30:00')) & (df['weekday']==0)].groupby(['store','category']).agg({'sale': 'mean'})
This aggregates the results for Mon-Fri working hours:
df[(df['datetimeindex'].dt.strftime('%H:%M:%S').between('08:00:00','18:00:00')) & (df['weekday']==0)].groupby(['store','category']).agg({'sale': 'mean'})
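Note that dt.strftime produces strings, so between() here compares zero-padded 'HH:MM:SS' strings lexicographically, which works but formats every row. An equivalent filter using dt.time (a sketch with hypothetical sample data, not part of the original answer):

```python
import pandas as pd
from datetime import time

# Hypothetical two-row sample: one timestamp inside the 12:00-13:30
# lunch window, one outside it.
df = pd.DataFrame({
    'datetimeindex': pd.to_datetime(['2018-10-16 12:30:00',
                                     '2018-10-16 19:00:00']),
    'store': ['gbn01', 'gbn01'],
    'sale': [20.0, 30.0],
    'weekday': [0, 0],
})

# Compare datetime.time objects directly instead of formatted strings.
t = df['datetimeindex'].dt.time
lunch = df[(t >= time(12, 0)) & (t <= time(13, 30)) & (df['weekday'] == 0)]
print(lunch['sale'].mean())  # only the 12:30 row qualifies
```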
Answer 1 (score: 0)
Try splitting the data into batches and aggregating each batch. Finally, join the results, divide by the number of entries, and put the result into the desired columns.
There are many ways to batch the data, but following your example I would suggest grouping it by category, computing everything per category, and then joining that into the final table.
Hope this helps :)
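The batch-and-join idea above can be sketched like this (hypothetical data and column names; a per-category mean stands in for whatever per-batch aggregate is needed):

```python
import pandas as pd

# Hypothetical sample rows.
df = pd.DataFrame({
    'store': ['gbn01', 'gbn02', 'gbn01'],
    'category': ['sporting', 'sporting', 'food'],
    'sale': [10.0, 30.0, 5.0],
})

# Aggregate per category (the "batch"), then merge the result
# back onto the original rows.
per_cat = (df.groupby('category', as_index=False)['sale']
             .mean()
             .rename(columns={'sale': 'category_mean_sale'}))
joined = df.merge(per_cat, on='category')
```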
Answer 2 (score: 0)
This should point you toward the logic you need. Basically, you define two new columns, workinghours and weekdaylunchpeak, and aggregate using sqlcode (there are other ways to do this as well).
import pandas as pd
import pandasql as ps
import numpy as np
from datetime import time
mydata = pd.DataFrame(data={
    'datetimeindex': ['13/10/2018 09:27:01', '13/10/2018 09:27:02',
                      '13/10/2018 09:27:03', '13/10/2018 09:27:04',
                      '16/10/2018 11:59:01', '16/10/2018 11:59:02',
                      '16/10/2018 11:59:03', '16/10/2018 11:59:04',
                      '16/10/2018 21:05:01', '16/10/2018 22:43:01'],
    'store': ['gbn01', 'gbn02', 'gbn03', 'gbn05', 'gbn01',
              'gbn02', 'gbn03', 'gbn05', 'gbn03', 'gbn01'],
    'sale': [59.99, 19.99, 15.99, 39.99, 19.99,
             49.99, 10, 89.99, 25, 10.05],
    'category': ['sporting', 'sporting', 'hygine', 'camping', 'other',
                 'sporting', 'food', 'electro', 'alcohol', 'health'],
    'weekday': [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
})
mydata['datetimeindex'] = pd.to_datetime(mydata['datetimeindex'], dayfirst=True)
mydata['workinghours'] = np.where(
    (mydata.datetimeindex.dt.time >= time(8, 0))
    & (mydata.datetimeindex.dt.time <= time(18, 0))
    & (mydata.weekday == 0),
    1, 0)
mydata['weekdaylunchpeak'] = np.where(
    (mydata.datetimeindex.dt.time >= time(12, 0))
    & (mydata.datetimeindex.dt.time <= time(13, 30))
    & (mydata.weekday == 0),
    1, 0)
sqlcode = '''
SELECT
store,
category,
avg(case when workinghours=1 then sale else 0 end) AS working_hours_mean_sales,
avg(case when weekdaylunchpeak=1 then sale else 0 end) AS weekday_lunch_peak_mean_sales
FROM mydata
GROUP BY
store,
category
;
'''
newdf = ps.sqldf(sqlcode,locals())
newdf
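For comparison, the same aggregation can be done without pandasql. Note that the SQL's else 0 keeps non-qualifying rows in the denominator of avg(); the sketch below, using a small hypothetical stand-in for mydata, reproduces that behaviour with named aggregation:

```python
import pandas as pd

# Tiny stand-in for mydata after the flag columns have been added.
d = pd.DataFrame({
    'store': ['gbn01', 'gbn01', 'gbn02'],
    'category': ['sporting', 'sporting', 'food'],
    'sale': [10.0, 30.0, 5.0],
    'workinghours': [1, 0, 1],
    'weekdaylunchpeak': [0, 0, 1],
})

# Zero out non-qualifying sales (like "case when ... else 0 end"),
# then take the mean over all rows of each group.
out = (d.assign(wh=d['sale'] * d['workinghours'],
                lp=d['sale'] * d['weekdaylunchpeak'])
        .groupby(['store', 'category'], as_index=False)
        .agg(working_hours_mean_sales=('wh', 'mean'),
             weekday_lunch_peak_mean_sales=('lp', 'mean')))
```

If you instead want the mean over only the qualifying rows, filter the frame before grouping rather than zeroing the sales.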