An efficient way to create calculated columns for a Pandas DataFrame

Time: 2018-10-16 11:05:31

Tags: python pandas dataframe

Given the following df:

datetimeindex        store  sale   category  weekday
2018-10-13 09:27:01  gbn01  59.99  sporting  1
2018-10-13 09:27:01  gbn02  19.99  sporting  1
2018-10-13 09:27:02  gbn03  15.99  hygine    1
2018-10-13 09:27:03  gbn05  39.99  camping   1
....
2018-10-16 11:59:01  gbn01  19.99  other     0
2018-10-16 11:59:01  gbn02  49.99  sporting  0
2018-10-16 11:59:02  gbn03  10.00  food      0
2018-10-16 11:59:03  gbn05  89.99  electro   0
2018-10-16 12:30:03  gbn01  52.99
....
2018-10-16 21:05:03  gbn03  25.00  alcohol   0
2018-10-16 22:43:03  gbn01  10.05  health    0

Update

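After re-reading the requirements, mean_sales should be calculated up to each specific timestamp for that store within the period (08:00 to 18:00, or 12:00 to 13:30). My current idea is to implement the following pseudo-function, but it only works when the data is sorted by datetimeindex, store: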

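# Lunch_Time_Mean: running mean of sales inside the weekday lunch window.
# Only correct when df is sorted by datetimeindex, store, and it assumes
# datetimeindex is a datetime column.
from datetime import time

count = 0
lunch_sum_previous = 0.0
lunch_means = []
for r in df.itertuples():
    t = r.datetimeindex.time()
    lunch_hours = time(12, 0) <= t <= time(13, 30)
    week_day = r.weekday == 0  # weekday == 0 means Monday to Friday
    if lunch_hours and week_day:
        count += 1
        lunch_sum_previous += r.sale
        # mean of all sales seen so far in this window
        lunch_means.append(lunch_sum_previous / count)
    else:
        # outside the window: default to 1 and reset the accumulators
        lunch_means.append(1)
        count = 0
        lunch_sum_previous = 0.0
df['Lunch_Mean'] = lunch_means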

The logic above, mapped to a table:

datetimeindex       store    IsWorkingHour    count    sales    working_hour_sum    working_hour_cumsum    working_hour_mean_sales
13/10/2018 07:27    gbn01    0                0        39.18    0                   0                      1
13/10/2018 08:27    gbn01    1                1        31.69    31.69               31.69                  1
13/10/2018 09:27    gbn01    1                2        99.19    99.19               130.88                 1
13/10/2018 10:27    gbn01    1                3        25.89    25.89               156.77                 1
13/10/2018 11:27    gbn01    1                4        19.10    19.10               175.87                 1
13/10/2018 12:27    gbn01    1                5        82.51    82.51               258.38                 1
13/10/2018 13:27    gbn01    1                6        10.82    10.82               269.2                  1
13/10/2018 14:27    gbn01    1                7        10.43    10.43               279.63                 1
13/10/2018 15:27    gbn01    1                8        15.83    15.83               295.46                 1
13/10/2018 16:27    gbn01    1                9        12.53    12.53               307.99                 1
13/10/2018 17:27    gbn01    1                10       10.03    10.03               318.02                 1
13/10/2018 18:27    gbn01    0                0        54.14    0                   0                      1
13/10/2018 19:27    gbn01    0                0        20.04    0                   0                      1
# The entries above have working_hour_mean_sales of 1 because 13/10/2018 falls on a weekend.
16/10/2018 07:27    gbn01    0                0        13.34    0                   0                      1
16/10/2018 08:27    gbn01    1                1        15.84    15.84               15.84                  15.84
16/10/2018 09:27    gbn01    1                2        19.14    19.14               34.98                  17.49
16/10/2018 10:27    gbn01    1                3        11.64    11.64               46.62                  15.54
16/10/2018 11:27    gbn01    1                4        17.54    17.54               64.16                  16.04
16/10/2018 12:27    gbn01    1                5        20.84    20.84               85                     17
16/10/2018 13:27    gbn01    1                6        50.05    50.05               135.05                 22.51
16/10/2018 14:27    gbn01    1                7        10.05    10.05               145.1                  20.73
16/10/2018 15:27    gbn01    1                8        13.35    13.35               158.45                 19.81
16/10/2018 16:27    gbn01    1                9        32.55    32.55               191                    21.22
16/10/2018 17:27    gbn01    1                10       13.36    13.36               204.36                 20.44
16/10/2018 18:27    gbn01    0                0        10.86    0                   0                      1
16/10/2018 19:27    gbn01    0                0        20.06    0                   0                      1

Desired output

I am trying to use the above to generate a new df like this:

#I've simplified it to a single condition and store
datetimeindex       store    working_hour_mean_sales
13/10/2018 07:27    gbn01    1
13/10/2018 08:27    gbn01    1
13/10/2018 09:27    gbn01    1
13/10/2018 10:27    gbn01    1
13/10/2018 11:27    gbn01    1
13/10/2018 12:27    gbn01    1
13/10/2018 13:27    gbn01    1
13/10/2018 14:27    gbn01    1
13/10/2018 15:27    gbn01    1
13/10/2018 16:27    gbn01    1
13/10/2018 17:27    gbn01    1
13/10/2018 18:27    gbn01    1
13/10/2018 19:27    gbn01    1
# Above, working_hour_mean_sales = 1 because 13/10/2018 was a weekend
16/10/2018 07:27    gbn01    1
16/10/2018 08:27    gbn01    15.84
16/10/2018 09:27    gbn01    17.49
16/10/2018 10:27    gbn01    15.54
16/10/2018 11:27    gbn01    16.04
16/10/2018 12:27    gbn01    17
16/10/2018 13:27    gbn01    22.51
16/10/2018 14:27    gbn01    20.73
16/10/2018 15:27    gbn01    19.81
16/10/2018 16:27    gbn01    21.22
16/10/2018 17:27    gbn01    20.44
16/10/2018 18:27    gbn01    1
16/10/2018 19:27    gbn01    1

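"Working hours" are Monday to Friday 08:00-18:00, and the "weekday lunch peak" is 12:00-13:30.

(N.B. I did not make the counterintuitive (at least to me) decision that weekday = 0 means Monday to Friday.)

Any suggestions on how to implement this in pandas would be greatly appreciated!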

3 answers:

Answer 0: (score: 1)

You can use groupby(), agg() and between().

This aggregates the results for the Monday-to-Friday weekday lunch peak:

df[(df['datetimeindex'].dt.strftime('%H:%M:%S').between('12:00:00','13:30:00')) & (df['weekday']==0)].groupby(['store','category']).agg({'sale': 'mean'})

This aggregates the results for Monday-to-Friday working hours:

df[(df['datetimeindex'].dt.strftime('%H:%M:%S').between('08:00:00','18:00:00')) & (df['weekday']==0)].groupby(['store','category']).agg({'sale': 'mean'})
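The question's table asks for a per-row running mean rather than one mean per group. A minimal sketch of that, reusing the same between() mask, might look like the following. It assumes datetimeindex is a datetime column and that the running mean restarts for each store on each calendar day (the grouping keys are my assumption, not something stated in the question). The sample rows are copied from the 16/10/2018 part of the question's table.

```python
import pandas as pd

# Sample rows copied from the question's table (store gbn01, 16/10/2018).
df = pd.DataFrame({
    "datetimeindex": pd.to_datetime(
        ["2018-10-16 07:27", "2018-10-16 08:27", "2018-10-16 09:27",
         "2018-10-16 10:27", "2018-10-16 18:27"]),
    "store": ["gbn01"] * 5,
    "sale": [13.34, 15.84, 19.14, 11.64, 10.86],
    "weekday": [0] * 5,  # 0 means Monday to Friday, as in the question
})

df = df.sort_values(["store", "datetimeindex"])

# Working hours on a weekday, built the same way as the mask above.
in_hours = df["datetimeindex"].dt.strftime("%H:%M:%S").between("08:00:00", "18:00:00")
mask = in_hours & (df["weekday"] == 0)

# Assumption: the running mean restarts per store and per calendar day.
keys = [df["store"], df["datetimeindex"].dt.normalize()]

# Cumulative sum and count of the sales that fall inside the window.
running_sum = df["sale"].where(mask, 0).groupby(keys).cumsum()
running_cnt = mask.astype(int).groupby(keys).cumsum()

# Rows outside the window get the default value 1, as in the question.
df["working_hour_mean_sales"] = (running_sum / running_cnt).where(mask, 1).round(2)
print(df[["datetimeindex", "store", "working_hour_mean_sales"]])
```

For these sample rows the new column comes out as 1, 15.84, 17.49, 15.54 and 1, the same values as the corresponding rows of the question's table. Because everything is expressed with cumsum(), no Python-level loop over the rows is needed.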

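Answer 1: (score: 0)

Try splitting the data into batches and aggregating each batch. Finally, combine the results, divide by the number of entries, and put the result into the column you need.

There are many ways to batch the data, but following your example I would suggest grouping by category, computing everything per category, and then joining that onto the final table.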
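The steps described above could be sketched roughly as follows. The sample rows are made up, and the column names total, entries and category_mean_sale are hypothetical names chosen for this illustration.

```python
import pandas as pd

# Small made-up sample that uses the question's column names.
df = pd.DataFrame({
    "store": ["gbn01", "gbn02", "gbn03", "gbn01"],
    "sale": [59.99, 19.99, 15.99, 20.01],
    "category": ["sporting", "sporting", "hygine", "sporting"],
})

# 1. Batch by category and aggregate each batch: total and number of entries.
per_category = (
    df.groupby("category")["sale"]
      .agg(total="sum", entries="count")
      .reset_index()
)

# 2. Divide the total by the number of entries to get the mean per category.
per_category["category_mean_sale"] = per_category["total"] / per_category["entries"]

# 3. Join the per-category result back onto the original rows.
result = df.merge(
    per_category[["category", "category_mean_sale"]],
    on="category",
    how="left",
)
print(result)
```

Steps 1 and 2 are spelled out to mirror the description; in practice groupby("category")["sale"].transform("mean") would give the same column in one call.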

Hope this helps :)

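Answer 2: (score: 0)

This should guide you toward the logic you need. Basically, you define the new columns workinghours and weekdaylunchpeak and aggregate using sqlcode (there are other ways too).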

import pandas as pd
import pandasql as ps
from datetime import time  # time(8, 0) etc. is used for the window bounds below
import numpy as np

mydata = pd.DataFrame(data={'datetimeindex': ['13/10/2018 09:27:01','13/10/2018 09:27:02','13/10/2018 09:27:03','13/10/2018 09:27:04','16/10/2018 11:59:01','16/10/2018 11:59:02','16/10/2018 11:59:03','16/10/2018 11:59:04','16/10/2018 21:05:01','16/10/2018 22:43:01'],
                       'store': ['gbn01','gbn02','gbn03','gbn05','gbn01','gbn02','gbn03','gbn05','gbn03','gbn01'],                        
                       'sale': [59.99,19.99,15.99,39.99,19.99,49.99,10,89.99,25,10.05],
                       'category': ['sporting','sporting','hygine','camping','other','sporting','food','electro','alcohol','health'],
                       'weekday': [1,1,1,1,0,0,0,0,0,0] 
                       })

# the sample dates are written day first (dd/mm/yyyy)
mydata['datetimeindex'] = pd.to_datetime(mydata['datetimeindex'], dayfirst=True)
mydata['workinghours']=(
    np.where((mydata.datetimeindex.dt.time >= time(8,00))
             &
             (mydata.datetimeindex.dt.time<=time(18,00))
             &
             (mydata.weekday==0)
             , 1, 0))
mydata['weekdaylunchpeak']=(
    np.where((mydata.datetimeindex.dt.time >= time(12,00))
             &
             (mydata.datetimeindex.dt.time<=time(13,30))
             &
             (mydata.weekday==0)
             , 1, 0))

sqlcode = '''
SELECT
    store,
    category,
    -- no ELSE branch: rows outside the window become NULL, which avg() ignores
    avg(case when workinghours=1 then sale end) AS working_hours_mean_sales,
    avg(case when weekdaylunchpeak=1 then sale end) AS weekday_lunch_peak_mean_sales
FROM mydata
GROUP BY
    store,
    category
;
'''
newdf = ps.sqldf(sqlcode,locals()) 
newdf
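As one of the "other ways", the same kind of aggregation can probably be done in plain pandas without pandasql. The sketch below is only an illustration under stated assumptions: the four sample rows are made up for brevity, and rows outside a window are turned into NaN so that mean() ignores them rather than averaging them in as zeros.

```python
from datetime import time

import pandas as pd

# Made-up sample rows in the same shape as mydata above.
mydata = pd.DataFrame({
    "datetimeindex": pd.to_datetime(
        ["16/10/2018 11:59:01", "16/10/2018 12:30:03",
         "16/10/2018 22:43:03", "13/10/2018 09:27:01"],
        dayfirst=True),
    "store": ["gbn01", "gbn01", "gbn01", "gbn02"],
    "sale": [19.99, 52.99, 10.05, 59.99],
    "category": ["other", "other", "other", "sporting"],
    "weekday": [0, 0, 0, 1],  # 0 means Monday to Friday
})

t = mydata["datetimeindex"].dt.time
is_weekday = mydata["weekday"] == 0
working = (t >= time(8, 0)) & (t <= time(18, 0)) & is_weekday
lunch = (t >= time(12, 0)) & (t <= time(13, 30)) & is_weekday

value_cols = ["working_hours_mean_sales", "weekday_lunch_peak_mean_sales"]
newdf = (
    mydata.assign(
        # keep the sale only inside each window; NaN elsewhere
        working_hours_mean_sales=mydata["sale"].where(working),
        weekday_lunch_peak_mean_sales=mydata["sale"].where(lunch),
    )
    .groupby(["store", "category"])[value_cols]
    .mean()  # NaN values are skipped
    .reset_index()
)
print(newdf)
```

A store and category with no sales inside a window ends up with NaN in that column, which keeps "no data" distinguishable from a genuine mean of zero.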