在特定时间范围内来自pandas数据帧的值的总和

时间:2017-09-28 03:41:16

标签: python pandas datetime dataframe group-by

我有一个相当复杂的问题。我需要在一组开始和结束日期内从数据框中选择行,然后对这些值求和并将它们放在新的数据帧中。

所以我从数据框df开始:

import random
dates = pd.date_range('20150101 020000',periods=1000)
df = pd.DataFrame({'_id': random.choice(range(0, 1000)),
                   'time_stamp': dates,
                   'value': random.choice(range(2,60))
                  })

并定义一些开始和结束日期:

import pandas as pd
start_date = ["2-13-16", "2-23-16", "3-17-16", "3-24-16", "3-26-16", "5-17-16", "5-25-16", "10-10-16", "10-18-16", "10-23-16", "10-31-16", "11-7-16", "11-14-16", "11-22-16", "1-23-17", "1-29-17", "2-06-17", "3-11-17", "3-23-17", "6-21-17", "6-28-17"]
end_date = pd.DatetimeIndex(start_date) + pd.DateOffset(7)

然后需要发生的是我需要创建一个新的数据框,其中weekly_sum汇总了value dfstart_dateend_datevalues

因此,例如,新数据帧的第一行将返回{3}在2-13-16和2-20-16之间的总和。我想我会使用groupby.sum()或类似的东西。

可能看起来像这样:

id      start_date   end_date    weekly_sum
65      2016-02-13   2016-02-20  100

非常感谢任何方向!

P.S。我知道我对random.choice的使用有点不稳定,所以如果你有更好的方法来生成随机数,我很乐意看到它!

2 个答案:

答案 0 :(得分:2)

您可以使用

def get_dates(x):
    # Select the df values between start and ending datetime. 
    n = df[(df['time_stamp']>x['start'])&(df['time_stamp']<x['end'])]
    # Return first id and sum of values
    return n['id'].values[0],n['value'].sum()

dates = pd.date_range('20150101 020000',periods=1000)

df = pd.DataFrame({'id': np.random.randint(0,1000,size=(1000,)),
               'time_stamp': dates,
               'value': np.random.randint(2,60,size=(1000,))
              })

ndf = pd.DataFrame({'start':pd.to_datetime(start_date),'end':end_date})
#Unpack and assign values to id and value column
ndf[['id','value']] = ndf.apply(lambda x : get_dates(x),1).apply(pd.Series)
print(df.head(5))
   id          time_stamp  value
0  770 2015-01-01 02:00:00     59
1  781 2015-01-02 02:00:00     32
2  761 2015-01-03 02:00:00     40
3  317 2015-01-04 02:00:00     16
4  538 2015-01-05 02:00:00     20

print(ndf.head(5))

         end      start   id  value
0 2016-02-20 2016-02-13  569    221
1 2016-03-01 2016-02-23   28    216
2 2016-03-24 2016-03-17  152    258
3 2016-03-31 2016-03-24  892    265
4 2016-04-02 2016-03-26  606    244

答案 1 :(得分:0)

您可以使用以下代码计算每周摘要。以下代码基于星期一。

import pandas as pd
import random

dates = pd.date_range('20150101 020000',periods=1000)
df = pd.DataFrame({'_id': random.choice(range(0, 1000)),
                   'time_stamp': dates,
                   'value': random.choice(range(2,60))
                  })

df['day_of_week'] = df['time_stamp'].dt.weekday_name
df['start'] = np.where(df["day_of_week"]=="Monday", 1, 0)
df['week'] = df["start"].cumsum()
# It is based on Monday.
df.head(20)
# Out[109]:
#     _id          time_stamp  value day_of_week  start  week
# 0   396 2015-01-01 02:00:00     59    Thursday      0     0
# 1   396 2015-01-02 02:00:00     59      Friday      0     0
# 2   396 2015-01-03 02:00:00     59    Saturday      0     0
# 3   396 2015-01-04 02:00:00     59      Sunday      0     0
# 4   396 2015-01-05 02:00:00     59      Monday      1     1
# 5   396 2015-01-06 02:00:00     59     Tuesday      0     1
# 6   396 2015-01-07 02:00:00     59   Wednesday      0     1
# 7   396 2015-01-08 02:00:00     59    Thursday      0     1
# 8   396 2015-01-09 02:00:00     59      Friday      0     1
# 9   396 2015-01-10 02:00:00     59    Saturday      0     1
# 10  396 2015-01-11 02:00:00     59      Sunday      0     1
# 11  396 2015-01-12 02:00:00     59      Monday      1     2
# 12  396 2015-01-13 02:00:00     59     Tuesday      0     2
# 13  396 2015-01-14 02:00:00     59   Wednesday      0     2
# 14  396 2015-01-15 02:00:00     59    Thursday      0     2
# 15  396 2015-01-16 02:00:00     59      Friday      0     2
# 16  396 2015-01-17 02:00:00     59    Saturday      0     2
# 17  396 2015-01-18 02:00:00     59      Sunday      0     2
# 18  396 2015-01-19 02:00:00     59      Monday      1     3
# 19  396 2015-01-20 02:00:00     59     Tuesday      0     3

aggfunc = {"time_stamp": [np.min, np.max], "value": [np.sum]}
df2 = df.groupby("week", as_index=False).agg(aggfunc)
df2.columns = ["week", "start_date", "end_date", "weekly_sum"]
df2.iloc[58:61]
# Out[110]:
#     week          start_date            end_date  weekly_sum
# 58    58 2016-02-08 02:00:00 2016-02-14 02:00:00         413
# 59    59 2016-02-15 02:00:00 2016-02-21 02:00:00         413
# 60    60 2016-02-22 02:00:00 2016-02-28 02:00:00         413