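python - GroupBy sum apportioned across a start and end date range

Time: 2019-01-07 23:45:24

Tags: python python-3.x pandas datetime dataframe

As part of learning Python, I am looking at this drone rental dataset and trying to GroupBy the Result column to show each drone's yield per month.

If each result were tied to a single date I could normally do this, but because this is a long-term rental business, I need to work out how much of each result is attributable to every month between the start date and the end date.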

+-------+------------------+------------------+--------+
| Drone | Start            | End              | Result |
+-------+------------------+------------------+--------+
| DR1   | 16/06/2013 10:30 | 22/08/2013 07:00 |   2786 |
| DR1   | 20/04/2013 23:30 | 16/06/2013 10:30 |   7126 |
| DR1   | 24/01/2013 23:00 | 20/04/2013 23:30 |   2964 |
| DR2   | 01/03/2014 19:00 | 07/05/2014 18:00 |   8884 |
| DR2   | 04/09/2015 09:00 | 04/11/2015 07:00 |   7828 |
| DR2   | 04/10/2013 05:00 | 24/12/2013 07:00 |   5700 |
+-------+------------------+------------------+--------+

I can find the difference between the dates using:

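import pandas as pd
from dateutil.relativedelta import relativedelta

# The dates are written DD/MM/YYYY, so parse them day-first
df['Start'] = pd.to_datetime(df['Start'], dayfirst=True)
df['End'] = pd.to_datetime(df['End'], dayfirst=True)
a = df.loc[0, 'Start']
b = df.loc[0, 'End']
relativedelta(a, b)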

But the output looks like this:

relativedelta(months=-2, days=-5, hours=-20, minutes=-30)
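
The negative fields come from the argument order: `relativedelta(a, b)` describes the step from `b` back to `a`. As a minimal sketch, using the first row's dates as plain `datetime` values (the variable names here are illustrative only), putting the end first gives a positive delta, and an ordinary subtraction gives the total length that a proportional split would need:

```python
from datetime import datetime
from dateutil.relativedelta import relativedelta

# First row of the sample data, written out as plain datetimes.
start = datetime(2013, 6, 16, 10, 30)
end = datetime(2013, 8, 22, 7, 0)

# relativedelta(dt1, dt2) is the delta that takes dt2 to dt1,
# so the later date goes first to get positive fields.
delta = relativedelta(end, start)
print(delta)

# Plain subtraction yields a timedelta; its total length in days is
# the denominator for apportioning Result across months.
total_days = (end - start).total_seconds() / 86400
print(total_days)
```

The calendar-style fields of `relativedelta` are convenient for display, but the elapsed-time total is what makes a per-month proportion well defined.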

And I cannot use it with GroupBy to compute the cash attribution the way I could if the dataset had only a single date:

df.groupby(['Device', 'Date']).agg(sum)['Result']
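
For comparison, the single-date case might look like the sketch below. The frame `df_single` and its values are made up for illustration and are not part of the rental dataset; the grouping key is the device plus the calendar month of each date.

```python
import pandas as pd

# Hypothetical single-date data: each result belongs to exactly one date.
df_single = pd.DataFrame({
    "Device": ["DR1", "DR1", "DR2"],
    "Date": pd.to_datetime(["2013-06-16", "2013-06-20", "2014-03-01"]),
    "Result": [100, 50, 70],
})

# Group by device and by the month each date falls in, then sum.
monthly_single = (
    df_single
    .groupby(["Device", df_single["Date"].dt.to_period("M")])["Result"]
    .sum()
)
print(monthly_single)
```

With a start and an end date per row there is no single month to group on, which is why each row first has to be split into per-month pieces.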

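I would appreciate some help with the right way to think about this kind of problem, and with what the code would look like.

Taking the first example for each drone, my expected output would be: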

+-------+--------+------+--------+
| Drone | Month  | Days | Result |
+-------+--------+------+--------+
| DR1   | June   | X    | $YY    |
| DR1   | July   | X    | $YY    |
| DR1   | August | X    | $YY    |
| DR2   | March  | Y    | $ZZ    |
| DR2   | April  | Y    | $ZZ    |
| DR2   | May    | Y    | $ZZ    |
+-------+--------+------+--------+

Thanks.

1 Answer:

Answer 0: (score: 3)

This is a loop-based solution, but I think it does what you want.

# Just load the sample data
from io import StringIO
data = 'Drone,Start,End,Result\n' + \
    'DR1,16/06/2013 10:30,22/08/2013 07:00,2786\n' + \
    'DR1,20/04/2013 23:30,16/06/2013 10:30,7126\n' + \
    'DR1,24/01/2013 23:00,20/04/2013 23:30,2964\n' + \
    'DR2,01/03/2014 19:00,07/05/2014 18:00,8884\n' + \
    'DR2,04/09/2015 09:00,04/11/2015 07:00,7828\n' + \
    'DR2,04/10/2013 05:00,24/12/2013 07:00,5700\n'
stream = StringIO(data)

# Actual solution
import pandas as pd
from datetime import datetime

df = pd.read_csv(stream, sep=',', parse_dates=[1, 2])

def get_month_spans(row):
    """Split one rental into per-month pieces, sharing Result by elapsed time."""
    month_spans = []
    start = row['Start']
    total_delta = (row['End'] - row['Start']).total_seconds()
    while row['End'] > start:
        # This piece ends at the first instant of the next month ...
        if start.month != 12:
            end = datetime(year=start.year, month=start.month+1, day=1)
        else:
            end = datetime(year=start.year+1, month=1, day=1)
        # ... or at the end of the rental, whichever comes first
        if end > row['End']:
            end = row['End']
        delta = (end - start).total_seconds()
        # Share of Result proportional to the time spent in this month
        proportional = row['Result'] * (delta / total_delta)
        month_spans.append({'Drone': row['Drone'],
                            'Month': datetime(year=start.year,
                                              month=start.month,
                                              day=1),
                            'Result': proportional,
                            'Days': delta / (24 * 3600)})
        start = end
    return month_spans

month_spans = []
for index, row in df.iterrows():
    month_spans += get_month_spans(row)
# Add up the pieces that fall in the same month for the same drone
monthly = pd.DataFrame(month_spans).groupby(['Drone', 'Month'])[['Result', 'Days']].sum()

print(monthly)

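This outputs each drone's monthly result along with the number of days. Note that in this run the ambiguous dates were parsed month-first (for example, 01/03/2014 was read as 3 January 2014), which is why the DR2 rows cover April to December 2013, January to July 2014, and April 2015; passing dayfirst=True to read_csv makes pandas read them as DD/MM/YYYY. The output from this run: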

                       Result       Days
Drone Month                             
DR1   2013-01-01   242.633083   7.041667
      2013-02-01   964.789537  28.000000
      2013-03-01  1068.159845  31.000000
      2013-04-01  1953.216797  30.000000
      2013-05-01  3912.726199  31.000000
      2013-06-01  2555.334620  30.000000
      2013-07-01  1291.856653  31.000000
      2013-08-01   887.283266  21.291667
DR2   2013-04-01   459.202454  20.791667
      2013-05-01   684.662577  31.000000
      2013-06-01   662.576687  30.000000
      2013-07-01   684.662577  31.000000
      2013-08-01   684.662577  31.000000
      2013-09-01   662.576687  30.000000
      2013-10-01   684.662577  31.000000
      2013-11-01   662.576687  30.000000
      2013-12-01   514.417178  23.291667
      2014-01-01  1369.726258  28.208333
      2014-02-01  1359.610112  28.000000
      2014-03-01  1505.282624  31.000000
      2014-04-01  1456.725120  30.000000
      2014-05-01  1505.282624  31.000000
      2014-06-01  1456.725120  30.000000
      2014-07-01   230.648144   4.750000
      2015-04-01  7828.000000   1.916667
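
Because the sample dates are written DD/MM/YYYY, a variant that parses them with `dayfirst=True` may be closer to what the question intends. The sketch below is a rewrite under that assumption, not the answer's original code: the generator name `month_spans` and the use of `Period` arithmetic to find month boundaries are choices made here for illustration.

```python
from io import StringIO

import pandas as pd

data = """Drone,Start,End,Result
DR1,16/06/2013 10:30,22/08/2013 07:00,2786
DR1,20/04/2013 23:30,16/06/2013 10:30,7126
DR1,24/01/2013 23:00,20/04/2013 23:30,2964
DR2,01/03/2014 19:00,07/05/2014 18:00,8884
DR2,04/09/2015 09:00,04/11/2015 07:00,7828
DR2,04/10/2013 05:00,24/12/2013 07:00,5700
"""

# Assumption: every date in the sample is day-first (DD/MM/YYYY).
df = pd.read_csv(StringIO(data), parse_dates=["Start", "End"], dayfirst=True)


def month_spans(row):
    """Yield one record per calendar month touched by a rental."""
    total = (row["End"] - row["Start"]).total_seconds()
    start = row["Start"]
    while start < row["End"]:
        month = start.to_period("M")
        # First instant of the following month, capped at the rental end.
        end = min((month + 1).to_timestamp(), row["End"])
        seconds = (end - start).total_seconds()
        yield {
            "Drone": row["Drone"],
            "Month": month.to_timestamp(),
            "Days": seconds / 86400,
            # Share of Result proportional to the time spent in this month.
            "Result": row["Result"] * seconds / total,
        }
        start = end


records = [rec for _, row in df.iterrows() for rec in month_spans(row)]
monthly = (
    pd.DataFrame(records)
    .groupby(["Drone", "Month"])[["Days", "Result"]]
    .sum()
)
print(monthly)
```

Under this parsing the DR2 rentals land in March to May 2014, September to November 2015, and October to December 2013, and the per-month results still add up to the original totals.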