python: split a pandas DataFrame by week or month and group the data based on these splits

Time: 2012-11-04 21:56:15

Tags: python datetime group-by dataframe pandas

DateOccurred    CostCentre  TimeDifference
03/09/2012  2073    28138
03/09/2012  6078    34844
03/09/2012  8273    31215
03/09/2012  8367    28160
03/09/2012  8959    32037
03/09/2012  9292    30118
03/09/2012  9532    34200
03/09/2012  9705    27240
03/09/2012  10085   31431
03/09/2012  10220   22555
04/09/2012  6078    41126
04/09/2012  7569    31101
04/09/2012  8273    30994
04/09/2012  8959    30064
04/09/2012  9532    34655
04/09/2012  9705    26475
04/09/2012  10085   31443
04/09/2012  10220   33970
05/09/2012  2073    28221
05/09/2012  6078    27894
05/09/2012  7569    29012
05/09/2012  8239    42208
05/09/2012  8273    31128
05/09/2012  8367    27993
05/09/2012  8959    20669
05/09/2012  9292    33070
05/09/2012  9532    8189
05/09/2012  9705    27540
05/09/2012  10085   28798
05/09/2012  10220   23164
06/09/2012  2073    28350
06/09/2012  6078    35648
06/09/2012  7042    27129
06/09/2012  7569    31546
06/09/2012  8239    39945
06/09/2012  8273    31107
06/09/2012  8367    27795
06/09/2012  9292    32974
06/09/2012  9532    30320
06/09/2012  9705    37462
06/09/2012  10085   31703
06/09/2012  10220   7807
06/09/2012  14573   186
07/09/2012  0   0
07/09/2012  0   0
07/09/2012  2073    28036
07/09/2012  6078    31969
07/09/2012  7569    32941
07/09/2012  8273    30073
07/09/2012  8367    29391
07/09/2012  9292    31927
07/09/2012  9532    30127
07/09/2012  9705    27604
07/09/2012  10085   28108
08/09/2012  2073    28463
10/09/2012  6078    31266
10/09/2012  8239    16390
10/09/2012  8273    31140
10/09/2012  8959    30858
10/09/2012  9532    30794
10/09/2012  9705    28752
11/09/2012  0   0
11/09/2012  0   0
11/09/2012  0   0
11/09/2012  0   0
11/09/2012  0   0
11/09/2012  2073    28159
11/09/2012  6078    36835
11/09/2012  8239    45354
11/09/2012  8273    30922
11/09/2012  8367    31382
11/09/2012  8959    29670
11/09/2012  9292    33582
11/09/2012  9705    29394
11/09/2012  10085   17140
12/09/2012  2073    28283
12/09/2012  6078    31139
12/09/2012  7042    35063
12/09/2012  8273    31075
12/09/2012  8367    29795
12/09/2012  9292    33496
12/09/2012  9532    31669
12/09/2012  9705    26166
12/09/2012  10085   29889
12/09/2012  10220   35656
13/09/2012  2073    28144
13/09/2012  6078    30544
13/09/2012  7097    30866
13/09/2012  8273    30772
13/09/2012  8367    32387
13/09/2012  8959    29307
13/09/2012  9292    32348
13/09/2012  9532    28137
13/09/2012  9705    28823
13/09/2012  10085   31543
13/09/2012  10220   28293
14/09/2012  0   12433
14/09/2012  0   12434
14/09/2012  0   12434
14/09/2012  0   12434
14/09/2012  0   12434
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   12433
14/09/2012  0   0
14/09/2012  0   12433
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   1720
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   384
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   0
14/09/2012  0   383
14/09/2012  2073    28438
14/09/2012  6078    27255
14/09/2012  8273    29989
14/09/2012  8959    26892
14/09/2012  9292    33202
14/09/2012  9532    30862
14/09/2012  9705    26857
14/09/2012  10085   32657
14/09/2012  10220   27296
15/09/2012  6078    3832
17/09/2012  6078    30004
17/09/2012  7569    30390
17/09/2012  8239    41421
17/09/2012  8273    26337
17/09/2012  8367    31631
17/09/2012  8959    17989
17/09/2012  9292    35703
17/09/2012  9532    36542
17/09/2012  9705    27488
17/09/2012  10085   30849
17/09/2012  10220   32575
18/09/2012  2073    28293
18/09/2012  6078    27450
18/09/2012  7569    30323
18/09/2012  8239    38481
18/09/2012  8273    31154
18/09/2012  8367    27944
18/09/2012  8959    28196
18/09/2012  9292    30844
18/09/2012  9532    33128
18/09/2012  9705    32100
19/09/2012  2073    28227
19/09/2012  6078    32243
19/09/2012  7569    29041
19/09/2012  8239    42791
19/09/2012  8273    30966
19/09/2012  8367    26420
19/09/2012  8959    29394
19/09/2012  9292    14865
19/09/2012  9532    23618
19/09/2012  10085   31614
19/09/2012  10220   8686
20/09/2012  2073    28260
20/09/2012  6078    30446
20/09/2012  7097    34909
20/09/2012  7569    30869
20/09/2012  8273    31079
20/09/2012  8367    30162
20/09/2012  9292    13104
20/09/2012  9532    36614
20/09/2012  9705    35617
20/09/2012  10085   31821
20/09/2012  10220   30055
20/09/2012  14573   468
21/09/2012  0   0
21/09/2012  0   0
21/09/2012  0   0
21/09/2012  0   0
21/09/2012  0   0
21/09/2012  0   0
21/09/2012  0   0
21/09/2012  0   0
21/09/2012  0   0
21/09/2012  0   3
21/09/2012  0   0
21/09/2012  0   0
21/09/2012  0   3
21/09/2012  2073    28308
21/09/2012  6078    33833
21/09/2012  7569    32335
21/09/2012  9292    33824
21/09/2012  9532    33376
21/09/2012  10220   21002
22/09/2012  2073    28402
23/09/2012  2073    28109
24/09/2012  2073    28431
24/09/2012  6078    30027
24/09/2012  7097    31914
24/09/2012  8239    35617
24/09/2012  8273    30670
24/09/2012  8367    29084
24/09/2012  8959    31023
24/09/2012  9292    34394
24/09/2012  9532    31255
24/09/2012  9705    18758
24/09/2012  10085   29290
24/09/2012  10220   33230
25/09/2012  2073    28506
25/09/2012  6078    32043
25/09/2012  7042    34953
25/09/2012  7569    30898
25/09/2012  8239    41297
25/09/2012  8273    31012
25/09/2012  8367    29645
25/09/2012  8959    29904
25/09/2012  9532    37875
25/09/2012  9705    13280
25/09/2012  10085   35023
25/09/2012  10220   31359
26/09/2012  2073    28388
26/09/2012  6078    29765
26/09/2012  7097    31561
26/09/2012  7569    29151
26/09/2012  8239    40369
26/09/2012  8367    28174
26/09/2012  8959    26554
26/09/2012  9292    32104
26/09/2012  9532    33194
26/09/2012  9705    30377
26/09/2012  10085   31503
26/09/2012  10220   28310
27/09/2012  0   0
27/09/2012  0   0
27/09/2012  0   0
27/09/2012  0   0
27/09/2012  0   0
27/09/2012  0   0
27/09/2012  0   0
27/09/2012  0   0
27/09/2012  2073    28491
27/09/2012  6078    31137
27/09/2012  8239    38403
27/09/2012  8273    31117
27/09/2012  8367    28462
27/09/2012  9292    32387
27/09/2012  9532    23023
27/09/2012  9705    32790
27/09/2012  10085   33460
27/09/2012  10220   31782
28/09/2012  0   161
28/09/2012  2073    28381
28/09/2012  7569    32322
28/09/2012  8239    38362
28/09/2012  8273    30533
28/09/2012  8959    17128
28/09/2012  9292    32484
28/09/2012  9532    18586
28/09/2012  9705    27902
29/09/2012  2073    28583
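  1. Above is a sample of a data frame with one million records.
  2. How can it be sliced or grouped by week or month, summing the seconds column (TimeDifference) per cost centre?
  3. I have read and tried 30 articles on this site, found by searching for pandas, python, groupby, split, dataframe and week, without success.
  4. I am using Python 2.7 and pandas 0.9.
  5. I have read the time series / date functionality section of the pandas 0.9 tutorial but could not make anything work with a data frame. I would like to use functionality such as business weeks.

    Expected output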

    DateOccurred CostCentre TimeDifference
    2012-03-11            0         500000
    2012-03-11         2073         570000
    2012-03-18            0         650000
    2012-03-18         2073         425000 
    2012-03-25            0         378000
    2012-04-25         2073         480000
    

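2 answers:

Answer 0 (score: 3)

Here is one way to take the input (as text) and group it the way you want. The key is to use a dictionary for each level of grouping (date, then centre).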

import collections
import datetime
import functools

def delta_totals_by_date_and_centre(in_file):
    # Use a defaultdict instead of a normal dict so that missing values are
    # automatically created. by_date is a mapping (dict) from a tuple of (year, week)
    # to another mapping (dict) from centre to total delta time.
    by_date = collections.defaultdict(functools.partial(collections.defaultdict, int))

    # For each line in the input...
    for line in in_file:
        # Parse the three fields of each line into date, int ,int.
        date, centre, delta = line.split()
        date = datetime.datetime.strptime(date, "%d/%m/%Y").date()
        centre = int(centre)
        delta = int(delta)

        # Determine the year and week of the year.
        year, week, weekday = date.isocalendar()
        year_and_week = year, week

        # Add the time delta.
        by_date[year_and_week][centre] += delta

    # Yield each result, in order.
    for year_and_week, by_centre in sorted(by_date.items()):
        for centre, delta in sorted(by_centre.items()):
            yield year_and_week, centre, delta

For your sample input, it produces this output (the first column is year-week_of_the_year).

2012-36     0      0
2012-36  2073 141208
2012-36  6078 171481
2012-36  7042  27129
2012-36  7569 124600
2012-36  8239  82153
2012-36  8273 154517
2012-36  8367 113339
2012-36  8959  82770
2012-36  9292 128089
2012-36  9532 137491
2012-36  9705 146321
2012-36 10085 151483
2012-36 10220  87496
2012-36 14573    186
2012-37     0  89522
2012-37  2073 113024
2012-37  6078 160871
2012-37  7042  35063
2012-37  7097  30866
2012-37  8239  61744
2012-37  8273 153898
2012-37  8367  93564
2012-37  8959 116727
2012-37  9292 132628
2012-37  9532 121462
2012-37  9705 139992
2012-37 10085 111229
2012-37 10220  91245
2012-38     0      6
2012-38  2073 169599
2012-38  6078 153976
2012-38  7097  34909
2012-38  7569 152958
2012-38  8239 122693
2012-38  8273 119536
2012-38  8367 116157
2012-38  8959  75579
2012-38  9292 128340
2012-38  9532 163278
2012-38  9705  95205
2012-38 10085  94284
2012-38 10220  92318
2012-38 14573    468
2012-39     0    161
2012-39  2073 170780
2012-39  6078 122972
2012-39  7042  34953
2012-39  7097  63475
2012-39  7569  92371
2012-39  8239 194048
2012-39  8273 123332
2012-39  8367 115365
2012-39  8959 104609
2012-39  9292 131369
2012-39  9532 143933
2012-39  9705 123107
2012-39 10085 129276
2012-39 10220 124681
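
The question also asks about grouping by month. The following is a minimal sketch of the same dictionary idea, keyed by (year, month, centre) instead of the ISO (year, week). The function name delta_totals_by_month_and_centre is made up for illustration and is not part of the original answer; the sample lines are copied from the question's data.

```python
import collections
import datetime


def delta_totals_by_month_and_centre(lines):
    """Sum the delta column per (year, month, centre).

    Hypothetical helper for illustration; assumes whitespace-separated
    lines of the form 'dd/mm/yyyy centre delta'.
    """
    totals = collections.defaultdict(int)
    for line in lines:
        fields = line.split()
        # Skip blank lines and the header row.
        if len(fields) != 3 or not fields[1].isdigit():
            continue
        date = datetime.datetime.strptime(fields[0], "%d/%m/%Y").date()
        totals[(date.year, date.month, int(fields[1]))] += int(fields[2])
    # Sorted by year, then month, then centre.
    return sorted(totals.items())


sample = [
    "DateOccurred    CostCentre  TimeDifference",
    "03/09/2012  2073    28138",
    "04/09/2012  6078    41126",
    "05/09/2012  2073    28221",
]

for (year, month, centre), total in delta_totals_by_month_and_centre(sample):
    print("%d-%02d %5d %6d" % (year, month, centre, total))
```

Because the key is a plain tuple, switching between weekly and monthly totals only means changing how the key is derived from the date.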

Answer 1 (score: 3)

Perhaps group by CostCentre first, then use the Series/DataFrame resample() method:

In [72]: centers = {}

In [73]: for center, idx in df.groupby("CostCentre").groups.iteritems():
   ....:     timediff = df.ix[idx].set_index("Date")['TimeDifference']
   ....:     centers[center] = timediff.resample("W", how=sum)

In [77]: pd.concat(centers, names=['CostCentre'])
Out[77]: 
CostCentre  Date      
0           2012-09-09         0
            2012-09-16     89522
            2012-09-23         6
            2012-09-30       161
2073        2012-09-09    141208
            2012-09-16    113024
            2012-09-23    169599
            2012-09-30    170780
6078        2012-09-09    171481
            2012-09-16    160871
            2012-09-23    153976
            2012-09-30    122972
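
The session above targets pandas 0.9 on Python 2.7, as stated in the question. In recent pandas releases, .ix and the how= argument of resample() are no longer available, so here is a rough equivalent, assuming a current pandas on Python 3. It groups by cost centre and by week in one step with pd.Grouper; the handful of rows is copied from the question's sample and the variable names are illustrative.

```python
import io

import pandas as pd

# A few rows from the question's sample data.
raw = """DateOccurred CostCentre TimeDifference
03/09/2012 2073 28138
04/09/2012 6078 41126
05/09/2012 2073 28221
10/09/2012 6078 31266
11/09/2012 2073 28159
"""

df = pd.read_csv(io.StringIO(raw), sep=r"\s+")
# Dates are day-first, so parse them with an explicit format.
df["DateOccurred"] = pd.to_datetime(df["DateOccurred"], format="%d/%m/%Y")

# Group by cost centre and by calendar week ("W" labels each week by its
# closing Sunday), then sum the seconds column.
weekly = (
    df.groupby(["CostCentre", pd.Grouper(key="DateOccurred", freq="W")])
      ["TimeDifference"]
      .sum()
)
print(weekly)
```

The week labels (2012-09-09, 2012-09-16, ...) line up with the ones in the output above, since "W" was also the frequency used there.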

Additional details:

When parse_dates is True for the pd.read_* functions, index_col must also be set:

In [28]: df = pd.read_clipboard(sep=' +', parse_dates=True, index_col=0,
   ....:                        dayfirst=True)

In [30]: df.head()
Out[30]: 
              CostCentre  TimeDifference
DateOccurred                            
2012-09-03          2073           28138
2012-09-03          6078           34844
2012-09-03          8273           31215
2012-09-03          8367           28160
2012-09-03          8959           32037

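Since resample() requires a frame/series with a TimeSeries index, setting the index at creation time removes the need to set it separately for each group. GroupBy objects also have an apply method, which is essentially syntactic sugar for the "combine" step done above with pd.concat().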

In [37]: x = df.groupby("CostCentre").apply(lambda df: 
   ....:         df['TimeDifference'].resample("W", how=sum))

In [38]: x.head(12)
Out[38]: 
CostCentre  DateOccurred
0           2012-09-09           0
            2012-09-16       89522
            2012-09-23           6
            2012-09-30         161
2073        2012-09-09      141208
            2012-09-16      113024
            2012-09-23      169599
            2012-09-30      170780
6078        2012-09-09      171481
            2012-09-16      160871
            2012-09-23      153976
            2012-09-30      122972
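
For readers on a current pandas (an assumption; the answer itself was written for 0.9), the apply/resample pattern can be sketched by chaining resample() directly on the grouped series, and the monthly case from the question by grouping on a monthly period. The rows are taken from the question's sample; the variable names are illustrative.

```python
import io

import pandas as pd

# A few rows from the question's sample data.
raw = """DateOccurred CostCentre TimeDifference
03/09/2012 2073 28138
04/09/2012 6078 41126
05/09/2012 2073 28221
29/09/2012 2073 28583
"""

df = pd.read_csv(io.StringIO(raw), sep=r"\s+")
df["DateOccurred"] = pd.to_datetime(df["DateOccurred"], format="%d/%m/%Y")

# Weekly totals per cost centre. resample() needs a datetime index, so set
# it first. Weeks with no rows between a centre's first and last date show
# up with a sum of 0.
weekly = (
    df.set_index("DateOccurred")
      .groupby("CostCentre")["TimeDifference"]
      .resample("W")
      .sum()
)

# Monthly totals per cost centre, grouping on the calendar month of each date.
monthly = (
    df.groupby(["CostCentre", df["DateOccurred"].dt.to_period("M")])
      ["TimeDifference"]
      .sum()
)

print(weekly)
print(monthly)
```

All sample rows fall in September 2012, so the monthly result has a single entry per cost centre.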