Slicing pandas groupby groups into equal lengths

Time: 2015-08-31 16:58:45

Tags: python pandas

I have a pandas groupby DataFrame, shown below, grouped on page:

+---------+------+-------+
|  page   | hour | count |
+---------+------+-------+
| 3727441 |    1 |  2003 |
| 3727441 |    2 |   654 |
| 3727441 |    3 |  5434 |
| 3727458 |    1 |   326 |
| 3727458 |    2 |  2348 |
| 3727458 |    3 |  4040 |
| 3727458 |    4 |   374 |
| 3727458 |    5 |  2917 |
| 3727458 |    6 |  3937 |
| 3735634 |    1 |  1957 |
| 3735634 |    2 |  2398 |
| 3735634 |    3 |  2812 |
| 3768433 |    1 |   499 |
| 3768433 |    2 |  4924 |
| 3768433 |    3 |  5460 |
| 3768433 |    4 |  1710 |
| 3768433 |    5 |  3877 |
| 3768433 |    6 |  1912 |
| 3768433 |    7 |  1367 |
| 3768433 |    8 |  1626 |
| 3768433 |    9 |  4750 |
+---------+------+-------+

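You will notice that some groups have 3, 6, or 9 hourly rows. What I want to do is slice each group into chunks of at most 3 hours, and append something to the page value in the 6- and 9-row groups to show that it is still the same page, as below: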

+-----------+------+-------+
|   page    | hour | count |
+-----------+------+-------+
| 3727441   |    1 |  2003 |
| 3727441   |    2 |   654 |
| 3727441   |    3 |  5434 |
| 3727458   |    1 |   326 |
| 3727458   |    2 |  2348 |
| 3727458   |    3 |  4040 |
| 3727458_1 |    4 |   374 |
| 3727458_1 |    5 |  2917 |
| 3727458_1 |    6 |  3937 |
| 3735634   |    1 |  1957 |
| 3735634   |    2 |  2398 |
| 3735634   |    3 |  2812 |
| 3768433   |    1 |   499 |
| 3768433   |    2 |  4924 |
| 3768433   |    3 |  5460 |
| 3768433_1 |    4 |  1710 |
| 3768433_1 |    5 |  3877 |
| 3768433_1 |    6 |  1912 |
| 3768433_2 |    7 |  1367 |
| 3768433_2 |    8 |  1626 |
| 3768433_2 |    9 |  4750 |
+-----------+------+-------+

I started by trying to do this with enumerate:

for name, group in hourly_groups:
    # Note: iterating over a DataFrame yields its column labels, not its rows
    for i, x in enumerate(group):
        print(x)

But it did not return the correct groups.

I also tried the following:

for k, g in df_hourly.groupby(df_hourly['page'] - 3):
    print(g)

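Edit:

I truncated my data because I thought it would be easier to understand, but the solutions given do not seem to work on my actual dataset. Here is a sample of the actual dataset, where page 3694750 is an example of a page that needs to be split into 34 groups. http://www.sharecsv.com/s/b2dbe8e49d6a2481de138f6ca06c679e/test.csv

3 Answers:

Answer 0: (score: 2)

This does what you want, using the df.apply method: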

import pandas as pd

cols = ['page', 'hour', 'count']
data = [
    (3727441,    1,  2003),
    (3727441,    2,   654),
    (3727441,    3,  5434),
    (3727458,    1,   326),
    (3727458,    2,  2348),
    (3727458,    3,  4040),
    (3727458,    4,   374),
    (3727458,    5,  2917),
    (3727458,    6,  3937),
    (3735634,    1,  1957),
    (3735634,    2,  2398),
    (3735634,    3,  2812),
    (3768433,    1,   499),
    (3768433,    2,  4924),
    (3768433,    3,  5460),
    (3768433,    4,  1710),
    (3768433,    5,  3877),
    (3768433,    6,  1912),
    (3768433,    7,  1367),
    (3768433,    8,  1626),
    (3768433,    9,  4750),
]

df = pd.DataFrame.from_records(data, columns=cols)

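def f(row):
    # Integer division: hours 1-3 -> 0, 4-6 -> 1, 7-9 -> 2, ...
    # (plain / is float division in Python 3 and would suffix hours 2 and 3 too)
    n = (row.hour - 1) // 3
    if n > 0:
        return '{0}_{1}'.format(row.page, n)
    else:
        # Return a string here as well so the column has a single type
        return str(row.page)

df['page'] = df.apply(f, axis=1)

print(df)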

Output:

 #       page  hour  count
 # 0     3727441     1   2003
 # 1     3727441     2    654
 # 2     3727441     3   5434
 # 3     3727458     1    326
 # 4     3727458     2   2348
 # 5     3727458     3   4040
 # 6   3727458_1     4    374
 # 7   3727458_1     5   2917
 # 8   3727458_1     6   3937
 # 9     3735634     1   1957
 # 10    3735634     2   2398
 # 11    3735634     3   2812
 # 12    3768433     1    499
 # 13    3768433     2   4924
 # 14    3768433     3   5460
 # 15  3768433_1     4   1710
 # 16  3768433_1     5   3877
 # 17  3768433_1     6   1912
 # 18  3768433_2     7   1367
 # 19  3768433_2     8   1626
 # 20  3768433_2     9   4750
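
The rule above keys off the value of hour, so it assumes every page's hours start at 1 and run without gaps. If that does not hold (which may be the case for the real dataset mentioned in the question's edit), one alternative is to chunk by row position within each page instead. A minimal sketch, assuming the rows are already sorted by page and hour; `relabel_pages`, `size`, and the `sample` frame are hypothetical names made up for illustration, not part of the original post:

```python
import pandas as pd

def relabel_pages(df, size=3):
    """Return a copy of df whose 'page' labels are split into chunks of at most `size` rows."""
    out = df.copy()
    # 0-based position of each row within its page group
    pos = out.groupby('page').cumcount()
    # 0 for the first chunk, 1 for the second, ...
    chunk = pos // size
    page = out['page'].astype(str)
    # Keep the plain label for the first chunk, append a suffix for later ones
    out['page'] = page.where(chunk == 0, page + '_' + chunk.astype(str))
    return out

sample = pd.DataFrame({
    'page': [3727458] * 6 + [3735634] * 3,
    'hour': [5, 6, 7, 8, 9, 10, 1, 2, 3],  # hours need not start at 1
    'count': [326, 2348, 4040, 374, 2917, 3937, 1957, 2398, 2812],
})
result = relabel_pages(sample)
print(result)
```

Because the chunk number comes from `cumcount()` rather than from hour, the same function should handle a page with any number of rows, at the cost of relying on row order.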

Answer 1: (score: 2)

How about using the // operator for integer division?

In [164]:

df.page.astype(str)+np.where(df.hour>3, 
                             '_'+((df.hour.astype(int)-1)//3).astype(str),
                             '')
# assign this result to df['page'] to overwrite the column (needs: import numpy as np)
Out[164]:
0       3727441
1       3727441
2       3727441
3       3727458
4       3727458
5       3727458
6     3727458_1
7     3727458_1
8     3727458_1
9       3735634
10      3735634
11      3735634
12      3768433
13      3768433
14      3768433
15    3768433_1
16    3768433_1
17    3768433_1
18    3768433_2
19    3768433_2
20    3768433_2
Name: page, dtype: object
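
For reference, a self-contained sketch of the same idea, with the imports the snippet needs and the result assigned back to the column. The small `demo` frame is made up for illustration, and the sketch assumes hour is an integer column that starts at 1 for every page:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'page': [3727441] * 3 + [3727458] * 6,
    'hour': [1, 2, 3, 1, 2, 3, 4, 5, 6],
})

# (hour - 1) // 3 maps hours 1-3 to 0, 4-6 to 1, 7-9 to 2, ...
chunk = (demo['hour'] - 1) // 3
# Empty suffix for the first chunk, '_<n>' for later chunks
suffix = np.where(chunk > 0, '_' + chunk.astype(str), '')
demo['page'] = demo['page'].astype(str) + suffix
print(demo)
```

Testing `chunk > 0` is equivalent to the `df.hour > 3` condition in the answer, since both are true exactly when hour is 4 or more.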

Answer 2: (score: 1)

It looks like you want to relabel the index on your groupby result (I assume it is named `hourly_groups`):

hourly_groups.reset_index(inplace=True)
hourly_groups['page'] = hourly_groups.page.apply(lambda x: str(x)) + hourly_groups.hour.apply(lambda x: '_1' if 3 < x <= 6 else ('_2' if x > 6 else ""))
hourly_groups.set_index(['page', 'hour'], inplace=True)

>>> hourly_groups
                count
page      hour       
3727441   1      2003
          2       654
          3      5434
3727458   1       326
          2      2348
          3      4040
3727458_1 4       374
          5      2917
          6      3937
3735634   1      1957
          2      2398
          3      2812
3768433   1       499
          2      4924
          3      5460
3768433_1 4      1710
          5      3877
          6      1912
3768433_2 7      1367
          8      1626
          9      4750
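
Note that the lambda above hard-codes the suffixes `_1` and `_2`, so every hour above 6 ends up under `_2`. For pages with more than 9 hours it can help to check whether any label still holds more than three rows. A minimal sketch of such a check, run here on a hand-built frame (`relabelled` is a hypothetical name) that mirrors part of the output above:

```python
import pandas as pd

relabelled = pd.DataFrame({
    'page': ['3768433'] * 3 + ['3768433_1'] * 3 + ['3768433_2'] * 3,
    'hour': list(range(1, 10)),
    'count': [499, 4924, 5460, 1710, 3877, 1912, 1367, 1626, 4750],
}).set_index(['page', 'hour'])

# Number of rows under each page label (first level of the MultiIndex)
sizes = relabelled.groupby(level='page').size()
print(sizes)

# Labels that still hold more than three rows; empty if the relabelling worked
too_long = sizes[sizes > 3]
print(too_long.empty)
```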