I have a pandas groupby DataFrame like the one below, grouped on page:
+---------+------+-------+
| page | hour | count |
+---------+------+-------+
| 3727441 | 1 | 2003 |
| 3727441 | 2 | 654 |
| 3727441 | 3 | 5434 |
| 3727458 | 1 | 326 |
| 3727458 | 2 | 2348 |
| 3727458 | 3 | 4040 |
| 3727458 | 4 | 374 |
| 3727458 | 5 | 2917 |
| 3727458 | 6 | 3937 |
| 3735634 | 1 | 1957 |
| 3735634 | 2 | 2398 |
| 3735634 | 3 | 2812 |
| 3768433 | 1 | 499 |
| 3768433 | 2 | 4924 |
| 3768433 | 3 | 5460 |
| 3768433 | 4 | 1710 |
| 3768433 | 5 | 3877 |
| 3768433 | 6 | 1912 |
| 3768433 | 7 | 1367 |
| 3768433 | 8 | 1626 |
| 3768433 | 9 | 4750 |
+---------+------+-------+
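You'll notice that some groups have 3, 6, or 9 rows, one per hour. What I'd like to do is cut each group into chunks of at most 3 hours, and then append something to the page value in the 6- and 9-row groups to show that it is still the same page, like this: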
+-----------+------+-------+
| page | hour | count |
+-----------+------+-------+
| 3727441 | 1 | 2003 |
| 3727441 | 2 | 654 |
| 3727441 | 3 | 5434 |
| 3727458 | 1 | 326 |
| 3727458 | 2 | 2348 |
| 3727458 | 3 | 4040 |
| 3727458_1 | 4 | 374 |
| 3727458_1 | 5 | 2917 |
| 3727458_1 | 6 | 3937 |
| 3735634 | 1 | 1957 |
| 3735634 | 2 | 2398 |
| 3735634 | 3 | 2812 |
| 3768433 | 1 | 499 |
| 3768433 | 2 | 4924 |
| 3768433 | 3 | 5460 |
| 3768433_1 | 4 | 1710 |
| 3768433_1 | 5 | 3877 |
| 3768433_1 | 6 | 1912 |
| 3768433_2 | 7 | 1367 |
| 3768433_2 | 8 | 1626 |
| 3768433_2 | 9 | 4750 |
+-----------+------+-------+
I tried to get started on this with enumerate:
for name, group in hourly_groups:
    for i, x in enumerate(group):
        print(x)
but it did not return the correct groups.
I also tried the following:
for k, g in df_hourly.groupby(df_hourly['page'] - 3):
    print(g)
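Edit:
I truncated my data thinking it would be easier to understand, but the solutions given don't seem to work on my actual dataset. Here is a sample of the actual dataset, in which page 3694750 is an example of a page that needs to be split into 34 groups. http://www.sharecsv.com/s/b2dbe8e49d6a2481de138f6ca06c679e/test.csv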
Answer 0 (score: 2)
This does what you want, using the df.apply method:
import pandas as pd
cols = ['page', 'hour', 'count']
data = [
(3727441, 1, 2003),
(3727441, 2, 654),
(3727441, 3, 5434),
(3727458, 1, 326),
(3727458, 2, 2348),
(3727458, 3, 4040),
(3727458, 4, 374),
(3727458, 5, 2917),
(3727458, 6, 3937),
(3735634, 1, 1957),
(3735634, 2, 2398),
(3735634, 3, 2812),
(3768433, 1, 499),
(3768433, 2, 4924),
(3768433, 3, 5460),
(3768433, 4, 1710),
(3768433, 5, 3877),
(3768433, 6, 1912),
(3768433, 7, 1367),
(3768433, 8, 1626),
(3768433, 9, 4750),
]
df = pd.DataFrame.from_records(data, columns=cols)
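def f(row):
    # Integer division: hours 1-3 -> 0, 4-6 -> 1, 7-9 -> 2, ...
    # (plain / would give a float under Python 3 and mislabel hours 2 and 3)
    n = (row.hour - 1) // 3
    if n > 0:
        return '{0}_{1}'.format(row.page, n)
    # Return a string here as well so the column holds a single type
    return str(row.page)

df['page'] = df.apply(f, axis=1)
print(df)
Output: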
# page hour count
# 0 3727441 1 2003
# 1 3727441 2 654
# 2 3727441 3 5434
# 3 3727458 1 326
# 4 3727458 2 2348
# 5 3727458 3 4040
# 6 3727458_1 4 374
# 7 3727458_1 5 2917
# 8 3727458_1 6 3937
# 9 3735634 1 1957
# 10 3735634 2 2398
# 11 3735634 3 2812
# 12 3768433 1 499
# 13 3768433 2 4924
# 14 3768433 3 5460
# 15 3768433_1 4 1710
# 16 3768433_1 5 3877
# 17 3768433_1 6 1912
# 18 3768433_2 7 1367
# 19 3768433_2 8 1626
# 20 3768433_2 9 4750
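The suffix in the function above comes from integer division: hours 1 to 3 fall in chunk 0, hours 4 to 6 in chunk 1, and so on. A minimal sketch of that arithmetic in plain Python follows; `chunk_label` and its `size` parameter are hypothetical names introduced here for illustration, and the sketch assumes hours are 1-based integers.

```python
def chunk_label(page, hour, size=3):
    """Return the page label for a 1-based hour, split into chunks of `size` hours.

    Hypothetical helper for illustration only; it is not part of pandas.
    """
    n = (hour - 1) // size  # 0 for hours 1..size, 1 for the next `size` hours, ...
    return str(page) if n == 0 else "{0}_{1}".format(page, n)


# Labels for hours 1 through 9 of a single page.
labels = [chunk_label(3768433, h) for h in range(1, 10)]
print(labels)
```

Because the chunk number is computed rather than looked up, the same function should keep working for pages with many more hours than the nine shown in the sample.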
Answer 1 (score: 2)
How about using the // operator for integer division?
In [164]:
# assumes numpy has been imported as np
df.page.astype(str) + np.where(df.hour > 3,
                               '_' + ((df.hour.astype(int) - 1) // 3).astype(str),
                               '')
# overwrite df['page'] with this
Out[164]:
0 3727441
1 3727441
2 3727441
3 3727458
4 3727458
5 3727458
6 3727458_1
7 3727458_1
8 3727458_1
9 3735634
10 3735634
11 3735634
12 3768433
13 3768433
14 3768433
15 3768433_1
16 3768433_1
17 3768433_1
18 3768433_2
19 3768433_2
20 3768433_2
Name: page, dtype: object
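To run the expression above outside an IPython session, something like the following self-contained sketch may help. It assumes `hour` is a 1-based integer column; the small `df` built here is a subset of the question's sample data, and writing the result back to `df['page']` follows the comment in the answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "page": [3727441] * 3 + [3727458] * 6,
    "hour": [1, 2, 3, 1, 2, 3, 4, 5, 6],
    "count": [2003, 654, 5434, 326, 2348, 4040, 374, 2917, 3937],
})

# Chunk number per row: 0 for hours 1-3, 1 for hours 4-6, ...
chunk = (df["hour"] - 1) // 3

# Append "_<chunk>" only where the chunk number is non-zero.
suffix = np.where(chunk > 0, "_" + chunk.astype(str), "")
df["page"] = df["page"].astype(str) + suffix

# Each relabelled page should now contain at most 3 rows.
sizes = df.groupby("page", sort=False).size()
print(sizes)
```

Testing `chunk > 0` instead of `df.hour > 3` is a small design choice: it keeps the condition tied to the chunk size in one place.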
Answer 2 (score: 1)
It looks like you want to relabel the index on your groupby result (I'm assuming it is named `hourly_groups`):
hourly_groups.reset_index(inplace=True)
hourly_groups['page'] = hourly_groups.page.apply(lambda x: str(x)) + hourly_groups.hour.apply(lambda x: '_1' if 3 < x <= 6 else ('_2' if x > 6 else ""))
hourly_groups.set_index(['page', 'hour'], inplace=True)
>>> hourly_groups
count
page hour
3727441 1 2003
2 654
3 5434
3727458 1 326
2 2348
3 4040
3727458_1 4 374
5 2917
6 3937
3735634 1 1957
2 2398
3 2812
3768433 1 499
2 4924
3 5460
3768433_1 4 1710
5 3877
6 1912
3768433_2 7 1367
8 1626
9 4750
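The hard-coded `_1` and `_2` suffixes above cover at most nine hours. For pages with many more rows, such as the one mentioned in the question's edit, one possible generalization is to number the rows within each page with `groupby(...).cumcount()` and integer-divide by the chunk size. This is only a sketch, under the assumption that chunks should be formed by row position within each page, so it does not rely on hours starting at 1 or being contiguous. The data below is made up for illustration.

```python
import pandas as pd

# Made-up data: one page with 7 rows whose hours do not start at 1.
df = pd.DataFrame({
    "page": [3694750] * 7 + [3727441] * 3,
    "hour": [5, 6, 7, 8, 9, 10, 11, 1, 2, 3],
    "count": list(range(10)),
})

# 0-based position of each row within its page, then the chunk number (size 3).
chunk = df.groupby("page").cumcount() // 3

# Keep the bare page id for the first chunk; add "_<n>" for later chunks.
page = df["page"].astype(str)
df["page"] = page.where(chunk == 0, page + "_" + chunk.astype(str))

# Rebuild the (page, hour) index used in the answer above.
result = df.set_index(["page", "hour"])
print(result)
```

If the chunks should instead follow the hour values themselves, replacing the `cumcount()` expression with `(df["hour"] - 1) // 3` gives the behaviour of the other answers.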