groupby on a DataFrame, with a new column indicating the group

Time: 2016-07-24 19:24:33

Tags: python pandas

I have a DataFrame with a timestamp column

from datetime import datetime
import pandas as pd

d1 = pd.DataFrame({'a': [datetime(2015,1,1,20,2,1), datetime(2015,1,1,20,14,58),
    datetime(2015,1,1,20,17,5), datetime(2015,1,1,20,31,5),
    datetime(2015,1,1,20,34,28), datetime(2015,1,1,20,37,51), datetime(2015,1,1,20,41,19),
    datetime(2015,1,1,20,49,4), datetime(2015,1,1,20,59,21)], 'b': [2,4,26,22,45,3,8,121,34]})


                    a    b
0 2015-01-01 20:02:01    2
1 2015-01-01 20:14:58    4
2 2015-01-01 20:17:05   26
3 2015-01-01 20:31:05   22
4 2015-01-01 20:34:28   45
5 2015-01-01 20:37:51    3
6 2015-01-01 20:41:19    8
7 2015-01-01 20:49:04  121
8 2015-01-01 20:59:21   34

I can group it into 15-minute intervals

d2=d1.set_index('a')

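# pd.Grouper(freq=...) replaces pd.TimeGrouper, which was deprecated and later removed from pandas
d3=d2.groupby(pd.Grouper(freq='15Min'))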

The number of rows in each group is given by

d3.size()

a
2015-01-01 20:00:00    2
2015-01-01 20:15:00    1
2015-01-01 20:30:00    4
2015-01-01 20:45:00    2
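
As a side note, the same per-interval counts can probably be obtained without moving the timestamps into the index. The sketch below assumes `pd.Grouper` is given `key='a'` so that it bins on the column directly; the variable name `sizes` is only illustrative, and the frequency is spelled `'15min'`.

```python
from datetime import datetime
import pandas as pd

d1 = pd.DataFrame({
    'a': [datetime(2015, 1, 1, 20, 2, 1), datetime(2015, 1, 1, 20, 14, 58),
          datetime(2015, 1, 1, 20, 17, 5), datetime(2015, 1, 1, 20, 31, 5),
          datetime(2015, 1, 1, 20, 34, 28), datetime(2015, 1, 1, 20, 37, 51),
          datetime(2015, 1, 1, 20, 41, 19), datetime(2015, 1, 1, 20, 49, 4),
          datetime(2015, 1, 1, 20, 59, 21)],
    'b': [2, 4, 26, 22, 45, 3, 8, 121, 34],
})

# key='a' asks the Grouper to bin on column 'a' instead of on the index,
# so d1 keeps its original integer index and both columns.
sizes = d1.groupby(pd.Grouper(key='a', freq='15min')).size()
print(sizes)
```

This leaves `d1` itself untouched, which is convenient when the new column has to be attached to the original frame afterwards.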

I would like my original DataFrame to have a column holding the unique number of the group each row belongs to. For example, the first group

2015-01-01 20:00:00 

has 2 rows, so the first two rows of the new column in d1 should have the number 1

The second group

2015-01-01 20:15:00 

has 1 row, so the third row of the new column in d1 should have the number 2

The third group

2015-01-01 20:30:00 

has 4 rows, so the fourth, fifth, sixth and seventh rows of the new column in d1 should have the number 3

I would like my new DataFrame to look like this

                    a    b   c
0 2015-01-01 20:02:01    2   1
1 2015-01-01 20:14:58    4   1
2 2015-01-01 20:17:05   26   2
3 2015-01-01 20:31:05   22   3
4 2015-01-01 20:34:28   45   3
5 2015-01-01 20:37:51    3   3
6 2015-01-01 20:41:19    8   3
7 2015-01-01 20:49:04  121   4
8 2015-01-01 20:59:21   34   4

1 answer:

Answer 0 (score: 1)

Use .transform() on the groupby object together with an itertools.count iterator:

from datetime import datetime
from itertools import count
import pandas as pd

d1 = pd.DataFrame({'a': [datetime(2015,1,1,20,2,1), datetime(2015,1,1,20,14,58),
                         datetime(2015,1,1,20,17,5), datetime(2015,1,1,20,31,5),
                         datetime(2015,1,1,20,34,28), datetime(2015,1,1,20,37,51),
                         datetime(2015,1,1,20,41,19), datetime(2015,1,1,20,49,4),
                         datetime(2015,1,1,20,59,21)],
                   'b': [2, 4, 26, 22, 45, 3, 8, 121, 34]})
d2 = d1.set_index('a')

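counter = count(1)  # yields 1, 2, 3, ... on successive next() calls
# pd.Grouper(freq=...) replaces pd.TimeGrouper, which was deprecated and later removed from pandas
d2['c'] = (d2.groupby(pd.Grouper(freq='15Min'))['b']
             .transform(lambda x: next(counter)))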
print(d2)

Output:

                       b  c
a                          
2015-01-01 20:02:01    2  1
2015-01-01 20:14:58    4  1
2015-01-01 20:17:05   26  2
2015-01-01 20:31:05   22  3
2015-01-01 20:34:28   45  3
2015-01-01 20:37:51    3  3
2015-01-01 20:41:19    8  3
2015-01-01 20:49:04  121  4
2015-01-01 20:59:21   34  4
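
A possible alternative, not part of the answer above, avoids the stateful counter and keeps `d1` in the shape shown in the question (integer index, columns a, b, c). This is a minimal sketch under two assumptions: the rows are already sorted by time, and bins that contain no rows should not consume a group number. It floors each timestamp to its 15-minute boundary with `Series.dt.floor` and numbers the distinct boundaries with `pd.factorize`; the names `bins` and `codes` are illustrative.

```python
from datetime import datetime
import pandas as pd

d1 = pd.DataFrame({
    'a': [datetime(2015, 1, 1, 20, 2, 1), datetime(2015, 1, 1, 20, 14, 58),
          datetime(2015, 1, 1, 20, 17, 5), datetime(2015, 1, 1, 20, 31, 5),
          datetime(2015, 1, 1, 20, 34, 28), datetime(2015, 1, 1, 20, 37, 51),
          datetime(2015, 1, 1, 20, 41, 19), datetime(2015, 1, 1, 20, 49, 4),
          datetime(2015, 1, 1, 20, 59, 21)],
    'b': [2, 4, 26, 22, 45, 3, 8, 121, 34],
})

# Floor every timestamp to the start of its 15-minute bin,
# e.g. 20:14:58 -> 20:00:00 and 20:17:05 -> 20:15:00.
bins = d1['a'].dt.floor('15min')

# factorize assigns integer codes 0, 1, 2, ... in order of first appearance;
# adding 1 makes the group numbers start at 1 as in the question.
codes, _ = pd.factorize(bins)
d1['c'] = codes + 1
print(d1)
```

Because the codes follow order of first appearance, the numbering only matches chronological order when the rows are sorted by column a, as they are here.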