I am trying to create a new variable that counts how many times the same ID has been seen so far over time.
I need to go from this data frame
id clae6 year quarter
1 475230.0 2007 1
1 475230.0 2007 2
1 475230.0 2007 3
1 475230.0 2007 4
1 475230.0 2008 1
1 475230.0 2008 2
2 475230.0 2007 1
2 475230.0 2007 2
2 475230.0 2007 3
2 475230.0 2007 4
2 475230.0 2008 1
3 475230.0 2010 1
3 475230.0 2010 2
3 475230.0 2010 3
3 475230.0 2010 4
to this one
id clae6 year quarter new_variable
1 475230.0 2007 1 1
1 475230.0 2007 2 2
1 475230.0 2007 3 3
1 475230.0 2007 4 4
1 475230.0 2008 1 5
1 475230.0 2008 2 6
2 475230.0 2007 1 1
2 475230.0 2007 2 2
2 475230.0 2007 3 3
2 475230.0 2007 4 4
2 475230.0 2008 1 5
3 475230.0 2010 1 1
3 475230.0 2010 2 2
3 475230.0 2010 3 3
3 475230.0 2010 4 4
I am currently using the following code, but maybe there is an easier way (I work with a lot of records, so I am looking for faster code):
df['control'] = 1
df['new_variable'] = df.groupby(['id'])['control'].cumsum()
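Answer 0 (score: 2)
You can use rank. Rank a single chronological key within each id; calling rank on the whole grouped frame returns one column per remaining column, which cannot be assigned to a single new column.
df['new'] = (df['year'] * 4 + df['quarter']).groupby(df['id']).rank(method='first').astype(int)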
id clae6 year quarter new
0 1 475230.0 2007 1 1
1 1 475230.0 2007 2 2
2 1 475230.0 2007 3 3
3 1 475230.0 2007 4 4
4 1 475230.0 2008 1 5
5 1 475230.0 2008 2 6
6 2 475230.0 2007 1 1
7 2 475230.0 2007 2 2
8 2 475230.0 2007 3 3
9 2 475230.0 2007 4 4
10 2 475230.0 2008 1 5
11 3 475230.0 2010 1 1
12 3 475230.0 2010 2 2
13 3 475230.0 2010 3 3
14 3 475230.0 2010 4 4
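One thing worth checking with the rank approach is what happens when the rows are not already in chronological order. The sketch below uses a small made-up frame (not the asker's data) and assumes the rank is taken over a combined year/quarter key; the column names by_time and by_row are illustrative only.

```python
import pandas as pd

# Hypothetical sample with rows deliberately out of chronological order.
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "year": [2008, 2007, 2007, 2007, 2007],
    "quarter": [1, 2, 1, 2, 1],
})

# A single sortable key per row: later periods get larger numbers.
period = df["year"] * 4 + df["quarter"]

# Rank by time within each id: position in chronological order.
df["by_time"] = period.groupby(df["id"]).rank(method="first").astype(int)

# Running count within each id: position in row order.
df["by_row"] = df.groupby("id").cumcount() + 1

print(df)
```

On data that is already sorted by id, year and quarter, as in the question, the two columns coincide. On unsorted data they differ, so which one you want depends on whether "seen so far" means row order or calendar order.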
Answer 1 (score: 2)
Use cumcount. It numbers the rows of each group starting from 0, so add 1 to start the count at 1:
df.groupby('id').cumcount().add(1)
Out[1574]:
0 1
1 2
2 3
3 4
4 5
5 6
6 1
7 2
8 3
9 4
10 5
11 1
12 2
13 3
14 4
dtype: int64
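To assign the result as a column and confirm that it matches the helper-column approach from the question, here is a minimal self-contained sketch. It rebuilds the sample data from the question; the variable names via_cumsum and via_cumcount are only for illustration.

```python
import pandas as pd

# Rebuild the sample frame from the question.
df = pd.DataFrame({
    "id": [1] * 6 + [2] * 5 + [3] * 4,
    "clae6": [475230.0] * 15,
    "year": [2007] * 4 + [2008] * 2 + [2007] * 4 + [2008] + [2010] * 4,
    "quarter": [1, 2, 3, 4, 1, 2, 1, 2, 3, 4, 1, 1, 2, 3, 4],
})

# The question's approach: cumulative sum over a helper column of ones.
via_cumsum = df.assign(control=1).groupby("id")["control"].cumsum()

# cumcount numbers rows within each group from 0, so shift by 1.
via_cumcount = df.groupby("id").cumcount() + 1

df["new_variable"] = via_cumcount
print(df)
```

cumcount avoids creating the extra control column. Whether it is actually faster on a large frame is something to measure on your own data, for example with timeit.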