I have a df that looks like this:
ID Component IDDate EmployeeID CreateUserID
24 1 2017-09-11 00:00:00.000 0907036 Afior
24 2 2017-09-11 00:00:00.000 0907036 Afior
24 3 2017-09-11 00:00:00.000 0907036 Afior
25 1 2017-09-12 00:00:00.000 0907036 Afior
25 3 2017-09-12 00:00:00.000 0907036 Afior
26 8 2017-09-16 00:00:00.000 1013842 JHyde
26 11 2017-09-16 00:00:00.000 1013842 JHyde
26 12 2017-09-16 00:00:00.000 1013842 JHyde
26 23 2017-09-16 00:00:00.000 1013842 JHyde
27 21 2017-09-16 00:00:00.000 0907036 Afior
27 22 2017-09-16 00:00:00.000 0907036 Afior
27 23 2017-09-16 00:00:00.000 0907036 Afior
28 15 2017-10-16 00:00:00.000 1013842 JHyde
28 16 2017-10-16 00:00:00.000 1013842 JHyde
28 19 2017-10-16 00:00:00.000 1013842 JHyde
28 25 2017-10-16 00:00:00.000 1013842 JHyde
28 26 2017-10-16 00:00:00.000 1013842 JHyde
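I am trying to use cumcount to create a variable that holds the order of observations for each ID / EmployeeID combination. I have not been able to apply the count at the level I need. I have tried several variants of cumcount(), none of which quite gets me where I want to be, for example: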
df['seq'] = df.groupby(['EmployeeID', 'ID', 'Date']).cumcount().add(1)
df['seq'] = df.groupby(['EmployeeID', 'Date']).cumcount().add(1)
df['seq'] = df.groupby(['EmployeeID', 'ID']).cumcount().add(1)
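For reference, a minimal sketch of what cumcount() does with the last grouping above. It numbers the rows inside each group, so the counter restarts on every new ID instead of numbering the IDs themselves. The small frame here is a cut-down stand-in for the real data, kept only to show the behavior.

```python
import pandas as pd

# Cut-down stand-in for the data above: one employee, two IDs.
df = pd.DataFrame({
    "ID": [24, 24, 24, 25, 25],
    "EmployeeID": ["0907036"] * 5,
})

# cumcount() numbers rows within each (EmployeeID, ID) group, starting at 0;
# add(1) shifts that to start at 1.
df["seq"] = df.groupby(["EmployeeID", "ID"]).cumcount().add(1)

print(df["seq"].tolist())  # [1, 2, 3, 1, 2], not the desired [1, 1, 1, 2, 2]
```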
Ideally, my output would look like this:
ID Component IDDate EmployeeID CreateUserID seq
24 1 2017-09-11 00:00:00.000 0907036 Afior 1
24 2 2017-09-11 00:00:00.000 0907036 Afior 1
24 3 2017-09-11 00:00:00.000 0907036 Afior 1
25 1 2017-09-12 00:00:00.000 0907036 Afior 2
25 3 2017-09-12 00:00:00.000 0907036 Afior 2
26 8 2017-09-16 00:00:00.000 1013842 JHyde 1
26 11 2017-09-16 00:00:00.000 1013842 JHyde 1
26 12 2017-09-16 00:00:00.000 1013842 JHyde 1
26 23 2017-09-16 00:00:00.000 1013842 JHyde 1
27 21 2017-09-16 00:00:00.000 0907036 Afior 3
27 22 2017-09-16 00:00:00.000 0907036 Afior 3
27 23 2017-09-16 00:00:00.000 0907036 Afior 3
28 15 2017-10-16 00:00:00.000 1013842 JHyde 2
28 16 2017-10-16 00:00:00.000 1013842 JHyde 2
28 19 2017-10-16 00:00:00.000 1013842 JHyde 2
28 25 2017-10-16 00:00:00.000 1013842 JHyde 2
28 26 2017-10-16 00:00:00.000 1013842 JHyde 2
Is there a way to handle the duplicates that gets me this output? Would it be better to make the df wide first and then apply cumcount()?
Answer 0 (score: 1)
If I understand correctly, you can convert it to categorical data and take the codes:
df[['IDDate','EmployeeID']].apply(tuple,1).groupby(df['CreateUserID']).apply(lambda x : x.astype('category').cat.codes+1)
Out[8]:
0 1
1 1
2 1
3 2
4 2
5 1
6 1
7 1
8 1
9 3
10 3
11 3
12 2
13 2
14 2
15 2
16 2
dtype: int8
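A small sketch of how category codes behave, assuming plain date strings rather than the tuples used above. When the values are sortable, astype('category') orders the categories by value, so the codes follow the sorted values and not the order in which they first appear. With data already sorted by date, as in the example, the two coincide.

```python
import pandas as pd

# Already sorted by date: codes match the order of appearance.
sorted_dates = pd.Series(["2017-09-11", "2017-09-11", "2017-09-12", "2017-09-16"])
sorted_codes = sorted_dates.astype("category").cat.codes + 1
print(sorted_codes.tolist())  # [1, 1, 2, 3]

# Not sorted: the later date appears first but still gets the higher code.
unsorted_dates = pd.Series(["2017-09-16", "2017-09-11", "2017-09-16"])
unsorted_codes = unsorted_dates.astype("category").cat.codes + 1
print(unsorted_codes.tolist())  # [2, 1, 2]
```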
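Answer 1 (score: 1)
Here is one way: essentially group by EmployeeID only, check whether ID changes from one row to the next, and then take the cumsum of those changes (this is based on your attempts and desired output).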
df['seq'] = df.groupby('EmployeeID')['ID'].transform(lambda x: x.ne(x.shift()).cumsum())
>>> df
ID Component IDDate EmployeeID CreateUserID seq
0 24 1 2017-09-11 00:00:00.000 907036 Afior 1
1 24 2 2017-09-11 00:00:00.000 907036 Afior 1
2 24 3 2017-09-11 00:00:00.000 907036 Afior 1
3 25 1 2017-09-12 00:00:00.000 907036 Afior 2
4 25 3 2017-09-12 00:00:00.000 907036 Afior 2
5 26 8 2017-09-16 00:00:00.000 1013842 JHyde 1
6 26 11 2017-09-16 00:00:00.000 1013842 JHyde 1
7 26 12 2017-09-16 00:00:00.000 1013842 JHyde 1
8 26 23 2017-09-16 00:00:00.000 1013842 JHyde 1
9 27 21 2017-09-16 00:00:00.000 907036 Afior 3
10 27 22 2017-09-16 00:00:00.000 907036 Afior 3
11 27 23 2017-09-16 00:00:00.000 907036 Afior 3
12 28 15 2017-10-16 00:00:00.000 1013842 JHyde 2
13 28 16 2017-10-16 00:00:00.000 1013842 JHyde 2
14 28 19 2017-10-16 00:00:00.000 1013842 JHyde 2
15 28 25 2017-10-16 00:00:00.000 1013842 JHyde 2
16 28 26 2017-10-16 00:00:00.000 1013842 JHyde 2
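To make the steps inside the lambda easier to follow, here is a sketch on a single employee's ID column (the values are made up for illustration). Comparing each value with the previous one flags the rows where a new ID starts, and the cumulative sum of those flags gives the sequence number.

```python
import pandas as pd

# Illustrative ID column for one employee.
ids = pd.Series([24, 24, 24, 25, 25, 27, 27])

# True wherever the ID differs from the previous row. The first row is
# compared with NaN, so it is always flagged as a change.
changed = ids.ne(ids.shift())

# Each True adds 1, so rows sharing a run of the same ID share a number.
seq = changed.cumsum()

print(changed.tolist())  # [True, False, False, True, False, True, False]
print(seq.tolist())      # [1, 1, 1, 2, 2, 3, 3]
```

Note that this counts consecutive runs, so an ID that reappears later for the same employee after a different ID would get a new number.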
Answer 2 (score: 1)
Another way is to group by EmployeeID and then dense-rank Date:
In [187]: df.groupby("EmployeeID")["Date"].apply(lambda x: x.rank(method='dense')).astype(int)
Out[187]:
0 1
1 1
2 1
3 2
4 2
5 1
6 1
7 1
8 1
9 3
10 3
11 3
12 2
13 2
14 2
15 2
16 2
Name: Date, dtype: int64
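This ranks by value rather than by order of first appearance, although that makes no difference if the data is sorted by date, as it is in the example.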
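A sketch of where ranking by value and numbering by first appearance diverge, using made-up rows that are deliberately not sorted by date. The first-appearance numbering here uses pd.factorize on ID, which is an alternative I am assuming for illustration and not something taken from the answers above.

```python
import pandas as pd

# Made-up rows for one employee, with the latest date listed first.
df = pd.DataFrame({
    "ID": [30, 30, 31, 32],
    "Date": pd.to_datetime(["2017-09-16", "2017-09-16", "2017-09-11", "2017-09-12"]),
    "EmployeeID": ["0907036"] * 4,
})

# Dense rank by date value within each employee.
by_value = df.groupby("EmployeeID")["Date"].rank(method="dense").astype(int)

# Number the IDs in the order they first appear within each employee.
first_seen = df.groupby("EmployeeID")["ID"].transform(
    lambda s: pd.factorize(s)[0] + 1
)

print(by_value.tolist())    # [3, 3, 1, 2]
print(first_seen.tolist())  # [1, 1, 2, 3]
```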