Using cumcount() with dups

Time: 2018-07-23 16:49:17

Tags: python python-3.x pandas pandas-groupby

I have a df that looks like this:

ID Component IDDate                   EmployeeID CreateUserID
24 1         2017-09-11 00:00:00.000  0907036    Afior
24 2         2017-09-11 00:00:00.000  0907036    Afior
24 3         2017-09-11 00:00:00.000  0907036    Afior
25 1         2017-09-12 00:00:00.000  0907036    Afior
25 3         2017-09-12 00:00:00.000  0907036    Afior
26 8         2017-09-16 00:00:00.000  1013842    JHyde
26 11        2017-09-16 00:00:00.000  1013842    JHyde
26 12        2017-09-16 00:00:00.000  1013842    JHyde
26 23        2017-09-16 00:00:00.000  1013842    JHyde
27 21        2017-09-16 00:00:00.000  0907036    Afior
27 22        2017-09-16 00:00:00.000  0907036    Afior
27 23        2017-09-16 00:00:00.000  0907036    Afior
28 15        2017-10-16 00:00:00.000  1013842    JHyde
28 16        2017-10-16 00:00:00.000  1013842    JHyde
28 19        2017-10-16 00:00:00.000  1013842    JHyde
28 25        2017-10-16 00:00:00.000  1013842    JHyde
28 26        2017-10-16 00:00:00.000  1013842    JHyde

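I am trying to use cumcount to create a variable that holds the order of observations for each ID / EmployeeID combination. I have not been able to apply the count at the level I need. I have tried several variations of cumcount(), none of which gets me quite where I want to be, for example: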

df['seq'] = df.groupby(['EmployeeID', 'ID', 'Date']).cumcount().add(1)

df['seq'] = df.groupby(['EmployeeID', 'Date']).cumcount().add(1)

df['seq'] = df.groupby(['EmployeeID', 'ID']).cumcount().add(1)
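For reference, cumcount() numbers the rows inside each group rather than numbering the groups themselves, which is why these attempts restart at 1 for every ID. A minimal sketch on the first five rows of the sample (the small frame built here is an assumption for illustration, with only the two relevant columns):

```python
import pandas as pd

# First five rows of the sample data, ID and EmployeeID only
df = pd.DataFrame({
    "ID": [24, 24, 24, 25, 25],
    "EmployeeID": ["0907036"] * 5,
})

# cumcount() counts rows within each (EmployeeID, ID) group, starting at 0
seq = df.groupby(["EmployeeID", "ID"]).cumcount().add(1)
print(seq.tolist())  # [1, 2, 3, 1, 2]
```

The desired result for these rows is 1, 1, 1, 2, 2, so the counter has to advance per distinct ID within an employee, not per row.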

Ideally, my output would look like this:

ID Component IDDate                   EmployeeID CreateUserID seq
24 1         2017-09-11 00:00:00.000  0907036    Afior        1
24 2         2017-09-11 00:00:00.000  0907036    Afior        1
24 3         2017-09-11 00:00:00.000  0907036    Afior        1
25 1         2017-09-12 00:00:00.000  0907036    Afior        2
25 3         2017-09-12 00:00:00.000  0907036    Afior        2
26 8         2017-09-16 00:00:00.000  1013842    JHyde        1
26 11        2017-09-16 00:00:00.000  1013842    JHyde        1
26 12        2017-09-16 00:00:00.000  1013842    JHyde        1
26 23        2017-09-16 00:00:00.000  1013842    JHyde        1
27 21        2017-09-16 00:00:00.000  0907036    Afior        3
27 22        2017-09-16 00:00:00.000  0907036    Afior        3
27 23        2017-09-16 00:00:00.000  0907036    Afior        3
28 15        2017-10-16 00:00:00.000  1013842    JHyde        2
28 16        2017-10-16 00:00:00.000  1013842    JHyde        2
28 19        2017-10-16 00:00:00.000  1013842    JHyde        2
28 25        2017-10-16 00:00:00.000  1013842    JHyde        2
28 26        2017-10-16 00:00:00.000  1013842    JHyde        2

Is there a way to handle the dups that gets me this output? Would it be better to make the df wide first and then apply cumcount()?

3 answers:

Answer 0 (score: 1)

If I understand correctly, you can convert the data to categorical and take the codes:

df[['IDDate','EmployeeID']].apply(tuple,1).groupby(df['CreateUserID']).apply(lambda x : x.astype('category').cat.codes+1)
Out[8]: 
0     1
1     1
2     1
3     2
4     2
5     1
6     1
7     1
8     1
9     3
10    3
11    3
12    2
13    2
14    2
15    2
16    2
dtype: int8
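Category codes follow the sorted order of the categories when the values are sortable, not the order in which they first appear. If first-seen order matters, one possible alternative (not part of this answer) is pd.factorize inside a groupby transform. The sketch below rebuilds only the ID and EmployeeID columns from the question; the frame construction is an assumption made for illustration.

```python
import pandas as pd

# ID and EmployeeID columns rebuilt from the sample data (17 rows)
df = pd.DataFrame({
    "ID": [24] * 3 + [25] * 2 + [26] * 4 + [27] * 3 + [28] * 5,
    "EmployeeID": ["0907036"] * 5 + ["1013842"] * 4
                  + ["0907036"] * 3 + ["1013842"] * 5,
})

# factorize labels distinct values 0, 1, 2, ... in order of first appearance;
# applying it per employee numbers each employee's IDs separately
df["seq"] = df.groupby("EmployeeID")["ID"].transform(
    lambda s: pd.factorize(s)[0] + 1
)
print(df["seq"].tolist())
# [1, 1, 1, 2, 2, 1, 1, 1, 1, 3, 3, 3, 2, 2, 2, 2, 2]
```

Because transform keeps the original index, the result can be assigned straight back as a column.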

Answer 1 (score: 1)

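Here is one way: essentially group by EmployeeID only, check whether ID changes from one row to the next, and take the cumsum of those changes (this is based on your attempts and desired output).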

df['seq'] = df.groupby('EmployeeID')['ID'].transform(lambda x: x.ne(x.shift()).cumsum())

>>> df
    ID  Component                   IDDate  EmployeeID CreateUserID  seq
0   24          1  2017-09-11 00:00:00.000      907036        Afior    1
1   24          2  2017-09-11 00:00:00.000      907036        Afior    1
2   24          3  2017-09-11 00:00:00.000      907036        Afior    1
3   25          1  2017-09-12 00:00:00.000      907036        Afior    2
4   25          3  2017-09-12 00:00:00.000      907036        Afior    2
5   26          8  2017-09-16 00:00:00.000     1013842        JHyde    1
6   26         11  2017-09-16 00:00:00.000     1013842        JHyde    1
7   26         12  2017-09-16 00:00:00.000     1013842        JHyde    1
8   26         23  2017-09-16 00:00:00.000     1013842        JHyde    1
9   27         21  2017-09-16 00:00:00.000      907036        Afior    3
10  27         22  2017-09-16 00:00:00.000      907036        Afior    3
11  27         23  2017-09-16 00:00:00.000      907036        Afior    3
12  28         15  2017-10-16 00:00:00.000     1013842        JHyde    2
13  28         16  2017-10-16 00:00:00.000     1013842        JHyde    2
14  28         19  2017-10-16 00:00:00.000     1013842        JHyde    2
15  28         25  2017-10-16 00:00:00.000     1013842        JHyde    2
16  28         26  2017-10-16 00:00:00.000     1013842        JHyde    2
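To make the ne/shift/cumsum step easier to follow, here is a small sketch of the intermediate values on a single made-up Series (the values are hypothetical, chosen so that one ID reappears after a gap):

```python
import pandas as pd

# Hypothetical IDs for one employee; 24 shows up again after 25
s = pd.Series([24, 24, 25, 24])

# True wherever the value differs from the previous row (first row is always True)
changed = s.ne(s.shift())
print(changed.tolist())  # [True, False, True, True]

# Cumulative sum of the change flags numbers each consecutive run
runs = changed.cumsum()
print(runs.tolist())     # [1, 1, 2, 3]
```

Note that this numbers consecutive runs, so an ID that reappears later within the same employee gets a new number. In the sample data each employee's IDs are contiguous, so this does not come up.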

Answer 2 (score: 1)

Another approach is to group by EmployeeID and then take the dense rank of Date:

In [187]: df.groupby("EmployeeID")["Date"].apply(lambda x: x.rank(method='dense')).astype(int)
Out[187]: 
0     1
1     1
2     1
3     2
4     2
5     1
6     1
7     1
8     1
9     3
10    3
11    3
12    2
13    2
14    2
15    2
16    2
Name: Date, dtype: int64

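This ranks by value rather than by first-seen order, although that makes no difference if the data is sorted by date, as it is in the example.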
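A small sketch of that difference, using three made-up rows for a single employee where the later date comes first (the data, and the use of pd.to_datetime and transform, are assumptions for illustration rather than part of the answer):

```python
import pandas as pd

# Hypothetical rows: the later date appears before the earlier one
df = pd.DataFrame({
    "EmployeeID": ["0907036"] * 3,
    "Date": pd.to_datetime(["2017-09-16", "2017-09-16", "2017-09-11"]),
})

# Dense rank orders by date value, so the earliest date gets 1
by_value = (
    df.groupby("EmployeeID")["Date"]
      .transform(lambda s: s.rank(method="dense"))
      .astype(int)
)
print(by_value.tolist())  # [2, 2, 1]
```

A first-seen numbering would give 1, 1, 2 for the same rows, so the two only agree when each employee's rows are already in date order.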