How to count consecutive entries and reset when a field changes

Time: 2019-04-15 14:49:54

Tags: python pandas

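I am trying to count the number of consecutive emails each user has opened. My data is sorted by email address and date, and I can count the number of consecutive opens, but I can't figure out how to reset the count to 0 when a new email address appears.

Here is what I have so far. It does count consecutive opens, but it does not reset to 0 when the email address changes.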

in_a_row = []
count = 0

for row in merged['Opened?']:
    if row == 1:
        count += 1
        in_a_row.append(count)
    elif row == 0:
        count = 0
        in_a_row.append(count)
merged['in_a_row'] = in_a_row

This is what it currently looks like

Index   email_address   sent_date      sent_rank  Opened?   in_a_row
0   email_A@gmail.com   5/15/2018          1          1         1
1   email_A@gmail.com   5/23/2018          2          0         0
2   email_A@gmail.com   5/23/2018          3          1         1
3   email_B@gmail.com   5/26/2018          1          1         2
4   email_B@gmail.com   5/27/2018          2          1         3
5   email_B@gmail.com   8/2/2018           3          0         0
6   email_B@gmail.com   8/3/2018           4          1         1
7   email_B@gmail.com   12/12/2018         5          1         2
8   email_C@gmail.com   12/12/2018         1          1         3
9   email_C@gmail.com   2/6/2019           2          0         0
10  email_C@gmail.com   2/12/2019          3          1         1

This is what it should look like

Index   email_address   sent_date      sent_rank  Opened?   in_a_row
0   email_A@gmail.com   5/15/2018          1          1         1
1   email_A@gmail.com   5/23/2018          2          0         0
2   email_A@gmail.com   5/23/2018          3          1         1
3   email_B@gmail.com   5/26/2018          1          1         1
4   email_B@gmail.com   5/27/2018          2          1         2
5   email_B@gmail.com   8/2/2018           3          0         0
6   email_B@gmail.com   8/3/2018           4          1         1
7   email_B@gmail.com   12/12/2018         5          1         2
8   email_C@gmail.com   12/12/2018         1          1         1
9   email_C@gmail.com   2/6/2019           2          0         0
10  email_C@gmail.com   2/12/2019          3          1         1

1 Answer:

Answer 0: (score: 0)

Try groupby.transform with a lambda that uses .ne (!=), .shift, .cumsum, .cumcount and .add:

g = df.groupby('email_address')
df['in_a_row'] = g['Opened?'].transform(lambda x: x * (x.groupby((x.ne(x.shift())).cumsum()).cumcount().add(x)))
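
To make the one-liner easier to follow, here is a rough step-by-step sketch of the same idea. The sample frame, the shortened addresses ("A", "B", "C") and the helper name `streak` are made up for illustration and are not part of the original answer.

```python
import pandas as pd

# Hypothetical sample data mirroring the 'Opened?' column from the question.
df = pd.DataFrame({
    "email_address": ["A"] * 3 + ["B"] * 5 + ["C"] * 3,
    "Opened?": [1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1],
})

def streak(x: pd.Series) -> pd.Series:
    # A new run starts wherever the value differs from the previous row.
    run_id = x.ne(x.shift()).cumsum()
    # Zero-based position of each row inside its run.
    pos = x.groupby(run_id).cumcount()
    # Adding x makes runs of 1s count from 1; multiplying by x keeps 0s at 0.
    return x * pos.add(x)

# transform applies streak to each address separately, so counts never
# carry over from one address to the next.
df["in_a_row"] = df.groupby("email_address")["Opened?"].transform(streak)
print(df)
```

Because the run detection happens inside each group, the first row of every address always starts a fresh run.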

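Note: I think there may still be some typos in your expected output. For example, at idx 8 and 9, Opened? has different values in the input and the output.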

[out]

    Index      email_address   sent_date  sent_rank  Opened?  in_a_row
0       0  email_A@gmail.com   5/15/2018          1        1         1
1       1  email_A@gmail.com   5/23/2018          2        0         0
2       2  email_A@gmail.com   5/23/2018          3        1         1
3       3  email_B@gmail.com   5/26/2018          1        1         1
4       4  email_B@gmail.com   5/27/2018          2        1         2
5       5  email_B@gmail.com    8/2/2018          3        0         0
6       6  email_B@gmail.com    8/3/2018          4        1         1
7       7  email_B@gmail.com  12/12/2018          5        1         2
8       8  email_C@gmail.com  12/12/2018          1        1         1
9       9  email_C@gmail.com    2/6/2019          2        0         0
10     10  email_C@gmail.com   2/12/2019          3        1         1
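
If you would rather keep the explicit loop from the question, one possible fix is to remember the previous address and restart the counter when it changes. This is only a sketch: the function name `consecutive_opens` is hypothetical, and it assumes the rows are already sorted by email address and date, as described in the question.

```python
def consecutive_opens(emails, opened):
    """Running count of consecutive opens, restarted for each new address."""
    in_a_row = []
    count = 0
    prev_email = None
    for email, flag in zip(emails, opened):
        if email != prev_email:
            # New address: forget the previous user's streak.
            count = 0
            prev_email = email
        # Extend the streak on an open, otherwise reset it.
        count = count + 1 if flag == 1 else 0
        in_a_row.append(count)
    return in_a_row

# Shortened, made-up addresses standing in for the question's data.
emails = ["A", "A", "A", "B", "B", "B", "B", "B", "C", "C", "C"]
opened = [1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1]
print(consecutive_opens(emails, opened))
```

With the question's frame this would be called as merged['in_a_row'] = consecutive_opens(merged['email_address'], merged['Opened?']). The groupby version is likely faster on large frames, but the loop makes the reset logic explicit.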