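I am trying to count how many emails in a row each user has opened. My data is sorted by email address and date, and I can compute the number of consecutive opens, but I can't figure out how to reset the count to 0 when a new email address starts.
Here is what I have so far. It does count consecutive opens, but it doesn't reset to 0 when there is a new email address.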
in_a_row = []
count = 0
for row in merged['Opened?']:
    if row == 1:
        # Opened: extend the current streak
        count += 1
        in_a_row.append(count)
    elif row == 0:
        # Not opened: the streak ends
        count = 0
        in_a_row.append(count)
merged['in_a_row'] = in_a_row
This is what it currently looks like:
Index email_address sent_date sent_rank Opened? in_a_row
0 email_A@gmail.com 5/15/2018 1 1 1
1 email_A@gmail.com 5/23/2018 2 0 0
2 email_A@gmail.com 5/23/2018 3 1 1
3 email_B@gmail.com 5/26/2018 1 1 2
4 email_B@gmail.com 5/27/2018 2 1 3
5 email_B@gmail.com 8/2/2018 3 0 0
6 email_B@gmail.com 8/3/2018 4 1 1
7 email_B@gmail.com 12/12/2018 5 1 2
8 email_C@gmail.com 12/12/2018 1 1 3
9 email_C@gmail.com 2/6/2019 2 0 0
10 email_C@gmail.com 2/12/2019 3 1 1
This is what it should look like:
Index email_address sent_date sent_rank Opened? in_a_row
0 email_A@gmail.com 5/15/2018 1 1 1
1 email_A@gmail.com 5/23/2018 2 0 0
2 email_A@gmail.com 5/23/2018 3 1 1
3 email_B@gmail.com 5/26/2018 1 1 1
4 email_B@gmail.com 5/27/2018 2 1 2
5 email_B@gmail.com 8/2/2018 3 0 0
6 email_B@gmail.com 8/3/2018 4 1 1
7 email_B@gmail.com 12/12/2018 5 1 2
8 email_C@gmail.com 12/12/2018 1 1 1
9 email_C@gmail.com 2/6/2019 2 0 0
10 email_C@gmail.com 2/12/2019 3 1 1
Answer 0 (score: 0)
Try groupby.transform with a lambda that uses .ne (!=), .shift, .cumsum, and .add:
g = df.groupby('email_address')
df['in_a_row'] = g['Opened?'].transform(lambda x: x * (x.groupby((x.ne(x.shift())).cumsum()).cumcount().add(x)))
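To make the one-liner above easier to follow, here is a sketch that unpacks it into named steps. The helper name streak and the trimmed-down sample frame are assumptions made for illustration; the data values are copied from the question. It uses .add(1) instead of .add(x), which should give the same result for a 0/1 column because the final multiplication by x zeroes out the unopened rows either way.

```python
import pandas as pd

# Sample data copied from the question (only the two columns needed here).
df = pd.DataFrame({
    'email_address': ['email_A@gmail.com'] * 3
                     + ['email_B@gmail.com'] * 5
                     + ['email_C@gmail.com'] * 3,
    'Opened?': [1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1],
})

def streak(x):
    # x is the 'Opened?' Series for a single email address.
    # A new block starts wherever the value differs from the previous row.
    block_id = x.ne(x.shift()).cumsum()
    # Position of each row inside its block, counted from 1.
    position = x.groupby(block_id).cumcount().add(1)
    # Multiplying by x sets every unopened (0) row back to 0.
    return x * position

df['in_a_row'] = df.groupby('email_address')['Opened?'].transform(streak)
print(df['in_a_row'].tolist())
```

Because transform calls streak once per email address, the block numbering starts fresh for every address, which is what produces the reset.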
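Note: I think there may still be some typos in your expected output. For example, compare the Opened? values at idx 8 and 9 in the input and the output.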
[out]
Index email_address sent_date sent_rank Opened? in_a_row
0 0 email_A@gmail.com 5/15/2018 1 1 1
1 1 email_A@gmail.com 5/23/2018 2 0 0
2 2 email_A@gmail.com 5/23/2018 3 1 1
3 3 email_B@gmail.com 5/26/2018 1 1 1
4 4 email_B@gmail.com 5/27/2018 2 1 2
5 5 email_B@gmail.com 8/2/2018 3 0 0
6 6 email_B@gmail.com 8/3/2018 4 1 1
7 7 email_B@gmail.com 12/12/2018 5 1 2
8 8 email_C@gmail.com 12/12/2018 1 1 1
9 9 email_C@gmail.com 2/6/2019 2 0 0
10 10 email_C@gmail.com 2/12/2019 3 1 1
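If you would rather stay close to the loop from the question, one possible variation is to remember the previous email address and reset the counter whenever it changes. This is only a sketch: the function name count_in_a_row is hypothetical, and it assumes the rows are already sorted by email address and date, as described in the question.

```python
def count_in_a_row(emails, opened):
    """Running count of consecutive opens, restarted for each new email address."""
    in_a_row = []
    count = 0
    previous_email = None
    for email, flag in zip(emails, opened):
        if email != previous_email:
            # New email address: start counting from scratch.
            count = 0
            previous_email = email
        # Extend the streak on an open, otherwise reset it.
        count = count + 1 if flag == 1 else 0
        in_a_row.append(count)
    return in_a_row

# Values copied from the question's sample data.
emails = (['email_A@gmail.com'] * 3
          + ['email_B@gmail.com'] * 5
          + ['email_C@gmail.com'] * 3)
opened = [1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1]
result = count_in_a_row(emails, opened)
print(result)
```

With the DataFrame from the question, this could be applied as merged['in_a_row'] = count_in_a_row(merged['email_address'], merged['Opened?']). The groupby version is likely to be faster on large frames, but the loop makes the reset explicit.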