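I have a dataframe with customer information and their purchase details. I am trying to add a new column that flags every 3rd purchase made by the same customer.
Below is the dataframe: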
customer_name,bill_no,date
Mark,101,2018-10-01
Scott,102,2018-10-01
Pete,103,2018-10-02
Mark,104,2018-10-02
Mark,105,2018-10-04
Scott,106,2018-10-21
Julie,107,2018-10-03
Kevin,108,2018-10-07
Steve,109,2018-10-02
Mark,110,2018-10-06
Mark,111,2018-10-02
Mark,112,2018-10-05
Mark,113,2018-10-05
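I want to pick out every 3rd purchase made by the same customer. So, in this case, I want to add a flag for the following bill_no rows:
Mark,105,2018-10-04
Mark,112,2018-10-05
Basically, every bill whose running count for the same customer is a multiple of 3.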
Answer 0 (score: 5)
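n = 3
# Number each customer's purchases 1, 2, 3, ... in row order.
df['flag'] = df.groupby('customer_name').cumcount() + 1
# Flag the purchases whose running count is a multiple of n.
df['flag'] = ((df['flag'] % n) == 0).astype(int)
print(df)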
customer_name bill_no date flag
0 Mark 101 2018-10-01 0
1 Scott 102 2018-10-01 0
2 Pete 103 2018-10-02 0
3 Mark 104 2018-10-02 0
4 Mark 105 2018-10-04 1
5 Scott 106 2018-10-21 0
6 Julie 107 2018-10-03 0
7 Kevin 108 2018-10-07 0
8 Steve 109 2018-10-02 0
9 Mark 110 2018-10-06 0
10 Mark 111 2018-10-02 0
11 Mark 112 2018-10-05 1
12 Mark 113 2018-10-05 0
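To run this end to end, here is a minimal self-contained sketch that rebuilds the sample data from the question and applies the same cumcount approach. Loading the data through io.StringIO and the names csv_text, purchase_no and flagged_bills are assumptions added for illustration; they are not part of the original answer.

```python
import io

import pandas as pd

# Sample rows copied from the question; reading them via StringIO is an
# assumption made only so the example is self-contained.
csv_text = """customer_name,bill_no,date
Mark,101,2018-10-01
Scott,102,2018-10-01
Pete,103,2018-10-02
Mark,104,2018-10-02
Mark,105,2018-10-04
Scott,106,2018-10-21
Julie,107,2018-10-03
Kevin,108,2018-10-07
Steve,109,2018-10-02
Mark,110,2018-10-06
Mark,111,2018-10-02
Mark,112,2018-10-05
Mark,113,2018-10-05
"""
df = pd.read_csv(io.StringIO(csv_text))

n = 3
# cumcount() numbers each customer's rows 0, 1, 2, ... in row order;
# adding 1 turns that into a 1-based purchase counter.
purchase_no = df.groupby('customer_name').cumcount() + 1
# A purchase is flagged when its counter is a multiple of n.
df['flag'] = (purchase_no % n == 0).astype(int)

flagged_bills = df.loc[df['flag'] == 1, 'bill_no'].tolist()
print(flagged_bills)  # [105, 112]
```

Note that the count follows row order, not the date column. If chronological order is what you need, sorting by date before grouping would be one option.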
Answer 1 (score: 1)
If you actually need the index of those rows, use groupby + apply with index slicing:
n = 3
idx = df.groupby('customer_name', group_keys=False).apply(
    lambda x: x.index[n-1::n].to_series())
# So you can query these rows easily.
df.loc[idx]
customer_name bill_no date
4 Mark 105 2018-10-04
11 Mark 112 2018-10-05
Now, flag them using the index:
df['flag'] = 0
df.loc[idx, 'flag'] = 1
df
customer_name bill_no date flag
0 Mark 101 2018-10-01 0
1 Scott 102 2018-10-01 0
2 Pete 103 2018-10-02 0
3 Mark 104 2018-10-02 0
4 Mark 105 2018-10-04 1
5 Scott 106 2018-10-21 0
6 Julie 107 2018-10-03 0
7 Kevin 108 2018-10-07 0
8 Steve 109 2018-10-02 0
9 Mark 110 2018-10-06 0
10 Mark 111 2018-10-02 0
11 Mark 112 2018-10-05 1
12 Mark 113 2018-10-05 0
If performance matters, use Sandeep's solution instead.
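If you want both the index and to avoid apply, one possible middle ground is to build a boolean mask from cumcount and use it to select the index labels. This is only a sketch: the names mask and idx_fast are hypothetical, and no timing was measured here.

```python
import pandas as pd

# Customer and bill columns from the question (date omitted; it is not
# needed for the count).
df = pd.DataFrame({
    'customer_name': ['Mark', 'Scott', 'Pete', 'Mark', 'Mark', 'Scott',
                      'Julie', 'Kevin', 'Steve', 'Mark', 'Mark', 'Mark',
                      'Mark'],
    'bill_no': list(range(101, 114)),
})

n = 3
# cumcount() is 0-based, so every n-th purchase has remainder n - 1.
mask = df.groupby('customer_name').cumcount() % n == n - 1
# Index labels of the flagged rows, obtained without groupby.apply.
idx_fast = df.index[mask]

df['flag'] = 0
df.loc[idx_fast, 'flag'] = 1
print(idx_fast.tolist())  # [4, 11]
```

The mask is computed in a single vectorized pass, so it should behave like the cumcount approach in the first answer while still giving you the row labels.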