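Suppose I have the following DataFrame:
customer  day
1         2016/12/21
1         2016/12/30
1         2017/1/2
2         2016/12/4
2         2016/12/10
3         2017/1/3
3         2017/1/5
I want to drop every customer who has a visit in 2017/1 or later.
The result I want is:
customer  day
2         2016/12/4
2         2016/12/10
I tried grouping by customer: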
df.groupby(df.customer)
or
df[df.day.dt.year<=2017]
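But I can't figure out how to drop them. I suspect I need to iterate over the customers.
How can I remove customers that meet a specific condition?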
Answer 0 (score: 3)
Use filter:
In [5653]: df.groupby('customer').filter(lambda x: ~(x.day>'2017/1/1').any())
Out[5653]:
   customer         day
3         2   2016/12/4
4         2  2016/12/10
Or,
In [5654]: df.groupby('customer').filter(lambda x: x.day.le('2017/1/1').all())
Out[5654]:
   customer         day
3         2   2016/12/4
4         2  2016/12/10
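For anyone who wants to run this end to end, here is a minimal self-contained sketch of the filter approach. The sample data is copied from the question, but I parse `day` to datetime and write the dates in ISO form so the comparison is chronological rather than string-based; the `cutoff` and `kept` names are my own.

```python
import pandas as pd

# Sample data from the question, with `day` parsed to datetime
# (dates written in ISO form; that choice is an assumption of this sketch).
df = pd.DataFrame({
    "customer": [1, 1, 1, 2, 2, 3, 3],
    "day": pd.to_datetime([
        "2016-12-21", "2016-12-30", "2017-01-02",
        "2016-12-04", "2016-12-10", "2017-01-03", "2017-01-05",
    ]),
})

cutoff = pd.Timestamp("2017-01-01")  # illustrative name for the cutoff date

# Keep all rows of a customer only if none of that customer's visits
# falls after the cutoff.
kept = df.groupby("customer").filter(lambda g: not (g["day"] > cutoff).any())
print(kept)
```

On this data only customer 2's rows survive, and they keep their original index labels.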
Answer 1 (score: 3)
A faster, non-groupby solution:
df = df[~df['customer'].isin(df.loc[df['day'] > '2017-01-01', 'customer'])]
print(df)
   customer        day
3         2 2016-12-04
4         2 2016-12-10
Details (evaluated on the original df, before the reassignment above):
print (df.loc[df['day'] > '2017-01-01', 'customer'])
2 1
5 3
6 3
Name: customer, dtype: int64
Thanks to Zero for the idea:
df = df[~df['customer'].isin(df.loc[df['day'] > '2017-01-01', 'customer'].unique())]
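The same idea split into two steps, as a runnable sketch. I assume `day` is already a datetime column; `cutoff`, `late_customers` and `kept` are names I made up for illustration.

```python
import pandas as pd

# Sample data from the question, with `day` as datetime (assumed).
df = pd.DataFrame({
    "customer": [1, 1, 1, 2, 2, 3, 3],
    "day": pd.to_datetime([
        "2016-12-21", "2016-12-30", "2017-01-02",
        "2016-12-04", "2016-12-10", "2017-01-03", "2017-01-05",
    ]),
})

cutoff = pd.Timestamp("2017-01-01")

# Step 1: collect the customers that have at least one visit after the cutoff.
late_customers = df.loc[df["day"] > cutoff, "customer"].unique()

# Step 2: keep only the rows whose customer is not in that set.
kept = df[~df["customer"].isin(late_customers)]
print(kept)
```

Writing the result to a new variable instead of overwriting `df` keeps the original frame around for inspecting the intermediate step.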
Timings:
import numpy as np
import pandas as pd

np.random.seed(123)
N = 10000
L = pd.date_range('2016-01-01', periods=400)
# 10,000 random visits spread over 1,000 customers and 400 days
df = pd.DataFrame({'day': np.random.choice(L, N),
                   'customer': np.random.randint(1000, size=N)})
df = df.sort_values(['customer', 'day'])
print(df)
In [357]: %timeit df[~df['customer'].isin(df.loc[df['day'] > '2017-01-01', 'customer'])]
1000 loops, best of 3: 932 µs per loop
In [358]: %timeit df[~df['customer'].isin(df.loc[df['day'] > '2017-01-01', 'customer'].unique())]
1000 loops, best of 3: 987 µs per loop
In [359]: %timeit df.groupby('customer').filter(lambda x: ~(x.day>'2017/1/1').any())
1 loop, best of 3: 397 ms per loop
In [360]: %timeit df.groupby('customer').filter(lambda x: x.day.le('2017/1/1').all())
1 loop, best of 3: 432 ms per loop
In [361]: %timeit df.groupby('customer').filter(lambda x : (x.day<'2017-01-01').all())
1 loop, best of 3: 394 ms per loop
In [362]: %timeit df.loc[~df.customer.isin(df.loc[df.day>'2017-01-01',].customer),:]
1000 loops, best of 3: 1.25 ms per loop
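def wen(df):
    # Relies on df being sorted by day within each customer (as it is above),
    # so the last row per customer is that customer's most recent visit.
    df1 = df.drop_duplicates(['customer'], keep='last')
    return df.loc[df.customer.isin(df1.loc[df1.day < '2017-01-01', 'customer']), :]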
In [363]: %timeit (wen(df))
1000 loops, best of 3: 1.81 ms per loop
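If you are not in IPython, a rough equivalent of the comparison above can be sketched with the standard `timeit` module. The helper names `with_isin` and `with_filter` are mine, and the absolute numbers will differ by machine and pandas version, so treat this only as a way to check the relative gap yourself.

```python
import timeit

import numpy as np
import pandas as pd

# Same synthetic data as in the timings above.
np.random.seed(123)
N = 10000
days = pd.date_range("2016-01-01", periods=400)
df = pd.DataFrame({
    "day": np.random.choice(days, N),
    "customer": np.random.randint(1000, size=N),
}).sort_values(["customer", "day"])

cutoff = pd.Timestamp("2017-01-01")


def with_isin(frame):
    # Vectorized: find customers with a late visit, then exclude them.
    late = frame.loc[frame["day"] > cutoff, "customer"]
    return frame[~frame["customer"].isin(late)]


def with_filter(frame):
    # Calls a Python lambda once per customer group.
    return frame.groupby("customer").filter(
        lambda g: not (g["day"] > cutoff).any()
    )


# Sanity check that both approaches select the same rows.
via_isin = with_isin(df)
via_filter = with_filter(df)
same_rows = via_isin.sort_index().equals(via_filter.sort_index())
print("same rows:", same_rows)

for name, func in [("isin", with_isin), ("groupby.filter", with_filter)]:
    seconds = timeit.timeit(lambda: func(df), number=3) / 3
    print(f"{name}: {seconds * 1000:.2f} ms per call")
```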
Answer 2 (score: 1)
Method 1: same as Jez's
df.loc[~df.customer.isin(df.loc[df.day>'2017-01-01',].customer),:]
Out[74]:
   customer        day
3         2 2016-12-04
4         2 2016-12-10
Method 2: Zero's approach
df.groupby('customer').filter(lambda x : (x.day<'2017-01-01').all())
Out[79]:
   customer        day
3         2 2016-12-04
4         2 2016-12-10
Method 3 (assumes the rows are sorted by day within each customer)
df1=df.drop_duplicates(['customer'],keep='last')
df.loc[df.customer.isin(df1.loc[df1.day<'2017-01-01','customer']),:]
Out[88]:
   customer        day
3         2 2016-12-04
4         2 2016-12-10
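One caveat with Method 3: `drop_duplicates(..., keep='last')` picks the last row per customer in whatever order the frame happens to be in, so it only finds the latest visit if the rows are sorted by day. A hedged sketch that sorts first, using a deliberately shuffled copy of the sample data (the variable names are mine, and I use `<=` against the cutoff to match the filter answers):

```python
import pandas as pd

# Sample data with customer 1's latest visit placed first on purpose.
df = pd.DataFrame({
    "customer": [1, 1, 1, 2, 2, 3, 3],
    "day": pd.to_datetime([
        "2017-01-02", "2016-12-21", "2016-12-30",
        "2016-12-04", "2016-12-10", "2017-01-03", "2017-01-05",
    ]),
})

cutoff = pd.Timestamp("2017-01-01")

# Without sorting, the "last" row for customer 1 is 2016-12-30,
# which is not that customer's most recent visit.
unsorted_last = df.drop_duplicates("customer", keep="last")

# Sort so the last row per customer really is the most recent visit.
ordered = df.sort_values(["customer", "day"])
last_visit = ordered.drop_duplicates("customer", keep="last")

# Customers whose most recent visit is on or before the cutoff.
ok_customers = last_visit.loc[last_visit["day"] <= cutoff, "customer"]
kept = df[df["customer"].isin(ok_customers)]
print(kept)
```

If sorting is undesirable, comparing each customer's maximum day against the cutoff would avoid the ordering assumption altogether.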