How to eliminate whole groups from a DataFrame in Python

Time: 2017-10-24 14:30:00

Tags: python pandas dataframe

Suppose I have the following DataFrame:

customer    day  
1       2016/12/21
1       2016/12/30
1       2017/1/2
2       2016/12/4
2       2016/12/10
3       2017/1/3
3       2017/1/5
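For anyone who wants to reproduce this, the table above could be built roughly as follows. This is only a sketch: the variable name `df` matches the snippets in this post, but converting `day` to a datetime column is an assumption (the question does not say how the column is stored).

```python
import pandas as pd

# Sample data copied from the table above.
df = pd.DataFrame({
    "customer": [1, 1, 1, 2, 2, 3, 3],
    "day": ["2016/12/21", "2016/12/30", "2017/1/2",
            "2016/12/4", "2016/12/10", "2017/1/3", "2017/1/5"],
})

# Assumption: parse the strings into datetimes so that later comparisons
# are chronological rather than lexicographic.
df["day"] = pd.to_datetime(df["day"])
print(df)
```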

I want to remove every customer who visited in 2017/1 or later.

The result I want is:

customer   day
2        2016/12/4
2        2016/12/10

I tried grouping by customer.

df.groupby(df.customer)

df[df.day.dt.year<=2017]

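But I can't figure out how to eliminate them. I think I need to iterate over the customers.

Please tell me how to eliminate customers that meet a specific condition.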

3 Answers:

Answer 0: (score: 3)

Use filter

In [5653]: df.groupby('customer').filter(lambda x: ~(x.day>'2017/1/1').any())
Out[5653]:
   customer         day
3         2   2016/12/4
4         2  2016/12/10

Or,

In [5654]: df.groupby('customer').filter(lambda x: x.day.le('2017/1/1').all())
Out[5654]:
   customer         day
3         2   2016/12/4
4         2  2016/12/10
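A self-contained sketch of the same `groupby(...).filter(...)` idea follows, with two changes that are assumptions rather than part of the answer. First, the output above shows `day` printed as `2016/12/4`, which suggests the column held strings, so the comparison would have been lexicographic; here the column is converted with `pd.to_datetime`. Second, `not` is used instead of `~`, because `~` applied to a plain Python `bool` gives an integer (`~True == -2`) rather than a logical negation.

```python
import pandas as pd

df = pd.DataFrame({
    "customer": [1, 1, 1, 2, 2, 3, 3],
    "day": ["2016/12/21", "2016/12/30", "2017/1/2",
            "2016/12/4", "2016/12/10", "2017/1/3", "2017/1/5"],
})
df["day"] = pd.to_datetime(df["day"])  # assumption: compare as datetimes

cutoff = pd.Timestamp("2017-01-01")  # same cutoff as in the answer

# Keep a customer's rows only if none of that customer's visits
# fall after the cutoff.
kept = df.groupby("customer").filter(lambda g: not (g["day"] > cutoff).any())
print(kept)
```

Only the rows for customer 2 remain, with their original index labels 3 and 4.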

Answer 1: (score: 3)

A faster solution than groupby:

df = df[~df['customer'].isin(df.loc[df['day'] > '2017-01-01', 'customer'])]
print (df)
   customer        day
3         2 2016-12-04
4         2 2016-12-10

Details:

print (df.loc[df['day'] > '2017-01-01', 'customer'])
2    1
5    3
6    3
Name: customer, dtype: int64

Thanks to Zero for the idea:

df = df[~df['customer'].isin(df.loc[df['day'] > '2017-01-01', 'customer'].unique())]
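Broken into steps, the `isin` approach might look like the sketch below. The intermediate names (`mask_late`, `late_customers`, `result`) are made up for illustration, and the datetime conversion is an assumption about how `day` is stored.

```python
import pandas as pd

df = pd.DataFrame({
    "customer": [1, 1, 1, 2, 2, 3, 3],
    "day": ["2016/12/21", "2016/12/30", "2017/1/2",
            "2016/12/4", "2016/12/10", "2017/1/3", "2017/1/5"],
})
df["day"] = pd.to_datetime(df["day"])  # assumption: datetime column

cutoff = pd.Timestamp("2017-01-01")

# Step 1: flag the rows that fall after the cutoff.
mask_late = df["day"] > cutoff

# Step 2: collect the customers owning at least one flagged row.
late_customers = df.loc[mask_late, "customer"].unique()

# Step 3: keep only rows whose customer is not in that set.
result = df[~df["customer"].isin(late_customers)]
print(result)
```

Because every step is a vectorized operation over whole columns, no Python-level function is called per group, which is the likely reason this is faster than `groupby(...).filter(...)`.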

Timings:

import numpy as np
import pandas as pd

np.random.seed(123)
N = 10000
L = pd.date_range('2016-01-01', periods=400)
df = pd.DataFrame({'day': np.random.choice(L, N),
                   'customer': np.random.randint(1000, size=N)})
# Sort by customer, then by day within each customer
df = df.sort_values(['customer', 'day'])
print(df)

In [357]: %timeit df[~df['customer'].isin(df.loc[df['day'] > '2017-01-01', 'customer'])]
1000 loops, best of 3: 932 µs per loop

In [358]: %timeit df[~df['customer'].isin(df.loc[df['day'] > '2017-01-01', 'customer'].unique())]
1000 loops, best of 3: 987 µs per loop

In [359]: %timeit df.groupby('customer').filter(lambda x: ~(x.day>'2017/1/1').any())
1 loop, best of 3: 397 ms per loop

In [360]: %timeit df.groupby('customer').filter(lambda x: x.day.le('2017/1/1').all())
1 loop, best of 3: 432 ms per loop

In [361]: %timeit df.groupby('customer').filter(lambda x : (x.day<'2017-01-01').all())
1 loop, best of 3: 394 ms per loop

In [362]: %timeit df.loc[~df.customer.isin(df.loc[df.day>'2017-01-01',].customer),:]
1000 loops, best of 3: 1.25 ms per loop

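def wen(df):
    # Relies on df being sorted by day within each customer, so that
    # keep='last' selects each customer's most recent visit.
    df1 = df.drop_duplicates(['customer'], keep='last')
    return df.loc[df.customer.isin(df1.loc[df1.day < '2017-01-01', 'customer']), :]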


In [363]: %timeit (wen(df))
1000 loops, best of 3: 1.81 ms per loop
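Outside IPython, the comparison could be reproduced roughly as below. This is a sketch, not the original benchmark: the helper names `with_isin` and `with_filter` are hypothetical, the standard `timeit` module stands in for `%timeit`, and absolute numbers will depend on the machine and the pandas version. It also checks that both approaches return the same rows.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(123)
N = 10000
days = pd.date_range("2016-01-01", periods=400)
df = pd.DataFrame({"day": np.random.choice(days, N),
                   "customer": np.random.randint(1000, size=N)})
df = df.sort_values(["customer", "day"])

cutoff = pd.Timestamp("2017-01-01")


def with_isin(frame):
    # Vectorized: find customers with a late visit, then exclude them.
    late = frame.loc[frame["day"] > cutoff, "customer"]
    return frame[~frame["customer"].isin(late)]


def with_filter(frame):
    # Calls the lambda once per customer group.
    return frame.groupby("customer").filter(
        lambda g: not (g["day"] > cutoff).any())


# Both approaches should select exactly the same rows.
same = with_isin(df).sort_index().equals(with_filter(df).sort_index())
print("same rows:", same)

for fn in (with_isin, with_filter):
    seconds = timeit.timeit(lambda: fn(df), number=3) / 3
    print(fn.__name__, f"{seconds * 1000:.2f} ms per call")
```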

Answer 2: (score: 1)

Method 1: same as Jez's

df.loc[~df.customer.isin(df.loc[df.day>'2017-01-01',].customer),:]
Out[74]: 
   customer        day
3         2 2016-12-04
4         2 2016-12-10

Method 2: Zero's method

df.groupby('customer').filter(lambda x : (x.day<'2017-01-01').all())
Out[79]: 
   customer        day
3         2 2016-12-04
4         2 2016-12-10

Method 3

df1=df.drop_duplicates(['customer'],keep='last')

df.loc[df.customer.isin(df1.loc[df1.day<'2017-01-01','customer']),:]

Out[88]: 
   customer        day
3         2 2016-12-04
4         2 2016-12-10
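Method 3 only works when each customer's rows are ordered by day, because `keep='last'` simply takes the last row it sees per customer. A variant that sorts explicitly might look like the sketch below. Two things here are assumptions rather than part of the answer: the input is shuffled on purpose to show that ordering no longer matters, and `<=` is used so that a visit exactly on the cutoff is treated the same way as in the `>`-based methods (the answer's `<` would differ only in that edge case).

```python
import pandas as pd

df = pd.DataFrame({
    "customer": [1, 1, 1, 2, 2, 3, 3],
    "day": ["2016/12/21", "2016/12/30", "2017/1/2",
            "2016/12/4", "2016/12/10", "2017/1/3", "2017/1/5"],
})
df["day"] = pd.to_datetime(df["day"])  # assumption: datetime column

# Shuffle the rows to simulate unsorted input.
df = df.sample(frac=1, random_state=0)

cutoff = pd.Timestamp("2017-01-01")

# Sort so that the last row per customer is the most recent visit.
latest = (df.sort_values(["customer", "day"])
            .drop_duplicates("customer", keep="last"))

# Customers whose most recent visit is on or before the cutoff.
keep_ids = latest.loc[latest["day"] <= cutoff, "customer"]

result = df[df["customer"].isin(keep_ids)]
print(result.sort_values("day"))
```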