Using Pandas, how do I filter my DataFrame so that only the days whose transaction total is > N are shown?
import pandas as pd
data = [
["2017-01-01 00:00:01.012345", 'Jen', 1.01],
["2017-01-01 01:00:00.012345", 'Joe', 3.02],
["2017-02-01 00:00:00.012345", 'Jen', 2.02],
["2017-02-01 02:00:00.012345", 'Joe', 0.02],
["2017-03-01 03:00:00.012345", 'Jen', 3.02],
["2017-03-01 04:00:00.012345", 'Joe', 4.04],
["2017-03-01 05:00:01.012345", 'Jen', 5.01]]
df = pd.DataFrame({
'trx_time': list(zip(*data))[0],
'agent': list(zip(*data))[1],
'trx_amount': list(zip(*data))[2]})
df['day'] = df['trx_time'].apply(lambda x: pd.to_datetime(x).date())
grouped = df.groupby(['day', 'agent'])
by_day_df = grouped[['trx_amount']].aggregate('sum')\
.rename(columns = lambda x: 'day_tl_' + x)\
.join(pd.DataFrame(grouped.size(), columns=['trx_count']))
print (by_day_df)
Output:
                  day_tl_trx_amount  trx_count
day        agent
2017-01-01 Jen                 1.01          1
           Joe                 3.02          1
2017-02-01 Jen                 2.02          1
           Joe                 0.02          1
2017-03-01 Jen                 8.03          2
           Joe                 4.04          1
So after filtering, I don't want any of the 2017-02-01 rows to be shown, because the total for that day is < 3.
Can .filter() be used for this?
Answer 0 (score: 4)
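Group by the first level of the index and take the sum of the day_tl_trx_amount column, then find the dates whose sum is >= 3:
idx = by_day_df.groupby(level='day')[['day_tl_trx_amount']].sum() \
    .query('day_tl_trx_amount >= 3').index.tolist()
Then filter the first dataframe with those dates: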
by_day_df.loc[idx]
                  day_tl_trx_amount  trx_count
day        agent
2017-01-01 Jen                 1.01          1
           Joe                 3.02          1
2017-03-01 Jen                 8.03          2
           Joe                 4.04          1
Using unstack and sum is a bit more elegant. This is my preferred solution:
s = by_day_df.unstack().day_tl_trx_amount.sum(axis=1).ge(3)
by_day_df.loc[s.index[s].tolist()]
                  day_tl_trx_amount  trx_count
day        agent
2017-01-01 Jen                 1.01          1
           Joe                 3.02          1
2017-03-01 Jen                 8.03          2
           Joe                 4.04          1
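Since the question asks about .filter(), here is a minimal sketch of how GroupBy.filter could be applied to the aggregated frame. It rebuilds by_day_df from the question's data so it runs on its own; the variable names N and filtered are assumptions made for illustration, and the threshold of 3 is taken from the example in the question.

```python
import pandas as pd

data = [
    ["2017-01-01 00:00:01.012345", "Jen", 1.01],
    ["2017-01-01 01:00:00.012345", "Joe", 3.02],
    ["2017-02-01 00:00:00.012345", "Jen", 2.02],
    ["2017-02-01 02:00:00.012345", "Joe", 0.02],
    ["2017-03-01 03:00:00.012345", "Jen", 3.02],
    ["2017-03-01 04:00:00.012345", "Joe", 4.04],
    ["2017-03-01 05:00:01.012345", "Jen", 5.01],
]
df = pd.DataFrame(data, columns=["trx_time", "agent", "trx_amount"])
df["day"] = pd.to_datetime(df["trx_time"]).dt.date

# Per-day, per-agent totals and transaction counts, as in the question.
grouped = df.groupby(["day", "agent"])
by_day_df = (
    grouped["trx_amount"].sum().to_frame("day_tl_trx_amount")
    .join(grouped.size().to_frame("trx_count"))
)

N = 3  # assumed threshold, taken from the question's example

# Keep every row of a day only if that day's total amount reaches N.
filtered = by_day_df.groupby(level="day").filter(
    lambda g: g["day_tl_trx_amount"].sum() >= N
)
print(filtered)
```

The function passed to filter receives one day's rows at a time and returns a single boolean, so the whole day is either kept or dropped. To filter on the number of transactions instead of the amount, the condition could be swapped for g["trx_count"].sum() > N.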
Answer 1 (score: 2)
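I think you can groupby on the first level of the index, aggregate with sum, and finally drop the rows for the days that fall below the threshold: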
df1 = by_day_df.groupby(level=0)['day_tl_trx_amount'].sum()
idx = df1[df1 < 3].index
print (idx)
Index([2017-02-01], dtype='object', name='day')
print (by_day_df.drop(idx, level=0))
                  day_tl_trx_amount  trx_count
day        agent
2017-01-01 Jen                 1.01          1
           Joe                 3.02          1
2017-03-01 Jen                 8.03          2
           Joe                 4.04          1
A similar solution selects the wanted dates with loc:
df1 = by_day_df.groupby(level=0)['day_tl_trx_amount'].sum()
print (df1)
day
2017-01-01 4.03
2017-02-01 2.04
2017-03-01 12.07
Name: day_tl_trx_amount, dtype: float64
idx = df1[df1 >= 3].index.tolist()
print (idx)
[datetime.date(2017, 1, 1), datetime.date(2017, 3, 1)]
print (by_day_df.loc[idx])
                  day_tl_trx_amount  trx_count
day        agent
2017-01-01 Jen                 1.01          1
           Joe                 3.02          1
2017-03-01 Jen                 8.03          2
           Joe                 4.04          1
Your code can also be improved a little, mainly by using Series.to_frame, which creates a DataFrame from a Series:
#vectorized to_datetime and then dt.date
df['day'] = pd.to_datetime(df['trx_time']).dt.date
grouped = df.groupby(['day', 'agent'])
by_day_df = grouped.trx_amount.sum().to_frame() \
.rename(columns = lambda x: 'day_tl_' + x)\
.join(grouped.size().to_frame('trx_count'))
print (by_day_df)
                  day_tl_trx_amount  trx_count
day        agent
2017-01-01 Jen                 1.01          1
           Joe                 3.02          1
2017-02-01 Jen                 2.02          1
           Joe                 0.02          1
2017-03-01 Jen                 8.03          2
           Joe                 4.04          1
Answer 2 (score: 1)
I tried to solve it with a mask:
by_day_df.reset_index(inplace=True)
mask=by_day_df.groupby('day')['day_tl_trx_amount'].sum()>3
by_day_df.set_index('day',inplace=True)
by_day_df=by_day_df[mask]
by_day_df.reset_index(inplace=True)
by_day_df.set_index(['day','agent'],inplace=True)
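A mask can also be built without resetting and re-setting the index. The following is a minimal sketch, assuming the aggregated frame from the question (rebuilt here by hand, with the day labels written as plain strings for brevity) and the >= 3 threshold used in the other answers. The names day_total, mask and result are introduced only for illustration.

```python
import pandas as pd

# Hand-built copy of the aggregated frame from the question.
# Assumption: day labels are plain strings here instead of datetime.date objects.
index = pd.MultiIndex.from_tuples(
    [
        ("2017-01-01", "Jen"), ("2017-01-01", "Joe"),
        ("2017-02-01", "Jen"), ("2017-02-01", "Joe"),
        ("2017-03-01", "Jen"), ("2017-03-01", "Joe"),
    ],
    names=["day", "agent"],
)
by_day_df = pd.DataFrame(
    {
        "day_tl_trx_amount": [1.01, 3.02, 2.02, 0.02, 8.03, 4.04],
        "trx_count": [1, 1, 1, 1, 2, 1],
    },
    index=index,
)

# transform('sum') broadcasts each day's total back onto every row of that day,
# so the result has the same index as by_day_df and can be used as a mask directly.
day_total = by_day_df.groupby(level="day")["day_tl_trx_amount"].transform("sum")
mask = day_total >= 3

result = by_day_df[mask]
print(result)
```

Because the mask shares the frame's MultiIndex, the (day, agent) index is left in place and no inplace operations are needed.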