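Using a Pandas DataFrame, I want to select (keep) only the rows whose transaction description (TRNDESCR) occurs in at least 3 different months. I tried some code, but it does not work correctly.
Here is a sample dataset: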
ACNO TIME TRNCD TRNDESCR TRNAMT
0 85 2018-12-19 20:40:00 109 Ib Transfer To Phoutthalom Syh Account No:123 -20000
1 85 2018-12-19 21:15:00 109 Ib Transfer To Phoutthalom Syh Account No:123 -25000
2 85 2018-12-20 15:30:00 109 Ib Transfer To Thongsavath Pra Account No:124 -10000
3 85 2018-12-22 12:30:00 209 Bil Payment -500
4 85 2018-12-25 15:34:00 109 Ib Transfer To Phoutthalom Syh Account No:123 -60000
5 85 2019-01-22 12:30:00 209 Bil Payment -501
6 85 2019-01-23 12:50:00 109 Ib Transfer To Sarah Account No:199 -3000
7 85 2019-01-31 08:59:00 109 Ib Transfer To Thongsavath Pra Account No:124 -650000
8 85 2019-02-02 12:30:00 109 Ib Transfer To Sarah Account No:199 -600
9 85 2019-02-03 15:02:00 109 Ib Transfer To Phoutthalom Syh Account No:123 -60000
10 85 2019-02-04 15:21:00 109 Ib Transfer To Thongsavath Pra Account No:124 -863000
11 85 2019-02-05 15:30:00 209 Bil Payment -600
Here is the expected result:
ACNO TIME TRNCD TRNDESCR TRNAMT
0 85 2018-12-20 15:30:00 109 Ib Transfer To Thongsavath Pra Account No:124 -10000
1 85 2018-12-22 12:30:00 209 Bil Payment -500
2 85 2019-01-22 12:30:00 209 Bil Payment -501
3 85 2019-01-31 08:59:00 109 Ib Transfer To Thongsavath Pra Account No:124 -650000
4 85 2019-02-04 15:21:00 109 Ib Transfer To Thongsavath Pra Account No:124 -863000
5 85 2019-02-05 15:30:00 209 Bil Payment -600
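Answer 0 (score: 0)
Pick the column to use as the indicator (TRNDESCR in your example) and use TIME as the filter. You can then drop duplicate (month, TRNDESCR) pairs, group by TRNDESCR, and count the number of months in which each transaction description occurs.
Example: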
import pandas as pd
df = pd.DataFrame()
df['TIME'] = ["2018-12-19", "2018-12-20", "2019-01-20", "2019-02-06",
"2018-12-18", "2018-12-02", "2019-01-03", "2019-02-06"]
df['TRNDESCR'] = ["ib1", "ib2", "ib2", "ib2",
"ib2", "ib3", "ib3", "ib3"]
df['ACNO'] = 85
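df['TIME'] = pd.to_datetime(df['TIME'])
# Month number only; the year is ignored, which is fine while the data spans less than 12 months
df['MONTH'] = df['TIME'].dt.month
# One row per (MONTH, TRNDESCR) pair, then count the distinct months per description
count_month = df[['MONTH', 'TRNDESCR']].drop_duplicates(['MONTH', 'TRNDESCR'], keep="last").groupby('TRNDESCR')['MONTH'].count()
# Keep the rows whose description occurs in at least 3 distinct months
df[df['TRNDESCR'].isin(count_month[count_month >= 3].index)]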
TIME TRNDESCR ACNO MONTH
1 2018-12-20 ib2 85 12
2 2019-01-20 ib2 85 1
3 2019-02-06 ib2 85 2
4 2018-12-18 ib2 85 12
5 2018-12-02 ib3 85 12
6 2019-01-03 ib3 85 1
7 2019-02-06 ib3 85 2
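Note that dt.month keeps only the month number, so December 2018 and December 2019 would be counted as the same month. A minimal sketch of a year-aware variant, assuming the data may span more than one year: it reuses the toy data from the example above and counts distinct year-month periods instead. The PERIOD column and the names months_per_descr, keep and result are illustrative, not part of the original answer.

```python
import pandas as pd

# Same toy data as in the example above
df = pd.DataFrame({
    "TIME": pd.to_datetime(["2018-12-19", "2018-12-20", "2019-01-20", "2019-02-06",
                            "2018-12-18", "2018-12-02", "2019-01-03", "2019-02-06"]),
    "TRNDESCR": ["ib1", "ib2", "ib2", "ib2", "ib2", "ib3", "ib3", "ib3"],
})
df["ACNO"] = 85

# Year-month period (e.g. 2018-12), so the same month in different years stays distinct
df["PERIOD"] = df["TIME"].dt.to_period("M")

# Number of distinct year-month periods per description
months_per_descr = df.groupby("TRNDESCR")["PERIOD"].nunique()

# Descriptions seen in at least 3 distinct periods, and the rows that carry them
keep = months_per_descr[months_per_descr >= 3].index
result = df[df["TRNDESCR"].isin(keep)]
print(result)
```

On this toy data the selection matches the one above (the ib2 and ib3 rows); the two approaches only diverge once the same month number shows up in different years.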
Answer 1 (score: 0)
Here is my solution:
import pandas as pd
df = pd.read_excel("df_85.xlsx")
df_copy = df.copy()
# introduce new column
time = pd.DatetimeIndex(df_copy.TIME)
df_copy['yearmonth'] = time.year.astype(str) + time.month.astype(str)
# count the distinct year-months within each TRNDESCR group
new_df = df_copy.groupby(['TRNDESCR']).yearmonth.nunique().to_frame().reset_index()
new_df = new_df[new_df.yearmonth >= 3]
# keep the rows whose TRNDESCR appears in new_df
output_df = df[df.TRNDESCR.isin(new_df.TRNDESCR.values)]
print(output_df)
Output:
ACNO YEAR MONTH TIME TRNCD TRNDESCR TRNAMT
2 85 2018 12 2018-12-20 15:30:00 109 Ib Transfer To Thongsavath Pra Account No:124 -10000
3 85 2018 12 2018-12-22 12:30:00 209 Bil Payment -500
5 85 2018 1 2019-01-22 12:30:00 209 Bil Payment -501
7 85 2019 1 2019-01-31 08:59:00 109 Ib Transfer To Thongsavath Pra Account No:124 -650000
10 85 2019 2 2019-02-04 15:21:00 109 Ib Transfer To Thongsavath Pra Account No:124 -863000
11 85 2019 2 2019-02-05 15:30:00 209 Bil Payment -600
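It works by creating a new column "yearmonth" (the concatenation of year and month), then grouping by TRNDESCR and counting the number of unique year-month values in each group. Only descriptions with at least 3 distinct year-months are kept.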
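The same idea can be tried without the Excel file. The following is a sketch under stated assumptions, not the answerer's exact code: the frame is rebuilt by hand from the sample rows in the question (TRNCD omitted for brevity), the year-month key is a zero-padded string from strftime, and groupby(...).transform("nunique") attaches the month count to every row so that no intermediate lookup frame is needed. The names ym and n_months are illustrative.

```python
import pandas as pd

P = "Ib Transfer To Phoutthalom Syh Account No:123"
T = "Ib Transfer To Thongsavath Pra Account No:124"
S = "Ib Transfer To Sarah Account No:199"
B = "Bil Payment"

# Sample rows copied from the question: (TIME, TRNDESCR, TRNAMT)
rows = [
    ("2018-12-19 20:40:00", P, -20000),
    ("2018-12-19 21:15:00", P, -25000),
    ("2018-12-20 15:30:00", T, -10000),
    ("2018-12-22 12:30:00", B, -500),
    ("2018-12-25 15:34:00", P, -60000),
    ("2019-01-22 12:30:00", B, -501),
    ("2019-01-23 12:50:00", S, -3000),
    ("2019-01-31 08:59:00", T, -650000),
    ("2019-02-02 12:30:00", S, -600),
    ("2019-02-03 15:02:00", P, -60000),
    ("2019-02-04 15:21:00", T, -863000),
    ("2019-02-05 15:30:00", B, -600),
]
df = pd.DataFrame(rows, columns=["TIME", "TRNDESCR", "TRNAMT"])
df["TIME"] = pd.to_datetime(df["TIME"])
df["ACNO"] = 85

# Zero-padded year-month key such as "2018-12"
ym = df["TIME"].dt.strftime("%Y-%m")

# For every row, the number of distinct year-months its description occurs in
n_months = ym.groupby(df["TRNDESCR"]).transform("nunique")

# Keep rows whose description occurs in at least 3 distinct months
output_df = df[n_months >= 3]
print(output_df)
```

If several accounts are present and the count should be per account, grouping by both ACNO and TRNDESCR would be the natural extension.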