I want to merge two data frames on a unique number and on dates that match within +/-7 days.
df1
Number  Report         DateDone
1       some words     13/1/2021
1       more stuff     21/8/2021
44      balbla         11/4/2020
2       gobbledy bla   01/03/2019
44      rara rasputin  13/10/2021
44      tree frogs     11/10/2010
df2
Number  Report            DateDone
1       hocum poklum      11/1/2021
1       mjimmeny cricket  21/8/2021
44      it wasnt me       11/2/2020
2       its not really    6/03/2019
44      im innocent       12/10/2021
44      bullfrogs         11/01/2010
Desired output:
Number.df1  Report.df1     DateDone.df1  Number.df2  Report.df2        DateDone.df2
1           some words     13/1/2021     1           hocum poklum      11/1/2021
1           more stuff     21/8/2021     1           mjimmeny cricket  21/8/2021
2           gobbledy bla   01/03/2019    2           its not really    6/03/2019
44          rara rasputin  13/10/2021    44          im innocent       12/10/2021
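To make the example reproducible, the two frames shown above could be constructed as follows. This is only a sketch: it assumes the dates are held as day-first strings exactly as displayed, and it reuses the names df1 and df2 from the question.

```python
import pandas as pd

# Sample data copied from the tables above; DateDone is still a day-first string here
df1 = pd.DataFrame({
    "Number": [1, 1, 44, 2, 44, 44],
    "Report": ["some words", "more stuff", "balbla",
               "gobbledy bla", "rara rasputin", "tree frogs"],
    "DateDone": ["13/1/2021", "21/8/2021", "11/4/2020",
                 "01/03/2019", "13/10/2021", "11/10/2010"],
})
df2 = pd.DataFrame({
    "Number": [1, 1, 44, 2, 44, 44],
    "Report": ["hocum poklum", "mjimmeny cricket", "it wasnt me",
               "its not really", "im innocent", "bullfrogs"],
    "DateDone": ["11/1/2021", "21/8/2021", "11/2/2020",
                 "6/03/2019", "12/10/2021", "11/01/2010"],
})

print(df1)
print(df2)
```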
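I was planning to use a SQL merge similar to the one I found here, but I am struggling to work out how to combine the number match with the date range. Do I need to calculate the dates 7 days before and after DateDone in df1? Surely there is a more efficient way than having to compute two new columns first?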
qry = '''
    select
        df1.DateDone_start TermStart,
        df1.DateDone_end TermEnd,
        df2.DateDone df2Start,
        df1.Number,
        df2.Number
    from
        df1 join df2 on
        date between df1.DateDone_start and df1.DateDone_end join df1 on
        df1.Number = df2.Number
    '''
df = pd.read_sql_query(qry, conn)
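For comparison, a join of this kind can be written without precomputed start and end columns by putting both conditions into a single ON clause. The sketch below is one possible way to do it, under stated assumptions: the frames are loaded into an in-memory SQLite database, and the dates are stored as ISO strings so that SQLite's julianday() can compare them. The table names, column aliases, and the sql_result variable are illustrative, not part of the original setup.

```python
import sqlite3

import pandas as pd

df1 = pd.DataFrame({
    "Number": [1, 1, 44, 2, 44, 44],
    "Report": ["some words", "more stuff", "balbla",
               "gobbledy bla", "rara rasputin", "tree frogs"],
    "DateDone": ["13/1/2021", "21/8/2021", "11/4/2020",
                 "01/03/2019", "13/10/2021", "11/10/2010"],
})
df2 = pd.DataFrame({
    "Number": [1, 1, 44, 2, 44, 44],
    "Report": ["hocum poklum", "mjimmeny cricket", "it wasnt me",
               "its not really", "im innocent", "bullfrogs"],
    "DateDone": ["11/1/2021", "21/8/2021", "11/2/2020",
                 "6/03/2019", "12/10/2021", "11/01/2010"],
})

# Assumption: store dates as ISO strings (YYYY-MM-DD) so julianday() can parse them
for frame in (df1, df2):
    parsed = pd.to_datetime(frame["DateDone"], format="%d/%m/%Y")
    frame["DateDone"] = parsed.dt.strftime("%Y-%m-%d")

conn = sqlite3.connect(":memory:")
df1.to_sql("df1", conn, index=False)
df2.to_sql("df2", conn, index=False)

# Both conditions go in one ON clause: same Number, dates at most 7 days apart
qry = """
select
    df1.Number,
    df1.Report   as Report_df1,
    df1.DateDone as DateDone_df1,
    df2.Report   as Report_df2,
    df2.DateDone as DateDone_df2
from df1
join df2
  on df1.Number = df2.Number
 and abs(julianday(df1.DateDone) - julianday(df2.DateDone)) <= 7
"""
sql_result = pd.read_sql_query(qry, conn)
conn.close()
print(sql_result)
```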
Answer 0 (score: 1)
You can use .merge() on Number, then filter with .loc, keeping the rows where DateDone.df2 is .between() DateDone.df1 +/- 7 days, using +/-pd.DateOffset(days=7), as follows:
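import pandas as pd
# Parse the day-first date strings into datetimes
df1['DateDone'] = pd.to_datetime(df1['DateDone'], dayfirst=True)
df2['DateDone'] = pd.to_datetime(df2['DateDone'], dayfirst=True)
# Pair every df1 row with every df2 row that shares the same Number
df_merge = df1.merge(df2, on='Number', suffixes=('.df1', '.df2'))
# Keep only the pairs whose df2 date lies within 7 days of the df1 date
result = df_merge.loc[
    df_merge['DateDone.df2'].between(
        df_merge['DateDone.df1'] - pd.DateOffset(days=7),
        df_merge['DateDone.df1'] + pd.DateOffset(days=7))]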
Result:
print(result)
    Number     Report.df1  DateDone.df1        Report.df2  DateDone.df2
0        1     some words    2021-01-13      hocum poklum    2021-01-11
3        1     more stuff    2021-08-21  mjimmeny cricket    2021-08-21
8       44  rara rasputin    2021-10-13       im innocent    2021-10-12
13       2   gobbledy bla    2019-03-01    its not really    2019-03-06
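One detail worth knowing: .between() includes both endpoints by default, so a pair exactly 7 days apart is kept. The small sketch below illustrates this with made-up dates (they are not from the question's data); the names base, other, and within are illustrative.

```python
import pandas as pd

# Hypothetical dates chosen to sit on, and just outside, the 7-day boundary
base = pd.Series(pd.to_datetime(["2021-01-13"] * 3))
other = pd.Series(pd.to_datetime(["2021-01-06", "2021-01-20", "2021-01-21"]))

# Element-wise check: is each `other` date inside [base - 7 days, base + 7 days]?
within = other.between(base - pd.DateOffset(days=7),
                       base + pd.DateOffset(days=7))
print(within.tolist())  # [True, True, False]
```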
Answer 1 (score: 0)
Try merge, then keep only the rows whose dates are within 7 days of each other:
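# Pair rows that share the same Number
new_df = df1.merge(df2, on='Number', suffixes=('.df1', '.df2'))
# Keep only the pairs whose dates are at most 7 days apart, in either direction
new_df = new_df[
    abs(new_df['DateDone.df1'] - new_df['DateDone.df2']) <= pd.Timedelta(days=7)
]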
new_df:
    Number     Report.df1  DateDone.df1        Report.df2  DateDone.df2
0        1     some words    2021-01-13      hocum poklum    2021-01-11
3        1     more stuff    2021-08-21  mjimmeny cricket    2021-08-21
8       44  rara rasputin    2021-10-13       im innocent    2021-10-12
13       2   gobbledy bla    2019-03-01    its not really    2019-03-06
If 'DateDone' is not already a datetime column, convert it in both frames before merging:
df1['DateDone'] = pd.to_datetime(df1['DateDone'], format='%d/%m/%Y')
df2['DateDone'] = pd.to_datetime(df2['DateDone'], format='%d/%m/%Y')
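The explicit format (or dayfirst=True) matters because a string such as 01/03/2019 is ambiguous, and pandas reads it month-first unless told otherwise. A quick sketch of the difference, using one date from the sample data (the variable names are illustrative):

```python
import pandas as pd

ambiguous = "01/03/2019"

# With the explicit day-first format this is 1 March 2019
as_day_first = pd.to_datetime(ambiguous, format="%d/%m/%Y")
# With the default parsing this is read month-first, i.e. 3 January 2019
as_default = pd.to_datetime(ambiguous)

print(as_day_first.date())  # 2019-03-01
print(as_default.date())    # 2019-01-03
```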
Subtract the two datetime columns of the merged (not yet filtered) frame to get the signed duration for each pair:
new_df['DateDone.df1'] - new_df['DateDone.df2']
0 2 days
1 -220 days
2 222 days
3 0 days
4 60 days
5 -549 days
6 3743 days
7 610 days
8 1 days
9 4293 days
10 -3410 days
11 -4019 days
12 273 days
13 -5 days
dtype: timedelta64[ns]
Apply abs to drop the sign of each duration, then compare against the allowed window:
abs(new_df['DateDone.df1'] - new_df['DateDone.df2']) <= pd.Timedelta(days=7)
This yields a boolean mask that determines which rows to keep:
0 True
1 False
2 False
3 True
4 False
5 False
6 False
7 False
8 True
9 False
10 False
11 False
12 False
13 True
dtype: bool
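If only the closest match per df1 row is needed, pd.merge_asof is a possible alternative that avoids building every same-Number pair first. Note that this is a different technique from the merge-then-filter approach above: it returns at most one df2 row per df1 row, so it is not equivalent when several df2 rows fall inside the window. The sketch rebuilds the sample data so that it runs on its own; the names left, right, and nearest are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Number": [1, 1, 44, 2, 44, 44],
    "Report": ["some words", "more stuff", "balbla",
               "gobbledy bla", "rara rasputin", "tree frogs"],
    "DateDone": ["13/1/2021", "21/8/2021", "11/4/2020",
                 "01/03/2019", "13/10/2021", "11/10/2010"],
})
df2 = pd.DataFrame({
    "Number": [1, 1, 44, 2, 44, 44],
    "Report": ["hocum poklum", "mjimmeny cricket", "it wasnt me",
               "its not really", "im innocent", "bullfrogs"],
    "DateDone": ["11/1/2021", "21/8/2021", "11/2/2020",
                 "6/03/2019", "12/10/2021", "11/01/2010"],
})
for frame in (df1, df2):
    frame["DateDone"] = pd.to_datetime(frame["DateDone"], format="%d/%m/%Y")

# merge_asof requires both frames to be sorted by the "on" key
left = df1.rename(columns={"Report": "Report.df1"}).sort_values("DateDone")
right = (
    df2.rename(columns={"Report": "Report.df2"})
       .assign(**{"DateDone.df2": df2["DateDone"]})  # keep a copy of df2's date
       .sort_values("DateDone")
)

# For each df1 row, take the nearest df2 row with the same Number, within 7 days
nearest = pd.merge_asof(
    left, right,
    on="DateDone", by="Number",
    tolerance=pd.Timedelta(days=7), direction="nearest",
)

# Rows without a match inside the window have missing df2 columns; drop them
nearest = (
    nearest.dropna(subset=["DateDone.df2"])
           .rename(columns={"DateDone": "DateDone.df1"})
)
print(nearest)
```

On the sample data this keeps the same four pairs as the filter above, because no df1 row has more than one df2 candidate inside the window.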