Merge two dataframes on an id column and a date range

Time: 2021-05-28 10:57:40

Tags: python sql pandas

Goal:

I want to merge two dataframes on a unique number and on dates that match within +/-7 days.

Data:

df1

Number  Report         DateDone
1       some words     13/1/2021
1       more stuff     21/8/2021
44      balbla         11/4/2020
2       gobbledy bla   01/03/2019
44      rara rasputin  13/10/2021
44      tree frogs     11/10/2010

df2

Number  Report            DateDone
1       hocum poklum      11/1/2021
1       mjimmeny cricket  21/8/2021
44      it wasnt me       11/2/2020
2       its not really    6/03/2019
44      im innocent       12/10/2021
44      bullfrogs         11/01/2010

Expected result

Number.df1  Report.df1     DateDone.df1  Number.df2  Report.df2        DateDone.df2
1           some words     13/1/2021     1           hocum poklum      11/1/2021
1           more stuff     21/8/2021     1           mjimmeny cricket  21/8/2021
2           gobbledy bla   01/03/2019    2           its not really    6/03/2019
44          rara rasputin  13/10/2021    44          im innocent       12/10/2021

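I was planning to use a SQL merge similar to the one I found here, but I am struggling to work out how to merge on both the number and the date range. Do I need to compute the dates 7 days before and after DateDone in df1? Surely there is a more efficient way than having to compute two new columns first?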

qry = '''
    select  
        df1.DateDone_start TermStart,
        df1.DateDone_end TermEnd,
        df2.DateDone df2Start,
        df1.Number,
        df2.Number
    from
        df1 join df2 on
        date between df1.DateDone_start and df1.DateDone_end join df1 on
        df1.Number = df2.Number
    '''
df = pd.read_sql_query(qry, conn)

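2 Answers:

Answer 0 (score: 1)

You can use .merge() on Number and then filter with .loc, keeping the rows where DateDone.df2 is .between() DateDone.df1 - 7 days and DateDone.df1 + 7 days. The bounds are built with +/- pd.DateOffset(days=7), as follows: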

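import pandas as pd

# Parse the day-first date strings into datetimes
df1['DateDone'] = pd.to_datetime(df1['DateDone'], dayfirst=True)
df2['DateDone'] = pd.to_datetime(df2['DateDone'], dayfirst=True)

# Pair every df1 row with every df2 row that shares the same Number
df_merge = df1.merge(df2, on='Number', suffixes=('.df1', '.df2'))

# Keep only the pairs whose df2 date lies within 7 days of the df1 date
result = df_merge.loc[
             df_merge['DateDone.df2'].between(
                 df_merge['DateDone.df1'] - pd.DateOffset(days=7),
                 df_merge['DateDone.df1'] + pd.DateOffset(days=7))]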

Result:

print(result)

    Number     Report.df1 DateDone.df1        Report.df2 DateDone.df2
0        1     some words   2021-01-13      hocum poklum   2021-01-11
3        1     more stuff   2021-08-21  mjimmeny cricket   2021-08-21
8       44  rara rasputin   2021-10-13       im innocent   2021-10-12
13       2   gobbledy bla   2019-03-01    its not really   2019-03-06
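
For anyone who wants to run the approach end to end, here is a minimal self-contained sketch. It rebuilds the two sample frames from the question by hand and parses the dates with an explicit `format='%d/%m/%Y'` (an assumption that every date is day-first, as in the sample data). The names `merged`, `window` and `in_range` are only illustrative.

```python
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({
    "Number": [1, 1, 44, 2, 44, 44],
    "Report": ["some words", "more stuff", "balbla",
               "gobbledy bla", "rara rasputin", "tree frogs"],
    "DateDone": ["13/1/2021", "21/8/2021", "11/4/2020",
                 "01/03/2019", "13/10/2021", "11/10/2010"],
})
df2 = pd.DataFrame({
    "Number": [1, 1, 44, 2, 44, 44],
    "Report": ["hocum poklum", "mjimmeny cricket", "it wasnt me",
               "its not really", "im innocent", "bullfrogs"],
    "DateDone": ["11/1/2021", "21/8/2021", "11/2/2020",
                 "6/03/2019", "12/10/2021", "11/01/2010"],
})

# Assumption: all dates are written day/month/year
for frame in (df1, df2):
    frame["DateDone"] = pd.to_datetime(frame["DateDone"], format="%d/%m/%Y")

# Inner join on Number, producing every df1/df2 pairing per Number
merged = df1.merge(df2, on="Number", suffixes=(".df1", ".df2"))

# Keep the pairs whose df2 date falls inside the +/-7 day window (inclusive)
window = pd.DateOffset(days=7)
in_range = merged["DateDone.df2"].between(
    merged["DateDone.df1"] - window,
    merged["DateDone.df1"] + window,
)
result = merged.loc[in_range].reset_index(drop=True)

print(result)
```

Note that the merge creates every pairing per Number before filtering, so the intermediate frame can grow large when a Number repeats many times in both frames.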

Answer 1 (score: 0)

Try merge, then keep only the rows that are within 7 days of each other:

new_df = df1.merge(df2, on='Number', suffixes=('.df1', '.df2'))
new_df = new_df[
    abs(new_df['DateDone.df1'] - new_df['DateDone.df2']) <= pd.Timedelta(days=7)
    ]

new_df

    Number     Report.df1 DateDone.df1        Report.df2 DateDone.df2
0        1     some words   2021-01-13      hocum poklum   2021-01-11
3        1     more stuff   2021-08-21  mjimmeny cricket   2021-08-21
8       44  rara rasputin   2021-10-13       im innocent   2021-10-12
13       2   gobbledy bla   2019-03-01    its not really   2019-03-06

If not already done, convert 'DateDone' in both frames to datetime:

df1['DateDone'] = pd.to_datetime(df1['DateDone'], format='%d/%m/%Y')
df2['DateDone'] = pd.to_datetime(df2['DateDone'], format='%d/%m/%Y')
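
The day-first hint matters because a string such as 11/4/2020 is ambiguous on its own. The small sketch below, which uses one value from the question, shows how the default parse differs from the day-first ones. The variable names are only illustrative.

```python
import pandas as pd

raw = "11/4/2020"  # 11 April 2020 in the question's day-first data

month_first = pd.to_datetime(raw)                   # default reads month first
day_first = pd.to_datetime(raw, dayfirst=True)      # hint used in answer 0
explicit = pd.to_datetime(raw, format="%d/%m/%Y")   # explicit format used here

print(month_first.date(), day_first.date(), explicit.date())
# 2020-11-04 2020-04-11 2020-04-11
```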

Get the duration between the two datetimes (shown here for the merged frame before filtering):

new_df['DateDone.df1'] - new_df['DateDone.df2']
0        2 days
1     -220 days
2      222 days
3        0 days
4       60 days
5     -549 days
6     3743 days
7      610 days
8        1 days
9     4293 days
10   -3410 days
11   -4019 days
12     273 days
13      -5 days
dtype: timedelta64[ns]

Apply abs to remove the sign from the duration, then compare it with the desired duration:

abs(new_df['DateDone.df1'] - new_df['DateDone.df2']) <= pd.Timedelta(days=7)

The resulting boolean mask determines which rows are kept:

0      True
1     False
2     False
3      True
4     False
5     False
6     False
7     False
8      True
9     False
10    False
11    False
12    False
13     True
dtype: bool
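
The abs step can also be tried in isolation. The sketch below builds a small Series of signed day differences (a hand-picked subset of the values shown above, so the name `diffs` and its contents are illustrative) and applies the same comparison.

```python
import pandas as pd

# A few signed differences taken from the output above, in days
diffs = pd.Series(pd.to_timedelta([2, -220, 0, 60, 1, -5], unit="D"))

# abs() drops the sign, so -5 days and 5 days are treated alike
mask = diffs.abs() <= pd.Timedelta(days=7)

print(mask.tolist())
# [True, False, True, False, True, True]
```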