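I have two pandas DataFrames and I am trying to perform an advanced join. In the example below, I want to join df to df2 on my_key, keeping only the rows where the date ranges from_dt to to_dt overlap. How can I do this with pandas?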
Example df:
value, my_key, from_dt, to_dt
1, a, 2007-01-01, 2009-02-01
2, b, 2001-01-01, 2011-01-01
3, c, 2015-01-01, 2020-01-01
Example df2:
my_key, value2, from_dt, to_dt
a, a1, 2007-01-01, 2008-01-01
a, a2, 2008-01-01, 2010-01-01
b, b1, 2009-01-01, 2015-01-01
c, c1, 2011-01-01, 2011-12-30
Desired result:
value, value2, from_dt, to_dt
1, a1, 2007-01-01, 2008-01-01
1, a2, 2008-01-01, 2009-02-01
2, b1, 2009-01-01, 2011-01-01
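Answer 0 (score: 2)
@Jianxun's answer is great. Note that if your data is in CSV format, as the question seems to suggest, you can have pandas parse the date columns as datetimes while reading the file. Passing parse_dates=True only tries to parse the index, so name the columns explicitly. skipinitialspace=True is only needed if the file really has a space after each comma, as in the question's listing.
df = pd.read_csv("df.csv", parse_dates=["from_dt", "to_dt"], skipinitialspace=True)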
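To make the parsing step concrete, here is a minimal sketch that reads the question's df from an in-memory string. The string stands in for df.csv (the real file layout is an assumption), and it assumes the file has a space after each comma, as in the question's listing.

```python
import io

import pandas as pd

# The question's sample rows, held in memory as a stand-in for df.csv (assumed layout).
csv_text = """value, my_key, from_dt, to_dt
1, a, 2007-01-01, 2009-02-01
2, b, 2001-01-01, 2011-01-01
3, c, 2015-01-01, 2020-01-01
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    skipinitialspace=True,             # ignore the space that follows each comma
    parse_dates=["from_dt", "to_dt"],  # convert these two columns to datetime64
)

print(df.dtypes)
```

If the dtypes printed for from_dt and to_dt are datetime64, the later pd.to_datetime calls are not needed.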
Answer 1 (score: 1)
This can be done in two steps. First do an outer merge on my_key, then keep only the rows whose date ranges overlap.
import pandas as pd
# your data
# ===================================
df
   value my_key     from_dt       to_dt
0      1      a  2007-01-01  2009-02-01
1      2      b  2001-01-01  2011-01-01
2      3      c  2015-01-01  2020-01-01
df2
  my_key value2     from_dt       to_dt
0      a     a1  2007-01-01  2008-01-01
1      a     a2  2008-01-01  2010-01-01
2      b     b1  2009-01-01  2015-01-01
3      c     c1  2011-01-01  2011-12-30
# processing
# ======================================
# outer merge
df_temp = pd.merge(df, df2, on=['my_key'], how='outer')
# just make sure that the columns are in proper datetime type
# you don't have to do this if your data is already in datetime
df_temp.from_dt_x = pd.to_datetime(df_temp.from_dt_x)
df_temp.to_dt_x = pd.to_datetime(df_temp.to_dt_x)
df_temp.from_dt_y = pd.to_datetime(df_temp.from_dt_y)
df_temp.to_dt_y = pd.to_datetime(df_temp.to_dt_y)
   value my_key  from_dt_x    to_dt_x value2  from_dt_y    to_dt_y
0      1      a 2007-01-01 2009-02-01     a1 2007-01-01 2008-01-01
1      1      a 2007-01-01 2009-02-01     a2 2008-01-01 2010-01-01
2      2      b 2001-01-01 2011-01-01     b1 2009-01-01 2015-01-01
3      3      c 2015-01-01 2020-01-01     c1 2011-01-01 2011-12-30
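# get rows that do overlap: each range must start on or before the other one ends
# .copy() gives an independent frame, so adding columns later does not trigger SettingWithCopyWarning
result = df_temp[(df_temp.to_dt_x >= df_temp.from_dt_y) & (df_temp.from_dt_x <= df_temp.to_dt_y)].copy()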
   value my_key  from_dt_x    to_dt_x value2  from_dt_y    to_dt_y
0      1      a 2007-01-01 2009-02-01     a1 2007-01-01 2008-01-01
1      1      a 2007-01-01 2009-02-01     a2 2008-01-01 2010-01-01
2      2      b 2001-01-01 2011-01-01     b1 2009-01-01 2015-01-01
result['from_dt'] = result[['from_dt_x', 'from_dt_y']].max(axis=1)
result['to_dt'] = result[['to_dt_x', 'to_dt_y']].min(axis=1)
# drop() returns a new frame rather than modifying in place, so assign it back
result = result.drop(['from_dt_x', 'to_dt_x', 'from_dt_y', 'to_dt_y'], axis=1)
   value my_key value2    from_dt      to_dt
0      1      a     a1 2007-01-01 2008-01-01
1      1      a     a2 2008-01-01 2009-02-01
2      2      b     b1 2009-01-01 2011-01-01
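The steps above can be collected into one helper. The following is only a sketch: the function name overlap_join and its parameters are hypothetical, not part of pandas or of the answer. It also swaps the outer merge for an inner one, on the assumption that a key present in only one frame can never produce an overlapping row, so the filtered result should be the same.

```python
import pandas as pd


def overlap_join(left, right, key="my_key", start="from_dt", end="to_dt"):
    """Hypothetical helper: join on `key`, keep rows whose closed date ranges overlap."""
    # Inner merge (an assumption; the answer uses 'outer'): unmatched keys cannot overlap.
    merged = pd.merge(left, right, on=key, how="inner", suffixes=("_x", "_y"))

    # Two closed ranges overlap when each one starts on or before the other ends.
    mask = (merged[f"{end}_x"] >= merged[f"{start}_y"]) & (
        merged[f"{start}_x"] <= merged[f"{end}_y"]
    )
    out = merged.loc[mask].copy()

    # The shared window runs from the later start to the earlier end.
    out[start] = out[[f"{start}_x", f"{start}_y"]].max(axis=1)
    out[end] = out[[f"{end}_x", f"{end}_y"]].min(axis=1)

    suffixed = [f"{start}_x", f"{end}_x", f"{start}_y", f"{end}_y"]
    return out.drop(columns=suffixed).reset_index(drop=True)


# The question's sample data, with the date columns already in datetime form.
df = pd.DataFrame({
    "value": [1, 2, 3],
    "my_key": ["a", "b", "c"],
    "from_dt": pd.to_datetime(["2007-01-01", "2001-01-01", "2015-01-01"]),
    "to_dt": pd.to_datetime(["2009-02-01", "2011-01-01", "2020-01-01"]),
})
df2 = pd.DataFrame({
    "my_key": ["a", "a", "b", "c"],
    "value2": ["a1", "a2", "b1", "c1"],
    "from_dt": pd.to_datetime(["2007-01-01", "2008-01-01", "2009-01-01", "2011-01-01"]),
    "to_dt": pd.to_datetime(["2008-01-01", "2010-01-01", "2015-01-01", "2011-12-30"]),
})

result = overlap_join(df, df2)
print(result)
```

Because the comparison uses >= and <=, ranges that merely touch at a single date count as overlapping. Switch to strict inequalities if that is not what you want. Note also that the merge builds every pairing of rows that share a key before filtering, so memory use can grow quickly when keys repeat many times.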