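I am trying to join two dataframes whose dates do not match up exactly. For a given group/date in the left dataframe, I want to join the corresponding record from the right dataframe that has the most recent date at or before the left dataframe's date. It is probably easiest to show with an example.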
DF1:
group  date     teacher
a      1/10/00  1
a      2/27/00  1
b      1/7/00   1
b      4/5/00   1
c      2/9/00   2
c      9/12/00  2
DF2:
teacher  date     hair length
1        1/1/00   4
1        1/5/00   8
1        1/30/00  20
1        3/20/00  100
2        1/1/00   0
2        8/10/00  50
Gives us:
group  date     teacher  hair length
a      1/10/00  1        8
a      2/27/00  1        20
b      1/7/00   1        8
b      4/5/00   1        100
c      2/9/00   2        0
c      9/12/00  2        50
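Edit 1: I hacked together a way to do this. Basically I iterate over every row in df1 and pick out the most recent corresponding entry in df2. This is very slow; there must be a better way.
Answer 0 (score: 1)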
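One way to do this is to create a new column in the left dataframe that holds, for each row's date, the closest earlier (or equal) date from the right dataframe:
    df1['join_date'] = df1.date.map(lambda x: df2.date[df2.date <= x].max())
A regular join or merge between 'join_date' on the left and 'date' on the right will then work. You may need to adjust the function to handle null values or other edge cases. Note that the lookup above considers dates only; when df2 holds several teachers, it also has to filter on teacher.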
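The sketch below applies the join-date idea to the data from the question. It assumes the lookup should be restricted to rows for the same teacher, and it uses ISO-formatted dates for the same calendar days; the helper name `latest_prior_date` is made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    "group": ["a", "a", "b", "b", "c", "c"],
    "date": pd.to_datetime(["2000-01-10", "2000-02-27", "2000-01-07",
                            "2000-04-05", "2000-02-09", "2000-09-12"]),
    "teacher": [1, 1, 1, 1, 2, 2],
})
df2 = pd.DataFrame({
    "teacher": [1, 1, 1, 1, 2, 2],
    "date": pd.to_datetime(["2000-01-01", "2000-01-05", "2000-01-30",
                            "2000-03-20", "2000-01-01", "2000-08-10"]),
    "hair length": [4, 8, 20, 100, 0, 50],
})

def latest_prior_date(row):
    # Hypothetical helper: latest df2 date for the same teacher that is
    # on or before this row's date (NaT when there is none).
    mask = (df2["teacher"] == row["teacher"]) & (df2["date"] <= row["date"])
    return df2.loc[mask, "date"].max()

df1["join_date"] = pd.to_datetime(df1.apply(latest_prior_date, axis=1))

# Ordinary left merge on teacher plus the computed join date.
merged = df1.merge(
    df2.rename(columns={"date": "join_date"}),
    on=["teacher", "join_date"],
    how="left",
)
print(merged[["group", "date", "teacher", "hair length"]])
```

Rows of df1 that have no earlier entry in df2 end up with a missing join date and therefore a missing hair length after the left merge.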
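This is not very efficient (you search the right-hand dates over and over). A more efficient approach is to sort both dataframes by date, iterate through the left dataframe, and consume entries from the right dataframe until its date becomes larger:
    # Assuming df1 and df2 are both sorted by date. Like the snippet above,
    # this ignores teacher, so apply it per teacher if df2 holds several.
    df1['hair length'] = 0  # initialize
    r_generator = df2.iterrows()
    _, cur_r_row = next(r_generator)
    # Assume 0 works when df1 has a date earlier than anything in df2.
    # Set once, outside the loop, so the last matched value carries over
    # to later df1 rows that fall before the next df2 date.
    cur_hair_length = 0
    for i, l_row in df1.iterrows():
        while cur_r_row['date'] <= l_row['date']:
            cur_hair_length = cur_r_row['hair length']
            try:
                _, cur_r_row = next(r_generator)
            except StopIteration:
                break
        df1.loc[i, 'hair length'] = cur_hair_length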
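For reference, pandas itself offers `pd.merge_asof` (added in version 0.19), which performs this kind of sorted "latest earlier row" match without a Python-level loop. This is a different technique from the hand-written loop in this answer, shown here only as a sketch on the question's data with ISO-formatted dates; `by="teacher"` is assumed to be the right matching key.

```python
import pandas as pd

df1 = pd.DataFrame({
    "group": ["a", "a", "b", "b", "c", "c"],
    "date": pd.to_datetime(["2000-01-10", "2000-02-27", "2000-01-07",
                            "2000-04-05", "2000-02-09", "2000-09-12"]),
    "teacher": [1, 1, 1, 1, 2, 2],
})
df2 = pd.DataFrame({
    "teacher": [1, 1, 1, 1, 2, 2],
    "date": pd.to_datetime(["2000-01-01", "2000-01-05", "2000-01-30",
                            "2000-03-20", "2000-01-01", "2000-08-10"]),
    "hair length": [4, 8, 20, 100, 0, 50],
})

# merge_asof requires both frames to be sorted by the "on" column.
left = df1.sort_values("date")
right = df2.sort_values("date")

# direction="backward" picks, per teacher, the last df2 row whose date
# is on or before the df1 date.
result = pd.merge_asof(left, right, on="date", by="teacher",
                       direction="backward")

# Restore the group/date ordering used in the question.
result = result.sort_values(["group", "date"]).reset_index(drop=True)
print(result)
```

Rows with no earlier match come back with a missing hair length rather than 0, so a default has to be filled in separately if one is wanted.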
Answer 1 (score: 0)
The fastest way seems to be to use sqlite via pysqldf:
    import pandas as pd
    from pandasql import sqldf

    def partial_versioned_join(tablea, tableb, tablea_keys, tableb_keys,
                               base_id):
        # base_id: the original code used this name without defining it. It
        # is assumed here to be tablea's row-identifying column.
        try:
            tablea_group, tablea_date = tablea_keys
            tableb_group, tableb_date = tableb_keys
        except ValueError as e:
            raise ValueError('Need to pass in both a group and date key '
                             'for both tables') from e
        # Note: can't actually use group here as a field name, because it is
        # a reserved word in sqlite; rename that column before running this.
        statement = """SELECT a.group, a.{date_a} AS {temp_date}, b.*
                       FROM (SELECT tablea.group, tablea.{date_a}, tablea.{group_a},
                                    MAX(tableb.{date_b}) AS tdate
                             FROM tablea
                             JOIN tableb
                               ON tablea.{group_a}=tableb.{group_b}
                              AND tablea.{date_a}>=tableb.{date_b}
                             GROUP BY tablea.{base_id}, tablea.{date_a}, tablea.{group_a}
                            ) AS a
                       JOIN tableb b
                         ON a.{group_a}=b.{group_b}
                        AND a.tdate=b.{date_b};
                    """.format(group_a=tablea_group, date_a=tablea_date,
                               group_b=tableb_group, date_b=tableb_date,
                               temp_date='join_date', base_id=base_id)
        # Note: you lose types here for tableb so you may want to save them
        pre_join_tableb = sqldf(statement, locals())
        return pd.merge(tablea, pre_join_tableb, how='inner',
                        left_on=['group'] + list(tablea_keys),
                        right_on=['group', tableb_group, 'join_date'])
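To see the query pattern on its own, here is a small sketch that runs the same two-step SQL (find the latest earlier date per row, then join back for the value) directly through Python's standard `sqlite3` module on the question's data. The table layout, the column name `grp` (chosen to avoid the reserved word `group`), `hair_length`, and the ISO date strings are all assumptions made for this illustration, not part of the answer's code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tablea (grp TEXT, date TEXT, teacher INTEGER);
    CREATE TABLE tableb (teacher INTEGER, date TEXT, hair_length INTEGER);
""")
conn.executemany("INSERT INTO tablea VALUES (?, ?, ?)", [
    ("a", "2000-01-10", 1), ("a", "2000-02-27", 1),
    ("b", "2000-01-07", 1), ("b", "2000-04-05", 1),
    ("c", "2000-02-09", 2), ("c", "2000-09-12", 2),
])
conn.executemany("INSERT INTO tableb VALUES (?, ?, ?)", [
    (1, "2000-01-01", 4), (1, "2000-01-05", 8), (1, "2000-01-30", 20),
    (1, "2000-03-20", 100), (2, "2000-01-01", 0), (2, "2000-08-10", 50),
])

# Inner query: latest tableb date on or before each tablea row's date.
# Outer join: fetch the tableb row that carries that date.
# ISO date strings compare correctly as plain text.
query = """
SELECT a.grp, a.date, a.teacher, b.hair_length
FROM (SELECT tablea.grp, tablea.date, tablea.teacher,
             MAX(tableb.date) AS tdate
      FROM tablea
      JOIN tableb
        ON tablea.teacher = tableb.teacher
       AND tablea.date >= tableb.date
      GROUP BY tablea.grp, tablea.date, tablea.teacher
     ) AS a
JOIN tableb AS b
  ON a.teacher = b.teacher
 AND a.tdate = b.date
ORDER BY a.grp, a.date;
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Because both joins are inner joins, a tablea row with no earlier tableb entry is dropped from the result instead of being kept with an empty value.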