I have two dataframes, each with a datetime column:
df_long=
mytime_long
0 00:00:01 1/10/2013
1 00:00:05 1/10/2013
2 00:00:55 1/10/2013
df_short=
mytime_short
0 00:00:02 1/10/2013
1 00:00:03 1/10/2013
2 00:00:06 1/10/2013
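The timestamps are unique and can be assumed to be sorted within each of the two dataframes.
For each value in mytime_long, I want to create a new dataframe containing the nearest (index, mytime_short) pair whose time is at or after that value (so the time delta is non-negative), e.g.: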
0 (0, 00:00:02 1/10/2013)
1 (2, 00:00:06 1/10/2013)
2 (np.nan,np.nat)
Answer 0 (score: 3)
Write a function that, for a given timestamp, returns the closest index and timestamp in df_short:
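def get_closest(n):
    # rows of df_short whose time is at or after n
    mask = df_short.mytime_short >= n
    ids = np.where(mask)[0]
    if ids.size > 0:
        # first match is the nearest one, since df_short is sorted
        return ids[0], df_short.mytime_short[ids[0]]
    else:
        return np.nan, np.nan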
Apply this function to df_long.mytime_long to get the index and timestamp values as tuples:
df = df_long.mytime_long.apply(get_closest)
df
# output:
0 (0, 2013-01-10 00:00:02)
1 (2, 2013-01-10 00:00:06)
2 (nan, nan)
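Because both columns are sorted, the "first position whose value is >= n" lookup that get_closest performs row by row can also be done for all rows at once. The following is only a sketch of that idea using np.searchsorted, which is not part of the answer above. It assumes df_short has the default 0..n-1 index (so positions and index labels coincide); the variable names short, long_, pos and found are made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question (ISO format used here for unambiguous parsing)
long_ = pd.Series(pd.to_datetime(
    ["2013-01-10 00:00:01", "2013-01-10 00:00:05", "2013-01-10 00:00:55"]),
    name="mytime_long")
short = pd.Series(pd.to_datetime(
    ["2013-01-10 00:00:02", "2013-01-10 00:00:03", "2013-01-10 00:00:06"]),
    name="mytime_short")

# side="left" gives the first position i with short[i] >= long value,
# which is the same position get_closest returns as ids[0]
pos = np.searchsorted(short.to_numpy(), long_.to_numpy(), side="left")
found = pos < len(short)  # False where no later-or-equal timestamp exists

# Clip positions so the lookup is always valid, then blank out the misses
safe_pos = np.minimum(pos, len(short) - 1)
result = pd.DataFrame({
    "index": pd.Series(pos, index=long_.index).where(found),
    "mytime_short": pd.Series(short.to_numpy()[safe_pos],
                              index=long_.index).where(found),
})
print(result)
```

Rows with no match end up as NaN / NaT, which matches the (np.nan, np.nat) row asked for in the question.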
ilia timofeev's answer reminded me of the pandas.merge_asof function, which is a perfect fit for this type of join:
df = pd.merge_asof(df_long,
                   df_short.reset_index(),
                   left_on='mytime_long',
                   right_on='mytime_short',
                   direction='forward')[['index', 'mytime_short']]
df
# output:
index mytime_short
0 0.0 2013-01-10 00:00:02
1 2.0 2013-01-10 00:00:06
2 NaN NaT
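For anyone who wants to run the merge_asof approach on its own, here is a self-contained sketch with the sample data from the question. The variable name matched is just illustrative, and the timestamps are written in ISO format as an assumption to avoid any day/month ambiguity when parsing. direction='forward' picks the first right-hand row whose key is at or after the left-hand key, and both frames must already be sorted on their key columns.

```python
import pandas as pd

# Sample data from the question
df_long = pd.DataFrame({"mytime_long": pd.to_datetime(
    ["2013-01-10 00:00:01", "2013-01-10 00:00:05", "2013-01-10 00:00:55"])})
df_short = pd.DataFrame({"mytime_short": pd.to_datetime(
    ["2013-01-10 00:00:02", "2013-01-10 00:00:03", "2013-01-10 00:00:06"])})

# reset_index() turns df_short's index into a regular column named "index",
# so it is carried along with the matched timestamp
matched = pd.merge_asof(
    df_long,
    df_short.reset_index(),
    left_on="mytime_long",
    right_on="mytime_short",
    direction="forward",
)
print(matched[["index", "mytime_short"]])
```

The index column comes back as float because the unmatched last row holds NaN.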
Answer 1 (score: 1)
#recreate data
df_long = pd.DataFrame(
    pd.to_datetime(['00:00:01 1/10/2013', '00:00:05 1/10/2013', '00:00:55 1/10/2013']),
    index=[0, 1, 2], columns=['mytime_long'])
df_short = pd.DataFrame(
    pd.to_datetime(['00:00:02 1/10/2013', '00:00:03 1/10/2013', '00:00:06 1/10/2013']),
    index=[0, 1, 2], columns=['mytime_short'])
#join by time, preserving ids
df_all = df_short.assign(inx_s=df_short.index).set_index('mytime_short').join(
    df_long.assign(inx_l=df_long.index).set_index('mytime_long'), how='outer')
#mark all "short" rows with nearest "long" id
df_all['inx_l'] = df_all.inx_l.ffill().fillna(-1)
#select "short" rows
df_short_candidate = df_all[~df_all.inx_s.isnull()].astype(int)
df_short_candidate['mytime_short'] = df_short_candidate.index
#select the minimal "short" time in each "long" group,
#join back with long to recover empty intersection
df_res = df_long.join(df_short_candidate.groupby('inx_l').first())
print (df_res)
Output:
mytime_long inx_s mytime_short
0 2013-01-10 00:00:01 0.0 2013-01-10 00:00:02
1 2013-01-10 00:00:05 2.0 2013-01-10 00:00:06
2 2013-01-10 00:00:55 NaN NaT
Performance comparison on a sample of 100,000 elements:
df_long.mytime_long.apply(get_closest)
UPD: but the winner is @Haleemur Ali's pd.merge_asof
10ms
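To reproduce a comparison like this locally, something along the following lines could be used. It is only a rough sketch: the sample size n, the random data generator and the helper make_frame are assumptions made up for illustration (n is kept small because the apply version does one full scan of df_short per row), and absolute timings depend on the machine and pandas version.

```python
import time

import numpy as np
import pandas as pd

n = 2000  # hypothetical sample size, much smaller than the 100,000 above
rng = np.random.default_rng(0)
base = pd.Timestamp("2013-01-10")

def make_frame(col):
    # n unique, sorted second offsets from the base timestamp
    secs = np.sort(rng.choice(10 * n, size=n, replace=False))
    return pd.DataFrame({col: base + pd.to_timedelta(secs, unit="s")})

df_long = make_frame("mytime_long")
df_short = make_frame("mytime_short")

def get_closest(ts):
    # same logic as the function in answer 0
    mask = df_short.mytime_short >= ts
    ids = np.where(mask)[0]
    if ids.size > 0:
        return ids[0], df_short.mytime_short[ids[0]]
    return np.nan, np.nan

t0 = time.perf_counter()
via_apply = df_long.mytime_long.apply(get_closest)
t1 = time.perf_counter()
via_asof = pd.merge_asof(df_long, df_short.reset_index(),
                         left_on="mytime_long", right_on="mytime_short",
                         direction="forward")
t2 = time.perf_counter()

# Check that both approaches picked the same df_short index for every row
idx_apply = np.array([pair[0] for pair in via_apply], dtype=float)
idx_asof = via_asof["index"].to_numpy(dtype=float)
print("results agree:", np.allclose(idx_apply, idx_asof, equal_nan=True))
print(f"apply: {t1 - t0:.4f}s, merge_asof: {t2 - t1:.4f}s")
```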