How to merge on the nearest (or most recent) timestamp

Time: 2015-12-17 11:00:15

Tags: python pandas

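Suppose I have a DataFrame df1 with columns 'A' and 'B'. 'A' is a column of timestamps (e.g. unixtime) and 'B' is a column of some value.

Suppose I also have a DataFrame df2 with columns 'C' and 'D'. 'C' is also a unixtime column, and 'D' is a column containing some other values.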

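I would like to do a fuzzy merge of the two DataFrames, joining on the timestamp. However, if the timestamps don't match exactly (which they most likely won't), I would like each row to be merged with the closest entry in 'C' that comes before its timestamp in 'A'.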

pd.merge doesn't support this, and I find myself converting away from DataFrames with to_dict() and iterating to solve it. Is there a way to do this in pandas?

2 answers:

Answer 0: (score: 3)

You can use numpy.searchsorted() (see docs) to find the appropriate index positions to merge on. Hopefully the following gets you closer to what you're looking for:

from datetime import datetime, timedelta
from random import randrange
import numpy as np
import pandas as pd
start = datetime(2015, 12, 1)
df1 = pd.DataFrame({'A': [start + timedelta(minutes=randrange(60)) for i in range(10)], 'B': [1] * 10}).sort_values('A').reset_index(drop=True)
df2 = pd.DataFrame({'C': [start + timedelta(minutes=randrange(60)) for i in range(10)], 'D': [2] * 10}).sort_values('C').reset_index(drop=True)
# Relabel each df2 row with the position of the first df1 row whose 'A' is at or after its 'C'.
df2.index = np.searchsorted(df1.A.values, df2.C.values)
print(pd.merge(left=df1, right=df2, left_index=True, right_index=True, how='left'))

                    A  B                   C   D
0 2015-12-01 00:01:00  1                 NaT NaN
1 2015-12-01 00:02:00  1 2015-12-01 00:02:00   2
2 2015-12-01 00:02:00  1                 NaT NaN
3 2015-12-01 00:12:00  1 2015-12-01 00:05:00   2
4 2015-12-01 00:16:00  1 2015-12-01 00:14:00   2
4 2015-12-01 00:16:00  1 2015-12-01 00:14:00   2
5 2015-12-01 00:28:00  1 2015-12-01 00:22:00   2
6 2015-12-01 00:30:00  1                 NaT NaN
7 2015-12-01 00:39:00  1 2015-12-01 00:31:00   2
7 2015-12-01 00:39:00  1 2015-12-01 00:39:00   2
8 2015-12-01 00:55:00  1 2015-12-01 00:40:00   2
8 2015-12-01 00:55:00  1 2015-12-01 00:46:00   2
8 2015-12-01 00:55:00  1 2015-12-01 00:54:00   2
9 2015-12-01 00:57:00  1                 NaT NaN
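
Note that the default side='left' search above assigns each row of df2 to the first row of df1 at or after it, so several 'C' rows can land on the same 'A' row. If what is wanted is exactly one match per row of df1 (the latest 'C' at or before each 'A', as the question describes), one possible variant is to search in the other direction. This is only a minimal sketch under that assumption; the sample unixtime values and the helper column name `pos` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data (not from the question): unix timestamps in seconds.
df1 = pd.DataFrame({'A': [100, 205, 310, 400], 'B': [1, 2, 3, 4]})
df2 = pd.DataFrame({'C': [150, 200, 300, 305], 'D': [10, 20, 30, 40]})

# Both frames must be sorted by their time column for searchsorted to work.
df1 = df1.sort_values('A').reset_index(drop=True)
df2 = df2.sort_values('C').reset_index(drop=True)

# side='right' returns the number of C values <= each A, so subtracting 1
# gives the position of the latest C at or before A (-1 means there is none).
pos = np.searchsorted(df2['C'].values, df1['A'].values, side='right') - 1

# Join on that position; rows with pos == -1 find no partner and get NaN,
# because df2's index has no label -1.
result = (
    df1.assign(pos=pos)
    .merge(df2, left_on='pos', right_index=True, how='left')
    .drop(columns='pos')
    .reset_index(drop=True)
)
print(result)
```

Here the first row of df1 (A=100) has no earlier 'C' and is left unmatched, while every other row gets exactly one partner.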

Answer 1: (score: 0)

Building on @Stephan's answer and @JohnE's comment, pandas.merge_asof, available in pandas >= 0.19.0, can do something similar:

>>> import numpy as np
>>> import pandas as pd
>>> from datetime import datetime, timedelta
>>> start = datetime(2015, 12, 1)
>>> a_timestamps = pd.date_range(start, start + timedelta(hours=4.5), freq='30Min')
>>> c_timestamps = pd.date_range(start, start + timedelta(hours=9), freq='H')
>>> df1 = pd.DataFrame({'A': a_timestamps, 'B': range(10)})
>>> df1

                    A  B
0 2015-12-01 00:00:00  0
1 2015-12-01 00:30:00  1
2 2015-12-01 01:00:00  2
3 2015-12-01 01:30:00  3
4 2015-12-01 02:00:00  4
5 2015-12-01 02:30:00  5
6 2015-12-01 03:00:00  6
7 2015-12-01 03:30:00  7
8 2015-12-01 04:00:00  8
9 2015-12-01 04:30:00  9

>>> df2 = pd.DataFrame({'C': c_timestamps, 'D': range(10, 20)})
>>> df2

                    C   D
0 2015-12-01 00:00:00  10
1 2015-12-01 01:00:00  11
2 2015-12-01 02:00:00  12
3 2015-12-01 03:00:00  13
4 2015-12-01 04:00:00  14
5 2015-12-01 05:00:00  15
6 2015-12-01 06:00:00  16
7 2015-12-01 07:00:00  17
8 2015-12-01 08:00:00  18
9 2015-12-01 09:00:00  19

>>> pd.merge_asof(left=df1, right=df2, left_on='A', right_on='C')

                    A  B                   C   D
0 2015-12-01 00:00:00  0 2015-12-01 00:00:00  10
1 2015-12-01 00:30:00  1 2015-12-01 00:00:00  10
2 2015-12-01 01:00:00  2 2015-12-01 01:00:00  11
3 2015-12-01 01:30:00  3 2015-12-01 01:00:00  11
4 2015-12-01 02:00:00  4 2015-12-01 02:00:00  12
5 2015-12-01 02:30:00  5 2015-12-01 02:00:00  12
6 2015-12-01 03:00:00  6 2015-12-01 03:00:00  13
7 2015-12-01 03:30:00  7 2015-12-01 03:00:00  13
8 2015-12-01 04:00:00  8 2015-12-01 04:00:00  14
9 2015-12-01 04:30:00  9 2015-12-01 04:00:00  14
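
merge_asof searches backward by default, which matches what the question asks for (the latest 'C' at or before each 'A'). For the "nearest" case in the title, newer pandas versions (0.20.0 and later, as far as I know) also accept a direction argument, and tolerance caps how far apart a match may be. A minimal sketch, with sample timestamps invented for illustration:

```python
import pandas as pd

# Illustrative data (not from the question); both frames must be sorted on the key.
df1 = pd.DataFrame({
    'A': pd.to_datetime(['2015-12-01 00:10', '2015-12-01 00:50', '2015-12-01 03:00']),
    'B': [0, 1, 2],
})
df2 = pd.DataFrame({
    'C': pd.to_datetime(['2015-12-01 00:00', '2015-12-01 01:00']),
    'D': [10, 11],
})

# Default direction='backward': latest C at or before each A.
backward = pd.merge_asof(df1, df2, left_on='A', right_on='C')

# direction='nearest': closest C on either side of each A.
nearest = pd.merge_asof(df1, df2, left_on='A', right_on='C', direction='nearest')

# tolerance leaves a row unmatched when the closest C is further away than this.
limited = pd.merge_asof(df1, df2, left_on='A', right_on='C',
                        direction='nearest', tolerance=pd.Timedelta('30min'))

print(backward, nearest, limited, sep='\n\n')
```

With these values the 00:50 row picks 00:00 under the backward search but 01:00 under the nearest search, and the 03:00 row is left unmatched once the 30-minute tolerance is applied. If the keys are integer unix timestamps rather than datetimes, the tolerance would be an integer instead of a Timedelta.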