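Suppose I have a DataFrame df1 with columns 'A' and 'B'. 'A' is a column of timestamps (e.g., unixtime) and 'B' is a column of some value.
Suppose I also have a DataFrame df2 with columns 'C' and 'D'. 'C' is also a unixtime column and 'D' is a column containing some other values.
I would like to fuzzy merge the DataFrames with a join on the timestamp. However, if the timestamps don't match (which they most likely won't), I would like it to merge on the closest entry before the timestamp in 'A' that it can find in 'C'.
pd.merge does not support this, and I find myself converting away from DataFrames with to_dict() and using some iteration to solve it. Is there a way to do this in pandas?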
Answer 0 (score: 3)
numpy.searchsorted() (see docs) finds the appropriate index positions to merge on. Hopefully the following gets you closer to what you are looking for:
from datetime import datetime, timedelta
from random import randrange
import numpy as np
import pandas as pd
start = datetime(2015, 12, 1)
df1 = pd.DataFrame({'A': [start + timedelta(minutes=randrange(60)) for i in range(10)], 'B': [1] * 10}).sort_values('A').reset_index(drop=True)
df2 = pd.DataFrame({'C': [start + timedelta(minutes=randrange(60)) for i in range(10)], 'D': [2] * 10}).sort_values('C').reset_index(drop=True)
# Relabel each df2 row with the position of the first df1.A value at or after its C value
df2.index = np.searchsorted(df1.A.values, df2.C.values)
print(pd.merge(left=df1, right=df2, left_index=True, right_index=True, how='left'))
                    A  B                   C    D
0 2015-12-01 00:01:00  1                 NaT  NaN
1 2015-12-01 00:02:00  1 2015-12-01 00:02:00    2
2 2015-12-01 00:02:00  1                 NaT  NaN
3 2015-12-01 00:12:00  1 2015-12-01 00:05:00    2
4 2015-12-01 00:16:00  1 2015-12-01 00:14:00    2
4 2015-12-01 00:16:00  1 2015-12-01 00:14:00    2
5 2015-12-01 00:28:00  1 2015-12-01 00:22:00    2
6 2015-12-01 00:30:00  1                 NaT  NaN
7 2015-12-01 00:39:00  1 2015-12-01 00:31:00    2
7 2015-12-01 00:39:00  1 2015-12-01 00:39:00    2
8 2015-12-01 00:55:00  1 2015-12-01 00:40:00    2
8 2015-12-01 00:55:00  1 2015-12-01 00:46:00    2
8 2015-12-01 00:55:00  1 2015-12-01 00:54:00    2
9 2015-12-01 00:57:00  1                 NaT  NaN
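Note that the merge above attaches each df2 row to the first df1 row whose 'A' is at or after its 'C', so one df1 row can pick up several df2 rows. If exactly one match per df1 row is wanted (the latest 'C' at or before each 'A', as the question describes), a variation is to search in the other direction. The following is only a minimal sketch under that assumption; the sample data and the names pos, matched and result are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; df2 must be sorted by its time column.
df1 = pd.DataFrame({
    "A": pd.to_datetime(["2015-12-01 00:05", "2015-12-01 00:10", "2015-12-01 00:20"]),
    "B": [1, 1, 1],
})
df2 = pd.DataFrame({
    "C": pd.to_datetime(["2015-12-01 00:07", "2015-12-01 00:10", "2015-12-01 00:15"]),
    "D": [10, 20, 30],
})

# side="right" counts how many C values are <= each A value, so
# subtracting 1 gives the position of the latest C at or before A.
# A result of -1 means no C is early enough.
pos = np.searchsorted(df2["C"].to_numpy(), df1["A"].to_numpy(), side="right") - 1

# reindex() yields an all-missing row for the label -1, which is not
# in df2's default 0..n-1 index, so unmatched rows become NaT/NaN.
matched = df2.reindex(pos).reset_index(drop=True)
result = pd.concat([df1, matched], axis=1)
print(result)
```

Here the first row of df1 has no earlier entry in df2 and is left unmatched, while the other two rows pick up the entries at 00:10 and 00:15. The 'D' column is upcast to float because of the missing value.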
Answer 1 (score: 0)
Building on @Stephan's answer and @JohnE's comment, pandas.merge_asof (available in pandas >= 0.19.0) can do something similar:
>>> import numpy as np
>>> import pandas as pd
>>> from datetime import datetime, timedelta
>>> start = datetime(2015, 12, 1)
>>> a_timestamps = pd.date_range(start, start + timedelta(hours=4.5), freq='30Min')
>>> c_timestamps = pd.date_range(start, start + timedelta(hours=9), freq='H')
>>> df1 = pd.DataFrame({'A': a_timestamps, 'B': range(10)})
>>> df1
                    A  B
0 2015-12-01 00:00:00  0
1 2015-12-01 00:30:00  1
2 2015-12-01 01:00:00  2
3 2015-12-01 01:30:00  3
4 2015-12-01 02:00:00  4
5 2015-12-01 02:30:00  5
6 2015-12-01 03:00:00  6
7 2015-12-01 03:30:00  7
8 2015-12-01 04:00:00  8
9 2015-12-01 04:30:00  9
>>> df2 = pd.DataFrame({'C': c_timestamps, 'D': range(10, 20)})
>>> df2
                    C   D
0 2015-12-01 00:00:00  10
1 2015-12-01 01:00:00  11
2 2015-12-01 02:00:00  12
3 2015-12-01 03:00:00  13
4 2015-12-01 04:00:00  14
5 2015-12-01 05:00:00  15
6 2015-12-01 06:00:00  16
7 2015-12-01 07:00:00  17
8 2015-12-01 08:00:00  18
9 2015-12-01 09:00:00  19
>>> pd.merge_asof(left=df1, right=df2, left_on='A', right_on='C')
                    A  B                   C   D
0 2015-12-01 00:00:00  0 2015-12-01 00:00:00  10
1 2015-12-01 00:30:00  1 2015-12-01 00:00:00  10
2 2015-12-01 01:00:00  2 2015-12-01 01:00:00  11
3 2015-12-01 01:30:00  3 2015-12-01 01:00:00  11
4 2015-12-01 02:00:00  4 2015-12-01 02:00:00  12
5 2015-12-01 02:30:00  5 2015-12-01 02:00:00  12
6 2015-12-01 03:00:00  6 2015-12-01 03:00:00  13
7 2015-12-01 03:30:00  7 2015-12-01 03:00:00  13
8 2015-12-01 04:00:00  8 2015-12-01 04:00:00  14
9 2015-12-01 04:30:00  9 2015-12-01 04:00:00  14