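I want to merge one data frame into another, where the merge is conditional on a date/time falling within a particular range.
For example, suppose I have the following two data frames.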
import pandas as pd
# Create main data frame.
time_seq1 = pd.DataFrame(pd.date_range('1/1/2016', periods=3, freq='H'))
time_seq2 = pd.DataFrame(pd.date_range('1/2/2016', periods=3, freq='H'))
# DataFrame.append was removed in pandas 2.0; pd.concat builds the same frame.
data = pd.concat([time_seq1, time_seq1, time_seq1, time_seq2], ignore_index=True)
data['myID'] = ['001','001','001','002','002','002','003','003','003','004','004','004']
data.columns = ['Timestamp', 'myID']
# Create second data frame.
data2 = pd.DataFrame()
data2['time'] = [pd.to_datetime('1/1/2016 12:06 AM'), pd.to_datetime('1/1/2016 1:34 AM'), pd.to_datetime('1/2/2016 12:25 AM')]
data2['myID'] = ['002', '003', '004']
data2['specialID'] = ['foo_0', 'foo_1', 'foo_2']
# Show data frames.
data
             Timestamp myID
0  2016-01-01 00:00:00  001
1  2016-01-01 01:00:00  001
2  2016-01-01 02:00:00  001
3  2016-01-01 00:00:00  002
4  2016-01-01 01:00:00  002
5  2016-01-01 02:00:00  002
6  2016-01-01 00:00:00  003
7  2016-01-01 01:00:00  003
8  2016-01-01 02:00:00  003
9  2016-01-02 00:00:00  004
10 2016-01-02 01:00:00  004
11 2016-01-02 02:00:00  004
data2
                 time myID specialID
0 2016-01-01 00:06:00  002     foo_0
1 2016-01-01 01:34:00  003     foo_1
2 2016-01-02 00:25:00  004     foo_2
I would like to construct the following output.
# Desired output.
             Timestamp myID special_ID
0  2016-01-01 00:00:00  001        NaN
1  2016-01-01 01:00:00  001        NaN
2  2016-01-01 02:00:00  001        NaN
3  2016-01-01 00:00:00  002        NaN
4  2016-01-01 01:00:00  002      foo_0
5  2016-01-01 02:00:00  002        NaN
6  2016-01-01 00:00:00  003        NaN
7  2016-01-01 01:00:00  003        NaN
8  2016-01-01 02:00:00  003      foo_1
9  2016-01-02 00:00:00  004        NaN
10 2016-01-02 01:00:00  004      foo_2
11 2016-01-02 02:00:00  004        NaN
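In particular, I want to merge special_ID into data such that Timestamp is the first time occurring after the value of time. For example, foo_0 would go in the row corresponding to 2016-01-01 01:00:00 with myID = 002, since that is the next Timestamp in data immediately following 2016-01-01 00:06:00 (the time in the row of data2 containing special_ID = foo_0) among the rows with myID = 002.
Note that Timestamp is not the index of data, and time is not the index of data2. Most other related posts seem to rely on using the datetime object as the index of the data frame.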
Answer 0 (score: 8)
You can use merge_asof, new in pandas 0.19, to do most of the work. Then combine loc and duplicated to null out the secondary matches:
import numpy as np

# Data needs to be sorted for merge_asof.
data = data.sort_values(by='Timestamp')
# Perform the merge_asof.
df = pd.merge_asof(data, data2, left_on='Timestamp', right_on='time', by='myID').drop('time', axis=1)
# Make the additional matches null.
df.loc[df['specialID'].duplicated(), 'specialID'] = np.nan
# Get the original ordering.
df = df.set_index(data.index).sort_index()
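A possible variation, in case it helps: run the as-of merge the other way round, starting from data2 and looking forward for the first Timestamp in data that comes strictly after each time. This is only a sketch under a few assumptions. It needs a pandas version where merge_asof accepts direction (that parameter is not in 0.19), it rebuilds the sample frames in compact form so it runs on its own, and the helper column name row is made up for the example.

```python
import pandas as pd

# Rebuild the question's sample frames in compact form (same values).
hours = ['00:00', '01:00', '02:00']
stamps = ['2016-01-01 ' + h for h in hours] * 3 + ['2016-01-02 ' + h for h in hours]
data = pd.DataFrame({
    'Timestamp': pd.to_datetime(stamps),
    'myID': [i for i in ['001', '002', '003', '004'] for _ in hours],
})
data2 = pd.DataFrame({
    'time': pd.to_datetime(['2016-01-01 00:06', '2016-01-01 01:34', '2016-01-02 00:25']),
    'myID': ['002', '003', '004'],
    'specialID': ['foo_0', 'foo_1', 'foo_2'],
})

# 'row' is a hypothetical helper column that remembers each data row's label.
right = data.reset_index().rename(columns={'index': 'row'}).sort_values('Timestamp')
left = data2.sort_values('time')

# For each data2 row, find the first Timestamp strictly after time, per myID.
matches = pd.merge_asof(
    left, right,
    left_on='time', right_on='Timestamp',
    by='myID',
    direction='forward',
    allow_exact_matches=False,
)

# Drop data2 rows with no later Timestamp; if several data2 rows land on the
# same data row, keep only the last one (an arbitrary choice for this sketch).
matches = matches.dropna(subset=['row']).drop_duplicates(subset='row', keep='last')
matches['row'] = matches['row'].astype(int)

# Assignment aligns on the index, so unmatched rows are left as NaN.
result = data.copy()
result['specialID'] = matches.set_index('row')['specialID']
print(result)
```

Because each data2 row picks exactly one row of data, there are no secondary matches to null out afterwards, and it does not depend on the specialID values being unique.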
Answer 1 (score: 0)
Not very pretty, but I think it works.
data['specialID'] = None
# myIDs from data2 that have not been assigned yet.
foolist = list(data2['myID'])
for i in data.index:
    if data.myID[i] in foolist:
        # Only the first data2 row for this myID is considered.
        if data.Timestamp[i] > list(data2[data2['myID'] == data.myID[i]].time)[0]:
            # .loc avoids chained assignment, which may not write back to data.
            data.loc[i, 'specialID'] = list(data2[data2['myID'] == data.myID[i]].specialID)[0]
            foolist.remove(data.myID[i])
In [95]: data
Out[95]:
             Timestamp myID specialID
0  2016-01-01 00:00:00  001      None
1  2016-01-01 01:00:00  001      None
2  2016-01-01 02:00:00  001      None
3  2016-01-01 00:00:00  002      None
4  2016-01-01 01:00:00  002     foo_0
5  2016-01-01 02:00:00  002      None
6  2016-01-01 00:00:00  003      None
7  2016-01-01 01:00:00  003      None
8  2016-01-01 02:00:00  003     foo_1
9  2016-01-02 00:00:00  004      None
10 2016-01-02 01:00:00  004     foo_2
11 2016-01-02 02:00:00  004      None
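For comparison, here is a loop-free sketch of roughly the same idea: pair rows by myID with an ordinary merge, keep the pairs where Timestamp is after time, and take the earliest one for each data2 row. It is a sketch, not a drop-in replacement. The frames are rebuilt so the snippet runs on its own, the helper column name row is made up, and it assumes no two data2 rows end up on the same row of data.

```python
import pandas as pd

# Rebuild the question's sample frames in compact form (same values).
hours = ['00:00', '01:00', '02:00']
stamps = ['2016-01-01 ' + h for h in hours] * 3 + ['2016-01-02 ' + h for h in hours]
data = pd.DataFrame({
    'Timestamp': pd.to_datetime(stamps),
    'myID': [i for i in ['001', '002', '003', '004'] for _ in hours],
})
data2 = pd.DataFrame({
    'time': pd.to_datetime(['2016-01-01 00:06', '2016-01-01 01:34', '2016-01-02 00:25']),
    'myID': ['002', '003', '004'],
    'specialID': ['foo_0', 'foo_1', 'foo_2'],
})

# Pair every data row with the data2 rows that share its myID.
# 'row' is a hypothetical helper column holding the original row label.
pairs = data.reset_index().rename(columns={'index': 'row'}).merge(data2, on='myID')

# Keep only the pairs where Timestamp falls strictly after time.
pairs = pairs[pairs['Timestamp'] > pairs['time']]

# For each data2 row, keep the pair with the earliest qualifying Timestamp.
first = pairs.sort_values('Timestamp').drop_duplicates(subset=['myID', 'time'], keep='first')

# Assumes each data row is picked at most once; a duplicate 'row' label
# would make the index-aligned assignment below fail.
result = data.copy()
result['specialID'] = first.set_index('row')['specialID']
print(result)
```

The intermediate pairs frame holds one row per matching (data, data2) combination, so this can get large when many rows share a myID; the unmatched rows come out as NaN rather than None.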