I have two dataframes, created as shown below, that I need to merge_asof on datetime and id. The left dataframe is created like this:
import pandas as pd
import pytz
from datetime import datetime
dates = [datetime(2020, 1, 2, 8, 0, 0, 824000),
datetime(2020, 1, 8, 6, 2, 52, 833000),
datetime(2020, 1, 9, 22, 41, 18, 858000),
datetime(2020, 1, 16, 8, 0, 1, 404000),
datetime(2020, 1, 22, 8, 0, 1, 560000),
datetime(2020, 1, 23, 8, 0, 1, 493000)
]
timezone = pytz.timezone('US/Eastern')
dates_localized = [timezone.localize(d) for d in dates ]
ids = [1,1,1,2,2,2]
headlines = ['abc','def','jkl', 'mno','pqr', 'stx']
left = pd.DataFrame({'date':dates_localized, 'id':ids, 'headlines':headlines})
print(left)
                              date  id headlines
0 2020-01-02 08:00:00.824000-05:00   1       abc
1 2020-01-08 06:02:52.833000-05:00   1       def
2 2020-01-09 22:41:18.858000-05:00   1       jkl
3 2020-01-16 08:00:01.404000-05:00   2       mno
4 2020-01-22 08:00:01.560000-05:00   2       pqr
5 2020-01-23 08:00:01.493000-05:00   2       stx
The right dataframe is created in a similar way:
index = pd.DatetimeIndex(['2020-01-02 07:30:00.070041845',
'2020-01-08 05:30:00.167110660',
'2020-01-09 09:30:00.185073458',
'2020-01-16 09:30:00.190448059',
'2020-01-22 07:30:00.286648287',
'2020-01-22 06:30:00.376308078'])
right = pd.DataFrame({'id':[1,1,1,2,2,2], 'value':[1,0,0,1,1,0]})
right = right.set_index(index)
right.index.name = 'date'
print(right)
                               id  value
date
2020-01-02 07:30:00.070041845   1      1
2020-01-08 05:30:00.167110660   1      0
2020-01-09 09:30:00.185073458   1      0
2020-01-16 09:30:00.190448059   2      1
2020-01-22 07:30:00.286648287   2      1
2020-01-22 06:30:00.376308078   2      0
The merge:
df = pd.merge_asof(left, right, on='date', by='id')
It fails with this error:
MergeError: incompatible merge keys [1] datetime64[ns, US/Eastern] and dtype('<M8[ns]'), must be the same type
Any ideas on how to convert the times to a single type that merge_asof accepts?
Answer 0 (score: 1)
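One idea is to use DataFrame.tz_localize to set the timezone on the DatetimeIndex of right, so that both merge keys are timezone-aware in US/Eastern. The index is also sorted, because merge_asof requires sorted keys and the last two rows of right are out of order: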
df = pd.merge_asof(left, right.tz_localize('US/Eastern').sort_index(), on='date', by='id')
print (df)
                              date  id headlines  value
0 2020-01-02 08:00:00.824000-05:00   1       abc    1.0
1 2020-01-08 06:02:52.833000-05:00   1       def    0.0
2 2020-01-09 22:41:18.858000-05:00   1       jkl    0.0
3 2020-01-16 08:00:01.404000-05:00   2       mno    NaN
4 2020-01-22 08:00:01.560000-05:00   2       pqr    1.0
5 2020-01-23 08:00:01.493000-05:00   2       stx    1.0
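Note that tz_localize only attaches a zone to the existing clock readings, so the fix above is appropriate if the naive timestamps in right are already US/Eastern wall-clock times. The question does not say how they were recorded. If they were actually recorded in UTC (an assumption used purely for illustration here), localizing to UTC and then converting would be the safer route. A minimal sketch of the difference, using two shortened timestamps rather than the original data:

```python
import pandas as pd

# Naive timestamps, similar to the index of the right frame.
naive = pd.DatetimeIndex(['2020-01-02 07:30:00', '2020-01-08 05:30:00'])

# Case 1: the values are already US/Eastern wall-clock times.
# tz_localize attaches the zone and keeps the clock reading.
as_eastern = naive.tz_localize('US/Eastern')

# Case 2 (assumption): the values were recorded in UTC.
# Localize to UTC first, then convert; the clock reading shifts by the offset.
from_utc = naive.tz_localize('UTC').tz_convert('US/Eastern')

print(as_eastern[0])  # 2020-01-02 07:30:00-05:00
print(from_utc[0])    # 2020-01-02 02:30:00-05:00
```

Both results are timezone-aware in US/Eastern and would be accepted by merge_asof, but they refer to different instants, so the matched rows can differ.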
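Edit: if the date column of left is timezone-naive, set its timezone as well:
# Only valid for a naive column; tz_localize raises TypeError if left['date'] is already tz-aware.
left['date'] = left['date'].dt.tz_localize('US/Eastern')
# The index of right must carry the same timezone for the merge to succeed.
df = pd.merge_asof(left, right.sort_index(), on='date', by='id')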
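If it is not known in advance which side is naive, one option is a small helper that localizes naive values and converts aware ones, so the same call is safe on both frames. The sketch below is only an illustration: ensure_tz is a hypothetical name, not a pandas function, the data is a shortened version of the frames in the question, and it assumes any naive values are US/Eastern wall-clock times.

```python
import pandas as pd

def ensure_tz(values, tz):
    """Hypothetical helper: return a DatetimeIndex expressed in `tz`."""
    if values.tz is None:
        return values.tz_localize(tz)  # naive: attach the zone
    return values.tz_convert(tz)       # aware: convert to the zone

# Left frame: 'date' is already tz-aware (US/Eastern).
left = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-02 08:00:00', '2020-01-16 08:00:00',
                            '2020-01-22 08:00:00']).tz_localize('US/Eastern'),
    'id': [1, 2, 2],
    'headlines': ['abc', 'mno', 'pqr'],
})

# Right frame: naive, unsorted DatetimeIndex named 'date'.
right = pd.DataFrame(
    {'id': [1, 2, 2], 'value': [1, 1, 0]},
    index=pd.DatetimeIndex(['2020-01-02 07:30:00', '2020-01-22 07:30:00',
                            '2020-01-22 06:30:00'], name='date'),
)

# Normalize both keys to the same timezone.
left['date'] = ensure_tz(pd.DatetimeIndex(left['date']), 'US/Eastern')
right.index = ensure_tz(right.index, 'US/Eastern')
right.index.name = 'date'

# merge_asof needs sorted keys; reset_index turns 'date' into an ordinary
# column so that on='date' refers to a column in both frames.
merged = pd.merge_asof(left, right.sort_index().reset_index(),
                       on='date', by='id')
print(merged)
```

The second row has no earlier match for id 2, so its value is NaN, which mirrors the behaviour of row 3 in the output above.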