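I have been trying to use a custom date comparison with the Python Record Linkage Toolkit (recordlinkage). I found this site helpful for developing the comparison algorithm. Unfortunately, I believe the method outlined there is no longer supported. The recordlinkage website gives the following instructions for overriding _compute_vectorized: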
from recordlinkage.base import BaseCompareFeature

class CustomFeature(BaseCompareFeature):

    def _compute_vectorized(s1, s2):
        # algorithm that compares s1 and s2

        # return a pandas.Series
        return ...

feat = CustomFeature()
feat.compute(pairs, dfA, dfB)
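Following this, I tried to write my own comparison (the logic works as a standalone function), but I am not familiar with classes and I get the error shown below.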
class DateAppr(BaseCompareFeature):

    def _compute_vectorized(d1, d2, day_margin = 7):
        # Absolute time difference in days
        tdelta = d1 - d2
        tdays = abs(tdelta.days)
        score = 0
        if tdays <= day_margin:
            score += 1
        else:
            days_out = min(tdays - day_margin, 100)
            penalty = (100-days_out)**2 / 100**2
            score += penalty
        return pd.Series(score, dtype='float64')

indexer = recordlinkage.Index()
indexer.block(left_on=('district'), right_on=('District'))
candidate_links = indexer.index(df1, df2)
feature = DateAppr('dob', 'min_dob')
date_vectors = feature.compute(candidate_links, x = df1, x_link = df2)
Error:
  File "/anaconda3/lib/python3.6/site-packages/pandas/core/arrays/datetimelike.py", line 1325, in __rsub__
    return -(self - other)
TypeError: unsupported operand type(s) for -: 'DatetimeArray' and 'DateAppr'
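My reading of the traceback is that the subtraction ends up with the DateAppr object itself as one operand. That would happen if the library calls _compute_vectorized as a bound method, so that without a self parameter the instance lands in d1, the first Series lands in d2, and the second Series lands in day_margin. Below is a minimal sketch of what I think a corrected, vectorized version could look like. FeatureStandIn is a hypothetical stand-in I wrote so the snippet runs with only pandas installed; it is not the real BaseCompareFeature, and DateApprFixed and the sample dates are likewise made up for illustration. The scoring rule is the same one as in my class, but it uses .dt.days because d1 - d2 is a Series of timedeltas rather than a single timedelta.

```python
import pandas as pd


class FeatureStandIn:
    """Hypothetical stand-in for BaseCompareFeature (an assumption, not the
    library class): it calls the hook as a bound method with two Series."""

    def compute(self, s1, s2):
        return self._compute_vectorized(s1, s2)


class DateApprFixed(FeatureStandIn):
    def __init__(self, day_margin=7):
        self.day_margin = day_margin

    def _compute_vectorized(self, d1, d2):
        # With `self` present, d1 and d2 are the two datetime Series.
        tdays = (d1 - d2).dt.days.abs()
        # Days beyond the margin, capped at 100 as in the scalar version.
        days_out = (tdays - self.day_margin).clip(upper=100)
        penalty = (100 - days_out) ** 2 / 100 ** 2
        # Pairs inside the margin score 1.0; the others keep their penalty.
        return penalty.mask(tdays <= self.day_margin, 1.0).astype("float64")


# Made-up sample dates: 4, 8 and 366 days apart.
d1 = pd.Series(pd.to_datetime(["2000-01-01", "2000-01-01", "2000-01-01"]))
d2 = pd.Series(pd.to_datetime(["2000-01-05", "2000-01-09", "2001-01-01"]))
scores = DateApprFixed(day_margin=7).compute(d1, d2)
print(scores)
```

With the real base class, the constructor would presumably also have to pass the column labels ('dob', 'min_dob') on to BaseCompareFeature.__init__, which this sketch leaves out.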