Custom comparison with Python RecordLinkage

Time: 2019-03-14 15:26:55

Tags: python record-linkage

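I have been trying to use a custom date comparison with Python's recordlinkage library. I found this site helpful for developing the comparison algorithm. Unfortunately, I believe the approach outlined there is no longer supported. The recordlinkage website gives these instructions for overriding _compute_vectorized: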

from recordlinkage.base import BaseCompareFeature

class CustomFeature(BaseCompareFeature):

    def _compute_vectorized(s1, s2):
        # algorithm that compares s1 and s2

        # return a pandas.Series
        return ...

feat = CustomFeature()
feat.compute(pairs, dfA, dfB)
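
One detail worth checking in the snippet as quoted: `_compute_vectorized` is declared without a `self` parameter. In plain Python, a method declared that way still receives the instance as its first argument when it is called on an object, so every other argument shifts one position to the right. The toy classes below are hypothetical stand-ins (not recordlinkage internals) and only illustrate that shift, assuming the base class invokes the hook as `self._compute_vectorized(...)`.

```python
class Base:
    """Hypothetical stand-in for a framework base class."""

    def compute(self, left, right):
        # The hook is called as a bound method, so Python passes the
        # instance implicitly as the first positional argument.
        return self._compute_vectorized(left, right)


class WithoutSelf(Base):
    def _compute_vectorized(s1, s2, margin=7):
        # s1 receives the instance, s2 receives `left`,
        # and `right` silently replaces the default margin.
        return type(s1).__name__, s2, margin


class WithSelf(Base):
    def _compute_vectorized(self, s1, s2, margin=7):
        # With `self` declared, the data arguments land where expected.
        return s1, s2, margin


shifted = WithoutSelf().compute("left", "right")
correct = WithSelf().compute("left", "right")
```

In the first class the data meant for `s1` ends up in `s2`, which is the kind of mix-up that declaring `self` explicitly avoids.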

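Following this, I tried to write a comparison (it works as a standalone function), but I am not familiar with classes and I get an error (below).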

class DateAppr(BaseCompareFeature):

    def _compute_vectorized(d1, d2, day_margin = 7):
        # Absolute time difference in days
        tdelta = d1 - d2
        tdays = abs(tdelta.days)
        score = 0
        if tdays <= day_margin:
            score += 1
        else:
            days_out = min(tdays - day_margin, 100)
            penalty = (100-days_out)**2 / 100**2
            score += penalty
        return pd.Series(score, dtype='float64')

indexer = recordlinkage.Index()
indexer.block(left_on=('district'), right_on=('District'))
candidate_links = indexer.index(df1, df2)

feature = DateAppr('dob', 'min_dob')
date_vectors = feature.compute(candidate_links, x = df1, x_link = df2)

Error

  File "/anaconda3/lib/python3.6/site-packages/pandas/core/arrays/datetimelike.py", line 1325, in __rsub__
    return -(self - other)

TypeError: unsupported operand type(s) for -: 'DatetimeArray' and 'DateAppr'
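
The `TypeError` names `DateAppr` itself as an operand, which is consistent with `_compute_vectorized` being declared without `self`, so that the instance lands in `d1`. Separately, the method body is written for a single pair of dates (`tdelta.days`, `if tdays <= day_margin`), while the comparison is presumably called with whole pandas Series. The sketch below is one possible element-wise version of the same scoring rule. `date_appr_score` is a hypothetical helper name, both inputs are assumed to be datetime-like Series sharing an index, and scoring missing dates as 0.0 is an assumption rather than something stated in the question.

```python
import pandas as pd


def date_appr_score(d1, d2, day_margin=7):
    """Score each pair of dates: 1.0 within the margin, decaying to 0.0."""
    # Element-wise absolute difference in whole days (NaN if a date is missing).
    tdays = (pd.to_datetime(d1) - pd.to_datetime(d2)).abs().dt.days
    # Days beyond the margin, limited to the range [0, 100].
    days_out = (tdays - day_margin).clip(lower=0, upper=100)
    # days_out == 0 yields 1.0 and days_out == 100 yields 0.0.
    score = (100 - days_out) ** 2 / 100 ** 2
    # Assumption: a pair with a missing date gets a score of 0.0.
    return score.fillna(0.0).astype("float64")


# Small made-up sample: 4 days apart, 57 days apart, 366 days apart, missing.
dob = pd.Series(pd.to_datetime(["2000-01-01", "2000-01-01", "2000-01-01", None]))
min_dob = pd.Series(
    pd.to_datetime(["2000-01-05", "2000-02-27", "2001-01-01", "2000-01-01"])
)
scores = date_appr_score(dob, min_dob)
```

Inside the class, a method declared as `def _compute_vectorized(self, d1, d2)` could then return `date_appr_score(d1, d2)`. This has not been checked against a specific recordlinkage version.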

0 answers:

No answers