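String matching in Python for records referring to the same entity

Time: 2018-01-16 05:04:31

Tags: python-3.x fuzzywuzzy

I am working on an entity-matching problem in which I have to check whether records refer to the same business entity. Consider the two pipe-separated records below: the text on each side of the pipe refers to the same entity. In the first record both sides have "Fairvill" in common, and in the second record both sides have "walmart 901" in common. Is there a string-matching function that can perform this kind of comparison?

I have tried soundex and fuzzywuzzy in Python, but the results were not that promising. Any help is greatly appreciated.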

List<String> subjectArr = Arrays.asList("aa", "bb", "cc");
List<Long> numArr = Arrays.asList(2L, 6L, 4L);
List<Pair> pairs = Streams.zip(subjectArr.stream(), numArr.stream(), Pair::new)
        .collect(Collectors.toList());
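To make the comparison being asked about concrete, here is a minimal sketch that uses only the Python standard library. The record strings are hypothetical stand-ins built around the two shared fragments mentioned above ("Fairvill" and "walmart 901"), and the helper names `tokens`, `shared_tokens`, `token_overlap` and `char_ratio` are illustrative, not part of soundex or fuzzywuzzy.

```python
import re
from difflib import SequenceMatcher


def tokens(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def shared_tokens(left, right):
    """Return the tokens of `left` that also occur in `right`, in left-side order."""
    right_set = set(tokens(right))
    return [t for t in tokens(left) if t in right_set]


def token_overlap(left, right):
    """Fraction of the smaller token set that also appears on the other side."""
    a, b = set(tokens(left)), set(tokens(right))
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def char_ratio(left, right):
    """Character-level similarity in [0, 1] from difflib, case-insensitive."""
    return SequenceMatcher(None, left.lower(), right.lower()).ratio()


# Hypothetical pipe-separated records; the real data is not shown in the question.
records = [
    "Fairvill Foods Inc|Fairvill Market",
    "Walmart 901 Supercenter|WALMART STORE 901",
]

for record in records:
    left, right = record.split("|")
    print(shared_tokens(left, right), token_overlap(left, right), char_ratio(left, right))
```

Token overlap ignores word order and extra words, which is why it can pick up a shared name such as "walmart 901" even when a character-level ratio is low. Any threshold for deciding "same entity" would have to be tuned on real records.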

1 answer:

Answer 0 (score: 0)

Reference

from collections import Counter, defaultdict

# Both functions are methods of a model class that defines self.order
# (context length in characters), self.smoothing_missed and self.smoothing_total.

def fit(self, sentence_pairs):
    """ Estimate the probability of each symbol being missed
    Parameters:
        sentence_pairs - list of (original phrase, abbreviation)
    In the abbreviation, all missed symbols are replaced with "-"
    """
    # context -> letter -> number of times the letter was dropped / was seen
    self.missed_counter_ = defaultdict(Counter)
    self.total_counter_ = defaultdict(Counter)
    for (original, observed) in sentence_pairs:
        for i, (original_letter, observed_letter) \
                in enumerate(zip(original[self.order:], observed[self.order:])):
            # the self.order characters that precede the current letter
            context = original[i:(i+self.order)]
            if observed_letter == '-':
                self.missed_counter_[context][original_letter] += 1
            self.total_counter_[context][original_letter] += 1

def predict_proba(self, context, last_letter):
    """ Estimate the probability of last_letter being missed after context"""
    if self.order:
        local = context[-self.order:]
    else:
        local = ''
    # additive smoothing keeps the estimate defined for unseen (context, letter) pairs
    missed_freq = self.missed_counter_[local][last_letter] + self.smoothing_missed
    total_freq = self.total_counter_[local][last_letter] + self.smoothing_total
    return missed_freq / total_freq
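
The two methods above need a surrounding class before they can run. The following is a minimal sketch of one way to wrap and call them; the class name `MissingLetterModel`, the constructor defaults and the training pairs are assumptions made for illustration, not taken from the answer.

```python
from collections import Counter, defaultdict


class MissingLetterModel:
    """Hypothetical wrapper: estimates how likely a letter is to be dropped
    in an abbreviation, given the preceding `order` characters."""

    def __init__(self, order=1, smoothing_missed=0.3, smoothing_total=1.0):
        # Default values are assumptions chosen for this sketch.
        self.order = order
        self.smoothing_missed = smoothing_missed
        self.smoothing_total = smoothing_total

    def fit(self, sentence_pairs):
        """sentence_pairs: list of (original, abbreviation) of equal length,
        where every dropped symbol in the abbreviation is written as '-'."""
        self.missed_counter_ = defaultdict(Counter)
        self.total_counter_ = defaultdict(Counter)
        for original, observed in sentence_pairs:
            letters = zip(original[self.order:], observed[self.order:])
            for i, (original_letter, observed_letter) in enumerate(letters):
                context = original[i:i + self.order]
                if observed_letter == '-':
                    self.missed_counter_[context][original_letter] += 1
                self.total_counter_[context][original_letter] += 1
        return self

    def predict_proba(self, context, last_letter):
        """Smoothed probability that last_letter is dropped after context."""
        local = context[-self.order:] if self.order else ''
        missed = self.missed_counter_[local][last_letter] + self.smoothing_missed
        total = self.total_counter_[local][last_letter] + self.smoothing_total
        return missed / total


# Hypothetical training data: vowels after the first letter are dropped.
pairs = [("walmart", "w-lm-rt"), ("market", "m-rk-t")]
model = MissingLetterModel(order=1).fit(pairs)

# 'a' after 'm' was dropped in both pairs, 'r' after 'a' was never dropped,
# so the first estimate is larger than the second.
print(model.predict_proba("m", "a"))
print(model.predict_proba("a", "r"))
```

With these probabilities, a candidate full name can be scored against an observed abbreviation by combining the per-letter estimates, which is a different approach from the edit-distance scores that soundex or fuzzywuzzy produce.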