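I am working on an entity-matching problem in which I have to check whether records refer to the same business entity. See the two pipe-separated records below: the text on either side of the pipe refers to the same entity. The first record has "Fairvill" in common and the second record has "Walmart 901" in common. Is there any string-matching function that can perform this kind of comparison?
I tried soundex and fuzzywuzzy in Python, but the results were not that promising. Any help is greatly appreciated.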
List<String> subjectArr = Arrays.asList("aa", "bb", "cc");
List<Long> numArr = Arrays.asList(2L, 6L, 4L);
List<Pair> pairs = Streams.zip(subjectArr.stream(), numArr.stream(), Pair::new)
.collect(Collectors.toList());
Answer 0 (score: 0)
For reference:
# Excerpt: both functions are methods of a model class whose definition is not shown.
# They expect self.order, self.smoothing_missed and self.smoothing_total to be set.
from collections import Counter, defaultdict

def fit(self, sentence_pairs):
    """ Estimate of missing probability for each symbol
    Parameters:
        sentence_pairs - list of (original phrase, abbreviation)
        In the abbreviation, all missed symbols are replaced with "-"
    """
    # context -> Counter of letters that were dropped / seen after that context
    self.missed_counter_ = defaultdict(Counter)
    self.total_counter_ = defaultdict(Counter)
    for (original, observed) in sentence_pairs:
        for i, (original_letter, observed_letter) \
                in enumerate(zip(original[self.order:], observed[self.order:])):
            # The `order` characters of the original that precede this letter
            context = original[i:(i+self.order)]
            if observed_letter == '-':
                self.missed_counter_[context][original_letter] += 1
            self.total_counter_[context][original_letter] += 1

def predict_proba(self, context, last_letter):
    """ Estimate of probability of last_letter being missed after context"""
    if self.order:
        local = context[-self.order:]
    else:
        local = ''
    # Additive smoothing keeps unseen (context, letter) pairs away from 0/0
    missed_freq = self.missed_counter_[local][last_letter] + self.smoothing_missed
    total_freq = self.total_counter_[local][last_letter] + self.smoothing_total
    return missed_freq / total_freq
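The two methods above can be tried out with a minimal sketch like the following. The class name `MissedSymbolModel`, the constructor and its default values, and the toy training pairs are all assumptions made for illustration; the answer does not show them.

```python
from collections import Counter, defaultdict


class MissedSymbolModel:
    """Hypothetical wrapper class; its name and defaults are assumptions."""

    def __init__(self, order=1, smoothing_missed=0.3, smoothing_total=1.0):
        self.order = order                        # context length in characters
        self.smoothing_missed = smoothing_missed  # added to the "missed" count
        self.smoothing_total = smoothing_total    # added to the total count

    def fit(self, sentence_pairs):
        self.missed_counter_ = defaultdict(Counter)
        self.total_counter_ = defaultdict(Counter)
        for original, observed in sentence_pairs:
            pairs = zip(original[self.order:], observed[self.order:])
            for i, (original_letter, observed_letter) in enumerate(pairs):
                context = original[i:i + self.order]
                if observed_letter == '-':
                    self.missed_counter_[context][original_letter] += 1
                self.total_counter_[context][original_letter] += 1
        return self

    def predict_proba(self, context, last_letter):
        local = context[-self.order:] if self.order else ''
        missed = self.missed_counter_[local][last_letter] + self.smoothing_missed
        total = self.total_counter_[local][last_letter] + self.smoothing_total
        return missed / total


# Toy data: each abbreviation is aligned with its original, "-" marks a dropped letter.
training_pairs = [("walmart", "w-lm-rt"), ("market", "m-rk-t")]
model = MissedSymbolModel(order=1, smoothing_missed=0.5, smoothing_total=1.0)
model.fit(training_pairs)

# "a" was dropped after "m" in both examples, "r" was never dropped after "a".
print(model.predict_proba("m", "a"))
print(model.predict_proba("a", "r"))
```

With this toy data the model gives a much higher drop probability to "a" after "m" than to "r" after "a". Note that the abbreviations must already be aligned character by character with the originals, so this covers only the estimation step, not the alignment of real records.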