NumPy argmax with a no-duplicates constraint - fuzzy string matching

Time: 2014-08-03 01:31:46

Tags: string python-2.7 numpy matrix fuzzy-comparison

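I have two lists of strings, l1 and l2. For each string in l1, I want to find the best-matching string in l2 (but not the other way around; I only care about the strings in l1). I know there are no perfect matches. I compute the similarity of each pair with the Jaro-Winkler score, using the jellyfish module.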

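To do this, I build a matrix of all the Jaro-Winkler scores and then take the maximum of each row. The problem is that a string from l2 can sometimes be the best match for more than one string from l1, and I want to prevent that.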
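A minimal sketch of the problem, using a small made-up score matrix in place of real Jaro-Winkler scores (the numbers are assumptions chosen only for illustration). A row-wise argmax picks the same column for two different rows:

```python
import numpy as np

# Toy similarity scores (made-up): rows stand for l1 strings, columns for l2 strings.
scores = np.array([
    [0.91, 0.10, 0.20],
    [0.87, 0.30, 0.86],
    [0.05, 0.95, 0.40],
])

best = scores.argmax(axis=1)  # best l2 column for every l1 row

# Find the l2 columns that were chosen by more than one l1 row.
cols, counts = np.unique(best, return_counts=True)
duplicated = cols[counts > 1]

print(best)        # [0 0 1]
print(duplicated)  # [0]
```

Rows 0 and 1 both select column 0, even though row 1 has a nearly as good alternative in column 2.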

Is there a way to constrain the argmax so that each index can appear only once in the result?

As an example, the two lists and the code are below:

l1 = ['skinnycorebrokenblack184567', 'promtex2365h6', 'lovelinen940770', 'promtex2365h1', 'lovetrs844705', 
      'lovetrs844704', 'bennttrs49655', 'stella55900', 'kaxsprassel55250', 'smurfbs185573', 'kaxsprassel55880', 
      'victoriacort182062', 'juliatreggings916531', 'juliatreggings916530', 'milo63624505', 'promtex2365s2', 
      'promtex2365s1', 'promtex2365s6', 'promtex2365s4', 'stantwill160810', 'topazchini51081', 'topazchini51087',
      'juliatreggings187109', 'hansentrs50924', '2454s1ladiesjeanscolure', 'promtex2365h2']
l2 = ['stannewtwill160810', 'stellatrs55900', 'jennyhigh352300', 'victoriacort180565', 
      'mistylowribsatins818820202031', 'lovelinen940771', 'kaxsprasseltrs55250', 'milo63626624', 'lovetrs844702',
      'sarabootcuts842887019398270', 'sarabootcuts84288701939', 'victoriacords81805848817', 
      'ladiesjeanscolouredxxl2454s340999', 'julliatregging1871168817', 'logandrawstringpants92686705656', 
      '72480', 'victoriacords85203408817', 'julliatregging9673907817', 'lilypoplin9418412031', 'stellatrs56023',
      'tysontrs50626', 'bolttrousers51370', 'bellamystripe184539', 'tenrhino63602214', 'kidsthermotrousers2365h1',
      'bennytrouser53648', 'bluerinse070201072', 'topazchino51077', 'slimclassicblack674220203128999', 
      'milo63603812', 'milo63603813', 'milo63603814', 'slimclassicblack6742202031', 'lilypoplin9418402031', 
      'julliatregging9673917817', 'smurfjr185606', 'sarabootcuts81884571939', 'julliatregging9165318817']

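import numpy as np
import jellyfish as jf

# Note: jf.jaro_distance computes the plain Jaro score; jf.jaro_winkler gives the
# Jaro-Winkler score. Newer jellyfish releases expose these as jaro_similarity
# and jaro_winkler_similarity.

# create the score matrix, zeroing out scores that do not exceed the 0.85 threshold
scores = np.array([[jf.jaro_distance(st1, st2) for st2 in l2] for st1 in l1])
mat = np.where(scores > 0.85, scores, 0)

# index of the best-scoring l2 string for each l1 string
mat_max = mat.argmax(axis=1)

# create the match dictionary, skipping l1 strings with no score above the threshold
match_dict = {}
for x in range(len(l1)):
    if mat[x, mat_max[x]] > 0:
        match_dict[l1[x]] = l2[mat_max[x]]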

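In the example above, 'topazchino51077' from l2 is matched twice with strings from l1. This is exactly what I want to prevent. A string from l2 should be paired only with its best match.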

1 Answer:

Answer 0 (score: 1)

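You can model this as the classic stable marriage problem. In your problem, the pairwise matching preference is given by the jaro_distance of the two strings; we want to match each string in l1 with the closest string in l2, unless that string is already paired with another string from l1 that has a higher similarity.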

The core of the algorithm is described here. One possible implementation:

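import numpy as np
import jellyfish as jf

xs = np.array([[jf.jaro_distance(x, y) for y in l2] for x in l1])
order = np.argsort(xs, axis=1)  # per l1 string: l2 indices sorted by ascending similarity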

FREE = -1 # special value to indicate no match yet
match = FREE * np.ones(len(l1), dtype=np.int_)
# position in order[i] of the last l2 candidate tried by l1[i]; counts down from the best
jnext = len(l2) * np.ones(len(l1), dtype=np.int_)

# reverse match: index of the l1 string each l2 string is paired with (FREE if none)
rev_match = FREE * np.ones(len(l2), dtype=np.int_)

# note: this loop assumes len(l1) <= len(l2), so every l1 string can get a partner
while np.any(match == FREE):  # while there is an unmatched string
    i = np.where(match == FREE)[0][0]  # take the first unmatched index
    jnext[i] -= 1
    j = order[i, jnext[i]] # next l2 string that l1[i] will be matched against
    if rev_match[j] == FREE:  # l2[j] is free, pair i & j together
        rev_match[j], match[i] = i, j
        print('{:30} --> {}'.format(l1[i], l2[j]))
    else: # l2[j] is already paired
        k = rev_match[j] # current l1 string that l2[j] is paired with
        if xs[k, j] < xs[i, j]:  # l2[j] is more similar to l1[i] than l1[k]
            match[k] = FREE      # unpair k & j, and pair i & j
            rev_match[j], match[i] = i, j
            print('{:30} -/- {}'.format(l1[k], l2[j]))
            print('{:30} --> {}'.format(l1[i], l2[j]))

To see the final matching:

for i, w in enumerate(l1):
    print('{:30} {}'.format(w, l2[match[i]]))
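The same idea can also be packaged as a function that works on any score matrix. The following is a sketch under stated assumptions, not a drop-in replacement: the name match_rows_to_columns and the toy scores are hypothetical, and unlike the loop above it leaves a row unmatched (FREE) once that row has tried every column, which matters when there are more rows than columns.

```python
import numpy as np

FREE = -1  # marks "no partner yet"

def match_rows_to_columns(xs):
    """Pair each row of xs with a distinct column, preferring higher scores.

    Hypothetical helper: rows that run out of candidates stay FREE.
    """
    xs = np.asarray(xs, dtype=float)
    n_rows, n_cols = xs.shape
    order = np.argsort(-xs, axis=1)               # columns by descending score, per row
    next_choice = np.zeros(n_rows, dtype=int)     # position in order[i] of the next column to try
    match = np.full(n_rows, FREE, dtype=int)      # row -> column
    rev_match = np.full(n_cols, FREE, dtype=int)  # column -> row
    pending = list(range(n_rows))                 # rows that still need a partner

    while pending:
        i = pending.pop()
        if next_choice[i] == n_cols:
            continue                              # row i has tried every column: leave it FREE
        j = order[i, next_choice[i]]
        next_choice[i] += 1
        k = rev_match[j]
        if k == FREE:                             # column j is free: pair i and j
            match[i], rev_match[j] = j, i
        elif xs[k, j] < xs[i, j]:                 # column j prefers row i over its current row k
            match[k] = FREE
            match[i], rev_match[j] = j, i
            pending.append(k)                     # row k must look for another column
        else:
            pending.append(i)                     # rejected: row i tries its next column later
    return match

# Toy scores (made-up): rows 0 and 1 both prefer column 0.
toy_scores = [[0.91, 0.10, 0.20],
              [0.87, 0.30, 0.86],
              [0.05, 0.95, 0.40]]
toy_match = match_rows_to_columns(toy_scores)
print(toy_match)  # [0 2 1]
```

Row 0 keeps column 0 because its score there is higher, and row 1 falls back to its second choice, column 2.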

As you can see, in this solution 'topazchino51077' is paired with 'topazchini51087', because these two are more similar:

>>> jf.jaro_distance('topazchini51087', 'topazchino51077')
0.9111
>>> jf.jaro_distance('topazchini51081', 'topazchino51077')
0.8667
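To make these two numbers concrete, here is a pure-Python sketch of the textbook Jaro similarity (no Winkler prefix bonus). The function name jaro is hypothetical, and this is an illustration of the formula rather than jellyfish's actual implementation; it is only assumed to agree with jf.jaro_distance on these inputs.

```python
def jaro(s1, s2):
    """Textbook Jaro similarity between two strings (illustrative sketch)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters match only if they are equal and at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    used1 = [False] * len1
    used2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Transpositions: matched characters that line up in a different order, halved.
    m1 = [c for c, u in zip(s1, used1) if u]
    m2 = [c for c, u in zip(s2, used2) if u]
    transpositions = sum(a != b for a, b in zip(m1, m2)) // 2

    m = float(matches)
    return (m / len1 + m / len2 + (m - transpositions) / m) / 3.0

print(round(jaro('topazchini51087', 'topazchino51077'), 4))  # 0.9111
print(round(jaro('topazchini51081', 'topazchino51077'), 4))  # 0.8667
```

The first pair shares 13 matching characters out of 15 and the second only 12, with no transpositions in either case, which accounts for the difference.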