Question

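I have a list of words (nearly 7 items), and I want to remove the items that are almost identical to another item. For example, if one of my entries is "Agency Account Bank Agreement", I want to remove entries such as "Agency Account Bank Agreement Pursuant".

To estimate whether one entry is close to another, I use the Jaro distance from the jellyfish package in Python.

My current code is: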

import itertools
import jellyfish

corpus3 = ['Agency Account Bank Agreement', 'Agent', 'Agency Account Bank Agreement Pursuant',
       'Agency Account Bank Agreement Notwithstanding', 'Agents', 'Agent', 'Reinvestment Period']
threshold = 0.85  # similarity above which two entries count as near-duplicates
new_corpus = []   # entries kept so far
for a, b in itertools.combinations(corpus3, 2):
    # only compare pairs where at least one entry has two or more words
    if len(a.split()) >= 2 or len(b.split()) >= 2:
        # newer jellyfish releases name this function jaro_similarity
        jf = jellyfish.jaro_distance(a, b)
        if jf > threshold:
            if a in new_corpus and b in new_corpus:
                continue
            else:
                if len(a.strip()) < len(b.strip()):
                    kw = a
                    if not new_corpus:
                        new_corpus.append(a)
                    else:
                        for item in new_corpus:
                            jf = jellyfish.jaro_distance(kw, item)
                            if jf < threshold:
                                new_corpus.append(kw)

This is what I want to end up with:

new_corpus = ['Agency Account Bank Agreement', 'Agent', 'Reinvestment Period']

Answer 1

Suppose you have this list:

numchars = ['one', 'ones', 'two', 'twos', 'three', 'threes']

Say you consider one too similar to ones, and you only want to keep one of the two, so that the revised list looks like this:

numchars = ['ones', 'twos', 'threes']

You can do this to eliminate the items you consider too similar:

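import jellyfish

lower_threshold = 0.85  # example value, the same threshold the question uses

# iterate over a copy so that removing from numchars does not skip elements
for x in list(numchars):
    if any(lower_threshold < jellyfish.jaro_distance(x, _x) and x != _x for _x in numchars):
        numchars.remove(x)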

Depending on the threshold you set and on the order of the list, this may produce a result like this:

numchars = ['ones', 'twos', 'threes']

The main logic of this routine is in this line:

if any(lower_threshold < jellyfish.jaro_distance(x, _x) and x != _x for _x in numchars):

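This says that if a member of numchars, compared against every other member of the list, scores above your lower_threshold against at least one of them, that member should be removed from the list, i.e. numchars.remove(x). The and x != _x condition keeps a member from being registered as too similar to itself.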
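For reference, the score compared against lower_threshold is the Jaro similarity. The answer does not spell it out, so the following is the standard textbook definition rather than anything specific to jellyfish, with m the number of matching characters and t the number of transpositions, worked through for one and ones:

```latex
\[
\mathrm{sim}_j(s_1, s_2) =
\begin{cases}
0 & \text{if } m = 0,\\[4pt]
\dfrac{1}{3}\left(\dfrac{m}{|s_1|} + \dfrac{m}{|s_2|} + \dfrac{m - t}{m}\right) & \text{otherwise}
\end{cases}
\]
% "one" vs "ones": all three characters of "one" match in order, so m = 3 and t = 0
\[
\mathrm{sim}_j(\texttt{one}, \texttt{ones})
= \frac{1}{3}\left(\frac{3}{3} + \frac{3}{4} + \frac{3 - 0}{3}\right)
= \frac{11}{12} \approx 0.917
\]
```

With a threshold of 0.85, for instance, this pair counts as too similar, so one of the two would be removed.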

Arguably, though, the meat of this sandwich is this:

numchars.remove(x)

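This statement ensures that once one has been removed for being too similar to ones, one is no longer a member of the list on later iterations, so ones is never compared against it and is not removed as well. Removing both members of every similar pair would eventually leave you with an empty list.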

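Once you start wanting to keep only the plural form, or some other particular form, out of each group of similar matches, you open a whole other can of worms.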
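One way to sidestep both problems (mutating the list while looping over it, and choosing which form survives) is to build a new list instead. The sketch below is only an illustration, not the answer's method: drop_similar is a hypothetical helper name, and jaro is a small pure-Python stand-in for the jellyfish function so that the example runs without the package; its scores are assumed, not verified, to agree with jellyfish. An item is kept only if it is not too similar to anything already kept, so the first form seen in each group wins.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]; pure-Python stand-in (assumption) for jellyfish."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # characters match only if they are at most `window` positions apart
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # count matched characters that appear in a different order
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def drop_similar(items, threshold=0.85, similarity=jaro):
    """Hypothetical helper: keep an item only if nothing kept so far is too similar."""
    kept = []
    for item in items:
        if not any(similarity(item, k) > threshold for k in kept):
            kept.append(item)
    return kept


corpus3 = ['Agency Account Bank Agreement', 'Agent',
           'Agency Account Bank Agreement Pursuant',
           'Agency Account Bank Agreement Notwithstanding',
           'Agents', 'Agent', 'Reinvestment Period']
print(drop_similar(corpus3))
# ['Agency Account Bank Agreement', 'Agent', 'Reinvestment Period']

numchars = ['one', 'ones', 'two', 'twos', 'three', 'threes']
print(drop_similar(numchars))                                   # first form seen wins
print(drop_similar(sorted(numchars, key=len, reverse=True)))    # longest form wins
```

Because the first item in each group is the one that survives, the input order decides which form is kept: sorting by length in descending order keeps the plurals here, while sorting in ascending order would keep the shortest phrase of each group.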
