Find values in a list that differ from a reference list by at most N characters

Time: 2018-06-20 12:38:47

Tags: python string

I have a list like the following:

Test = ['ASDFGH', 'QWERTYU', 'ZXCVB']

and a reference list like this:

Ref = ['ASDFGY', 'QWERTYI', 'ZXCAA']

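I want to extract the values from Test that differ from any item in Ref by N or fewer characters.

For example, if N = 1, only the first two elements of Test should be output. If N = 2, all three elements meet the criterion and should be returned.

Note that I am only looking for values of the same length (ASDFGY -> ASDFG would not count as a match for N = 1), so I want something more efficient than Levenshtein distance.

I have over 1000 values in Ref and 200 million values in Test, so efficiency is key.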

3 answers:

Answer 0: (score: 1)

Use a generator expression with sum.
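A minimal sketch of this idea, assuming the intent is to count mismatched positions between equal-length strings. The helper names hamming and close_values are hypothetical and do not come from the answer:

```python
def hamming(a, b):
    """Number of positions at which equal-length strings a and b differ."""
    # True counts as 1, so sum() over the generator counts the mismatches.
    return sum(x != y for x, y in zip(a, b))


def close_values(test, ref, n):
    """Values of test within n substitutions of at least one same-length ref value."""
    return [t for t in test
            if any(len(t) == len(r) and hamming(t, r) <= n for r in ref)]


Test = ['ASDFGH', 'QWERTYU', 'ZXCVB']
Ref = ['ASDFGY', 'QWERTYI', 'ZXCAA']

print(close_values(Test, Ref, 1))  # ['ASDFGH', 'QWERTYU']
print(close_values(Test, Ref, 2))  # ['ASDFGH', 'QWERTYU', 'ZXCVB']
```

The length check comes first so that strings of different lengths are never compared position by position.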

Answer 1: (score: 1)

The newer regex module offers the possibility of "fuzzy" matching:

import regex as re

Test = ['ASDFGH', 'QWERTYU', 'ZXCVB']
Ref = ['ASDFGY', 'QWERTYI', 'ZXCAA', 'ASDFGI', 'ASDFGX']


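for item in Test:
    # {s<=3} allows at most three substitutions; escape item so it is matched literally
    rx = re.compile('(' + re.escape(item) + '){s<=3}')
    for r in Ref:
        # fullmatch restricts the comparison to strings of the same length
        if rx.fullmatch(r):
            print(f'{item} is similar to {r}')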

This yields

ASDFGH is similar to ASDFGY
ASDFGH is similar to ASDFGI
ASDFGH is similar to ASDFGX
QWERTYU is similar to QWERTYI
ZXCVB is similar to ZXCAA

You can control this via the {s<=3} part, which allows three or fewer substitutions.
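To tie the substitution limit to the N from the question, the pattern string could be built from a variable. The following is only a sketch: fuzzy_pattern is a hypothetical helper, and it uses the standard library's re solely to escape the literal text, so it builds the pattern without performing any matching.

```python
import re  # standard library; used here only for re.escape


def fuzzy_pattern(item, n):
    """Build a pattern string that allows at most n substitutions in item."""
    return '(' + re.escape(item) + '){s<=' + str(n) + '}'


print(fuzzy_pattern('ASDFGH', 1))  # (ASDFGH){s<=1}
```

The resulting string would then be compiled with the third-party regex module, as in the loop above. The standard re module does not understand the {s<=...} syntax.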


To get the pairs, you could write

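pairs = [(origin, difference)
         for origin in Test
         # compile once per Test item; escape it so it is matched literally
         for rx in [re.compile(rf"({re.escape(origin)}){{s<=3}}")]
         for difference in Ref
         # fullmatch restricts the comparison to strings of the same length
         if rx.fullmatch(difference)]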

which, for

Test = ['ASDFGH', 'QWERTYU', 'ZXCVB']
Ref = ['ASDFGY', 'QWERTYI', 'ZXCAA', 'ASDFGI', 'ASDFGX']

yields the following output:

[('ASDFGH', 'ASDFGY'), ('ASDFGH', 'ASDFGI'), 
 ('ASDFGH', 'ASDFGX'), ('QWERTYU', 'QWERTYI'), 
 ('ZXCVB', 'ZXCAA')]

Answer 2: (score: 1)

Use difflib

Demo:

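import difflib

N = 1
Test = ['ASDFGH', 'QWERTYU', 'ZXCVB']
Ref = ['ASDFGY', 'QWERTYI', 'ZXCAA']
result = []
for t in Test:
    # compare against every Ref value, not only the one at the same index
    for r in Ref:
        if len(t) != len(r):
            continue
        # ndiff emits one "- x" entry per character of t that has no match in r
        c = sum(1 for s in difflib.ndiff(t, r) if s.startswith("-"))
        if c <= N:
            result.append(t)
            break
print(result)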

Output:

['ASDFGH', 'QWERTYU']
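Since the question compares only equal-length strings and has far more Test values than Ref values, one possible refinement is to group Ref by length once and to stop counting as soon as more than N mismatches have been seen. This is a sketch under those assumptions, not part of the answer above, and the names within_n and filter_close are hypothetical:

```python
from collections import defaultdict


def within_n(a, b, n):
    """True if equal-length strings a and b differ in at most n positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > n:
                return False  # early exit: no need to scan the rest
    return True


def filter_close(test, ref, n):
    """Values of test within n substitutions of some same-length ref value."""
    by_length = defaultdict(list)
    for r in ref:
        by_length[len(r)].append(r)
    # .get() avoids creating empty buckets for lengths absent from ref
    return [t for t in test
            if any(within_n(t, r, n) for r in by_length.get(len(t), ()))]


Test = ['ASDFGH', 'QWERTYU', 'ZXCVB']
Ref = ['ASDFGY', 'QWERTYI', 'ZXCAA']
print(filter_close(Test, Ref, 1))  # ['ASDFGH', 'QWERTYU']
```

Each Test value is then compared only with Ref values of its own length, and most non-matching comparisons end after the first few characters.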