I have a list like the following:
Test = ['ASDFGH', 'QWERTYU', 'ZXCVB']
and a reference list like this:
Ref = ['ASDFGY', 'QWERTYI', 'ZXCAA']
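I want to extract the values from Test that differ from any item in Ref by N or fewer characters.
For example, if N = 1, only the first two elements of Test should be output. If N = 2, all three elements meet the condition and should be returned.
Note that I am only interested in values of equal length (ASDFGY -> ASDFG should not count as a match for N = 1), so I want something more efficient than Levenshtein distance.
I have more than 1000 values in Ref and 200 million values in Test, so efficiency is key.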
Answer 0 (score: 1)
Use a generator expression with sum.
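The code for this answer is not shown here, so the following is only a minimal sketch of what counting mismatches with sum and a generator expression could look like. The helper names hamming_within and close_to_any are hypothetical, and the equal-length check is an assumption taken from the question's requirement.

```python
def hamming_within(a, b, n):
    """Return True if a and b have equal length and differ in at most n positions."""
    if len(a) != len(b):
        return False
    # Each comparison yields True (1) or False (0), so sum counts the mismatches.
    return sum(x != y for x, y in zip(a, b)) <= n


def close_to_any(test, ref, n):
    """Keep the items of test that are within n substitutions of any item in ref."""
    return [t for t in test if any(hamming_within(t, r, n) for r in ref)]


Test = ['ASDFGH', 'QWERTYU', 'ZXCVB']
Ref = ['ASDFGY', 'QWERTYI', 'ZXCAA']

print(close_to_any(Test, Ref, 1))  # ['ASDFGH', 'QWERTYU']
print(close_to_any(Test, Ref, 2))  # ['ASDFGH', 'QWERTYU', 'ZXCVB']
```

Because any stops at the first Ref entry that qualifies, a Test item with a close match early in Ref is not compared against the rest of the list.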
Answer 1 (score: 1)
The newer regex module offers "fuzzy" matching:
import regex as re  # third-party "regex" module, not the built-in re

Test = ['ASDFGH', 'QWERTYU', 'ZXCVB']
Ref = ['ASDFGY', 'QWERTYI', 'ZXCAA', 'ASDFGI', 'ASDFGX']

for item in Test:
    # {s<=3} permits up to three substitutions when matching item
    rx = re.compile('(' + item + '){s<=3}')
    for r in Ref:
        if rx.search(r):
            print(rf'{item} is similar to {r}')
This produces
ASDFGH is similar to ASDFGY
ASDFGH is similar to ASDFGI
ASDFGH is similar to ASDFGX
QWERTYU is similar to QWERTYI
ZXCVB is similar to ZXCAA
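You can control this through the {s<=3} part, which allows three or fewer substitutions. The same search can be written as a list comprehension that collects the matching pairs: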
pairs = [(origin, difference)
         for origin in Test
         for rx in [re.compile(rf"({origin}){{s<=3}}")]
         for difference in Ref
         if rx.search(difference)]
For the inputs
Test = ['ASDFGH', 'QWERTYU', 'ZXCVB']
Ref = ['ASDFGY', 'QWERTYI', 'ZXCAA', 'ASDFGI', 'ASDFGX']
this yields the following output:
[('ASDFGH', 'ASDFGY'), ('ASDFGH', 'ASDFGI'),
('ASDFGH', 'ASDFGX'), ('QWERTYU', 'QWERTYI'),
('ZXCVB', 'ZXCAA')]
Answer 2 (score: 1)
Using difflib.
Demo:
import difflib

N = 1
Test = ['ASDFGH', 'QWERTYU', 'ZXCVB']
Ref = ['ASDFGY', 'QWERTYI', 'ZXCAA']

result = []
# zip pairs each Test item with the Ref item at the same position only
for i, v in zip(Test, Ref):
    c = 0
    # ndiff marks characters present only in i with a leading "-"
    for s in difflib.ndiff(i, v):
        if s.startswith("-"):
            c += 1
    if c <= N:
        result.append(i)
print(result)
Output:
['ASDFGH', 'QWERTYU']
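One caveat worth checking: counting the "-" lines from difflib.ndiff measures deletions in an alignment, which is not the same as counting positions that differ. The sketch below compares the two counts on a shifted string; the helper names minus_count and positional_mismatches are hypothetical and exist only for this illustration.

```python
import difflib


def minus_count(a, b):
    """Count the characters that ndiff reports as present only in a."""
    return sum(1 for s in difflib.ndiff(a, b) if s.startswith("-"))


def positional_mismatches(a, b):
    """Count the positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


# On the sample data both counts agree.
print(minus_count('ASDFGH', 'ASDFGY'), positional_mismatches('ASDFGH', 'ASDFGY'))

# On a rotated string they diverge: ndiff aligns the shared run 'BCD',
# while a position-by-position comparison finds no matching character.
print(minus_count('ABCD', 'BCDA'), positional_mismatches('ABCD', 'BCDA'))
```

If the goal is strictly "same length, at most N substituted characters", the positional count is the one that matches the question's definition.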