Checking for a fuzzy/approximate substring inside a longer string in Python?

Time: 2013-07-19 07:51:40

Tags: python python-2.7 fuzzy-search

Finding approximate matches between two strings is easy with algorithms such as Levenshtein distance (the Levenshtein module or difflib), e.g.:

>>> import difflib
>>> difflib.SequenceMatcher(None,"amazing","amaging").ratio()
0.8571428571428571

A fuzzy match can then be detected by choosing a threshold that suits the use case.
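As a minimal sketch of such a threshold check, the ratio from difflib can be wrapped in a small helper. The function name `is_fuzzy_match` and the default threshold of 0.8 are assumptions made for illustration, not part of any library.

```python
import difflib


def is_fuzzy_match(a, b, threshold=0.8):
    """Return True if the difflib similarity ratio of a and b reaches threshold.

    Hypothetical helper: the name and the default threshold are illustrative.
    """
    return difflib.SequenceMatcher(None, a, b).ratio() >= threshold


# "amazing" vs "amaging" has a ratio of about 0.857 (see above).
print(is_fuzzy_match("amazing", "amaging"))        # passes the 0.8 threshold
print(is_fuzzy_match("amazing", "amaging", 0.9))   # fails a stricter threshold
```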

Current requirement: find fuzzy substrings of a larger string, based on a threshold.

For example:

large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"
#result = "manhatan","manhattin" and their indexes in large_string

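One brute-force solution is to generate all substrings of length N-1 to N+1 (or some other range of matching lengths), where N is the length of query_string, run Levenshtein on each of them one by one, and compare the result against the threshold.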
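The brute-force idea can be sketched roughly as follows. This is only an illustration under stated assumptions: `levenshtein` is a plain dynamic-programming implementation written here for self-containment, and `brute_force_fuzzy_find`, `max_distance`, and `slack` are hypothetical names. The threshold is expressed as a maximum edit distance rather than a ratio.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or free if equal
            ))
        previous = current
    return previous[-1]


def brute_force_fuzzy_find(large_string, query_string, max_distance=1, slack=1):
    """Yield (start, substring, distance) for every window of length
    N-slack .. N+slack whose edit distance to query_string is small enough."""
    n = len(query_string)
    for length in range(max(1, n - slack), n + slack + 1):
        for start in range(len(large_string) - length + 1):
            candidate = large_string[start:start + length]
            distance = levenshtein(candidate, query_string)
            if distance <= max_distance:
                yield start, candidate, distance


large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"
for hit in brute_force_fuzzy_find(large_string, query_string):
    print(hit)
```

Every window costs a full edit-distance computation, so the total work grows with the length of the large string times the square of the query length, which is why a better approach is being asked for.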

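Is there a better solution available in Python, preferably in a module included with Python 2.7, or otherwise in an externally available module?

Update: The Python regex module works pretty well. It is a little slower than the built-in re module for the fuzzy substring case, which is to be expected given the extra work involved. The output is what I wanted, and the amount of fuzziness is easy to control.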

>>> import regex
>>> input = "Monalisa was painted by Leonrdo da Vinchi"
>>> regex.search(r'\b(leonardo){e<3}\s+(da)\s+(vinci){e<2}\b',input,flags=regex.IGNORECASE)
<regex.Match object; span=(23, 41), match=' Leonrdo da Vinchi', fuzzy_counts=(0, 2, 1)>

3 Answers:

Answer 0 (score: 17)

How about using difflib.SequenceMatcher.get_matching_blocks?

>>> import difflib
>>> large_string = "thelargemanhatanproject"
>>> query_string = "manhattan"
>>> s = difflib.SequenceMatcher(None, large_string, query_string)
>>> sum(n for i,j,n in s.get_matching_blocks()) / float(len(query_string))
0.8888888888888888

>>> query_string = "banana"
>>> s = difflib.SequenceMatcher(None, large_string, query_string)
>>> sum(n for i,j,n in s.get_matching_blocks()) / float(len(query_string))
0.6666666666666666

Update

import difflib

def matches(large_string, query_string, threshold):
    words = large_string.split()
    for word in words:
        s = difflib.SequenceMatcher(None, word, query_string)
        match = ''.join(word[i:i+n] for i, j, n in s.get_matching_blocks() if n)
        if len(match) / float(len(query_string)) >= threshold:
            yield match

large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"
print list(matches(large_string, query_string, 0.8))

The code above prints: ['manhatan', 'manhattn']
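The question also asks for the indexes in large_string, which the generator above does not report. One possible variation, sketched under the same whitespace-splitting assumption, takes the span from the first to the last matching block of each word. The name `matches_with_spans` is hypothetical, and the reported span can contain characters that did not match (such as the "i" in "manhattin").

```python
import difflib
import re


def matches_with_spans(large_string, query_string, threshold):
    """Yield (start, end, text) for words that resemble query_string.

    Hypothetical variation of matches(): same scoring, but it reports where
    the matched region sits inside large_string.
    """
    for word in re.finditer(r'\S+', large_string):
        s = difflib.SequenceMatcher(None, word.group(), query_string)
        # Drop the zero-length sentinel block that difflib always appends.
        blocks = [b for b in s.get_matching_blocks() if b.size]
        matched = sum(b.size for b in blocks)
        if blocks and matched / float(len(query_string)) >= threshold:
            start = word.start() + blocks[0].a
            end = word.start() + blocks[-1].a + blocks[-1].size
            yield start, end, large_string[start:end]


large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"
print(list(matches_with_spans(large_string, query_string, 0.8)))
```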

Answer 1 (score: 13)

The new regex library, which is intended to eventually replace re, includes fuzzy matching.

https://pypi.python.org/pypi/regex/

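The fuzzy matching syntax looks fairly expressive. For example, the following gives you a match with at most one error (insertion, substitution, or deletion).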

import regex
regex.match('(amazing){e<=1}', 'amaging')

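Answer 2 (score: 11)

I recently wrote an alignment library for Python: https://github.com/eseraygun/python-alignment

With it you can perform both global and local alignments on any pair of sequences, using arbitrary scoring strategies. In your case you actually need a semi-local alignment, because you don't care about substrings of query_string. In the code below I simulate the semi-local algorithm using local alignment plus a few heuristics, but it would be easy to extend the library with a proper implementation.

Here is the example code from the README file, modified for your case.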

from alignment.sequence import Sequence, GAP_ELEMENT
from alignment.vocabulary import Vocabulary
from alignment.sequencealigner import SimpleScoring, LocalSequenceAligner

large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"

# Create sequences to be aligned.
a = Sequence(large_string)
b = Sequence(query_string)

# Create a vocabulary and encode the sequences.
v = Vocabulary()
aEncoded = v.encodeSequence(a)
bEncoded = v.encodeSequence(b)

# Create a scoring and align the sequences using local aligner.
scoring = SimpleScoring(1, -1)
aligner = LocalSequenceAligner(scoring, -1, minScore=5)
score, encodeds = aligner.align(aEncoded, bEncoded, backtrace=True)

# Iterate over optimal alignments and print them.
for encoded in encodeds:
    alignment = v.decodeSequenceAlignment(encoded)

    # Simulate a semi-local alignment.
    if len(filter(lambda e: e != GAP_ELEMENT, alignment.second)) != len(b):
        continue
    if alignment.first[0] == GAP_ELEMENT or alignment.first[-1] == GAP_ELEMENT:
        continue
    if alignment.second[0] == GAP_ELEMENT or alignment.second[-1] == GAP_ELEMENT:
        continue

    print alignment
    print 'Alignment score:', alignment.score
    print 'Percent identity:', alignment.percentIdentity()
    print

The output for minScore=5 is as follows:

m a n h a - t a n
m a n h a t t a n
Alignment score: 7
Percent identity: 88.8888888889

m a n h a t t - i
m a n h a t t a n
Alignment score: 5
Percent identity: 77.7777777778

m a n h a t t i n
m a n h a t t a n
Alignment score: 7
Percent identity: 88.8888888889

If you remove the minScore argument, you get only the best-scoring matches.

m a n h a - t a n
m a n h a t t a n
Alignment score: 7
Percent identity: 88.8888888889

m a n h a t t i n
m a n h a t t a n
Alignment score: 7
Percent identity: 88.8888888889

Note that all algorithms in the library have O(n * m) time complexity, where n and m are the lengths of the sequences.
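For comparison, a proper semi-local match can be sketched in plain Python with the same O(n * m) cost. This is not the library's implementation; it is a minimal illustration of the standard edit-distance table modified so that a match may start anywhere in the large string (a technique usually attributed to Sellers). The function name `approximate_match_ends` is hypothetical, and it scores by edit distance rather than by the alignment scoring used above.

```python
def approximate_match_ends(large_string, query_string, max_distance):
    """Return (end, distance) pairs where some substring of large_string
    ending just before index `end` is within max_distance edits of
    query_string. Start positions would need a backtrace, not shown here."""
    m = len(query_string)
    # Column for the empty text: building query_string[:j] costs j insertions.
    previous = list(range(m + 1))
    hits = []
    for end, ch in enumerate(large_string, 1):
        current = [0]  # cost 0 in row 0: a match may start at any position
        for j in range(1, m + 1):
            current.append(min(
                previous[j] + 1,                            # extra text char
                current[j - 1] + 1,                         # missing query char
                previous[j - 1] + (ch != query_string[j - 1]),  # (mis)match
            ))
        if current[m] <= max_distance:
            hits.append((end, current[m]))
        previous = current
    return hits


large_string = "thelargemanhatanproject is a great project in themanhattincity"
print(approximate_match_ends(large_string, "manhattan", 1))
```

Unlike the brute-force window scan, this visits each character of the large string once and needs no guess about candidate substring lengths.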