Question

我需要弄清楚A是否是B的子串，同时原谅了一定数量的错误（n）。 A和B都是字符串，n是整数。最大的问题是我内心的循环。我不知道如何编码它以便输出我想要的东西。它应循环通过B寻找A与n错误的匹配。如果A是“abcd”而B是“bdefabddghl”那么输出应该是A在4位被发现且有1个错误。这是我现在的代码。我只需要知道如何编写内部while循环。非常感谢任何帮助。

def subsequence(A,B,n):
answer = -1             # assume worst case for there-exists type of loop
j = 0                   # index b
while j<=len(B)-len(A) and answer==-1:  # there exists j in b loop
    bx = j              # assume a is found in b starting at b[j]
    i = 0               # assume best case for a for-all type of loop
    axy = n             # accumulator for n
        while i<len(A) and bx==j and axy > 0:   # for all i in a
        if A[i] == B[j-i] and axy > 0:
            bx = j
            axy = n         # accumulator for n
        if A[i] != B[j-i] and axy > 0:
            axy -= 1
        i+=1
        # while i
    j+=1
# while j
end = "A best match with " + str(n-axy) + " errors was found starting at position " + str(bx)."
return end
print subsequence("abcd","bcjabddec",3)

谢谢

Answer 1

我假设你正在进行这项练习，或者与它非常相似的事情：http://www.cs.hofstra.edu/~cscccl/csc15p/dnalab.txt

我不确定起始示例是否特别有用。它很难理解，因此很难以指定的方式适应。我会就如何解决这些问题提出一些建议：

将问题分解为更容易解决的小问题。
编写测试每个函数的代码可以得到您期望的结果。
使用变量和函数名称清楚地表明每个名称的含义，以便代码更易于为您自己和您展示的代码理解。

你要做的是沿字符串B滑动字符串A，并在每个位置测试匹配的好坏，记住你找到的最佳匹配。分离问题的一个简单部分是衡量两个字符串的特定对齐方式所获得的匹配程度。

以下是我对解决方案的一般形态的建议：

def string_distance(A, B, index_b):
    '''Count how many substitutions are required to
    change string A into the string beginning at
    index_b in string B.'''
    return 999 # FIXME

def subsequence(A, B, n):
    '''Search for the substring of B that is equal to
    A except for the minimal number of substitutions,
    and return a tuple of the index of that substring
    in B and the number of substitutions required,
    unless there is no such substring with n or fewer
    substitutions, in which case return the tuple of
    (-1, -1).'''
    answer_distance = -1
    answer_index = -1
    index = 0
    while index <= len(B) - len(A):
        substitutions_required = string_distance(A, B, index)
        # Does a match at this location need n or fewer substitutions?
        is_close_enough = False # FIXME
        # Is a match at this location better than the best match so far?
        is_better = False # FIXME
        if is_close_enough and is_better:
            answer_distance = substitutions_required
            answer_index = index
        index += 1
    return answer_index, answer_distance

以下是一些基本测试的结果：

def test_string_distance():
    test_data = [
        # a         b          index_b   expected
        ("ABC",    "ABCDEF",   0,        0),
        ("ABX",    "ABCDEF",   0,        1),
        ("XYZ",    "ABCDEF",   0,        3),
        ("XBC",    "ABCDEF",   0,        1),
        ("CEE",    "ABCDEF",   2,        1),
        ("DEF",    "ABCDEF",   3,        0),
        ("AAAAA",  "BBBBBBBB", 3,        5),
        ("BAAAA",  "BBBBBBBB", 3,        4),
        ("ABAAB",  "BBBBBBBB", 3,        3),
    ]
    for a, b, index_b, expected in test_data:
        result = string_distance(a, b, index_b)
        if result != expected:
            print "string_distance({}, {}, {}) was {} but should be {}".format(
                    a, b, index_b, result, expected)

def test_subsequence():
    test_data = [
        # A         B               n   expected
        ("AXY",    "AYAXXXAAYYAX",  3,  (2,1)),
        ("AXY",    "AYAXXXAAYYAX",  2,  (2,1)),
        ("AXY",    "AYAXXXAAYYAX",  1,  (2,1)),
        ("AXY",    "AYAXXXAAYYAX",  0,  (-1,-1)),
        ("XAAY",   "AYAXXXAAYYAX",  2,  (5,0)),
        ("XAAY",   "XXAXAAXAAY",    2,  (6,0)),
        ("ABCDEF", "ZZAABAXCDEEEF", 3,  (5,2)),
        ("ABCDEF", "ZZAABAXCDEEEF", 2,  (5,2)),
        ("ABCDEF", "ZZAABAXCDEEEF", 1,  (-1,-1)),
        ("ABCDEF", "ZZAABXBCDEEEF", 3,  (5,2)),
    ]
    for a, b, n, expected in test_data:
        result = subsequence(a, b, n)
        if result != expected:
            print "test_subsequence({}, {}, {}) was {} but should be {}".format(
                    a, b, n, result, expected)

if __name__ == '__main__':
    test_string_distance()
    test_subsequence()

首先找出将传递这些测试的string_distance的实现。然后得到后续工作。

Answer 2

您应该查看difflib.get_close_matches函数。它几乎完全相同。如果你真的想要实现它，那么只需构建一个相同长度的所有子序列的列表并按匹配数对它们进行排序就可以轻松实现它。

def subsequence(a, b, n):

    sb = [(i, b[i:i+len(a)]) for i in xrange(len(b) - len(a) + 1)]

    def count_matches(s):
        return sum(int(x == y) for (x, y) in zip(a, s[1]))

    sb.sort(key=count_matches, reverse=True)

    best = sb[0]

    e = len(a) - count_matches(best)
    i = best[0]
    w = best[1]

    print 'A best match with %d errors was found starting at position %d' % (e, i)

在B中找到A的子串，有n个错误

2 个答案: