Question

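I'm trying to solve a DNA problem that is more or less a modified version (?) of the LCS problem. The problem defines a semi-substring of a string: a part of the string in which one letter or no letter may be skipped. For example, the string "desktop" has the semi-substrings {"destop", "dek", "stop", "skop", "desk", "top"}, each of which skips one letter or none.

Now I'm given two DNA strings made up of {a, t, g, c}. I'm trying to find the longest semi-substring (LSS) common to both. If there is more than one LSS, print the one that comes first in order.

For example, the two DNA strings {attgcgtagcaatg, tctcaggtcgatagtgac} print "tctagcaatg",

and aaaattttcccc, cccgggggaatatca print "aattc".

I'm trying to use the standard LCS algorithm, but I can't work out the table, although I did solve the version with no skipped letters. Any suggestions?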
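
To pin down the semi-substring rule from the question, here is a small checker in Python. It is only a sketch: the name `is_semi_substring` and the `max_skip` parameter are made up for illustration, and it assumes that "one letter or none skipped" applies between every pair of consecutive kept letters, which is the reading that makes "tctagcaatg" a valid answer for the first DNA example.

```python
def is_semi_substring(text, candidate, max_skip=1):
    """Return True if candidate can be read from text left to right while
    skipping at most max_skip letters between consecutive kept letters."""
    if not candidate:
        return True
    # Positions in text where the part of candidate matched so far can end.
    ends = {i for i, ch in enumerate(text) if ch == candidate[0]}
    for ch in candidate[1:]:
        ends = {
            j
            for i in ends
            # The next kept letter sits 1..max_skip+1 positions further on.
            for j in range(i + 1, min(i + max_skip + 2, len(text)))
            if text[j] == ch
        }
        if not ends:
            return False
    return True


for word in ["destop", "dek", "stop", "skop", "desk", "top"]:
    print(word, is_semi_substring("desktop", word))
```

Under this reading, `is_semi_substring("desktop", "dtop")` is False, because three letters lie between "d" and "t".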

Answer 1

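Let g(c, rs, rt) represent the longest common semi-substring of the strings S and T that ends at rs and rt, where rs and rt are the ranked occurrences of the character c in S and T, respectively, and K is the number of skips allowed. We can then form a recursion, which has to be evaluated for all pairs of occurrences of c in S and T.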
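
As a rough illustration of that kind of recursion, here is a Python sketch. Note that it is a simplified, index-based variant rather than the formulation above: it works on plain positions i and j instead of ranked occurrences of c, `max_skip` plays the role of K, and the function name is hypothetical. Ties are resolved by scan order, which is not necessarily the ordering rule the question asks for.

```python
def longest_common_semi_substring(s, t, max_skip=1):
    """best[i][j] is the length of the longest common semi-substring
    that ends exactly at s[i] and t[j] (0 when the letters differ)."""
    n, m = len(s), len(t)
    best = [[0] * m for _ in range(n)]
    prev = [[None] * m for _ in range(n)]  # back-pointers for reconstruction
    top, top_cell = 0, None
    for i in range(n):
        for j in range(m):
            if s[i] != t[j]:
                continue
            best[i][j] = 1
            # The previous matched pair may sit 1..max_skip+1 steps back
            # in each string, independently.
            for di in range(1, max_skip + 2):
                for dj in range(1, max_skip + 2):
                    pi, pj = i - di, j - dj
                    if pi >= 0 and pj >= 0 and best[pi][pj] + 1 > best[i][j]:
                        best[i][j] = best[pi][pj] + 1
                        prev[i][j] = (pi, pj)
            if best[i][j] > top:
                top, top_cell = best[i][j], (i, j)
    # Walk the back-pointers from the best end cell to rebuild the string.
    chars = []
    cell = top_cell
    while cell is not None:
        i, j = cell
        chars.append(s[i])
        cell = prev[i][j]
    return "".join(reversed(chars))


print(longest_common_semi_substring("attgcgtagcaatg", "tctcaggtcgatagtgac"))
```

The table has len(S) x len(T) cells and each cell looks at (K + 1)^2 predecessors, so the running time is O(len(S) * len(T) * (K + 1)^2).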

JavaScript code:

Answer 2

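Here is a variation on the dynamic programming solution for LCS, written in Python.

First I build a suffix tree for all of the substrings that can be made from each string under the skip rule. Then I intersect the two suffix trees. Finally, I look for the longest string that can be generated from that intersection tree.

Note that this is technically O(n^2). The worst case is when both strings are the same character repeated over and over, because you end up with a lot of entries that logically say something like "an 'l' at position 42 in one string could match the 'l' at position 54 in the other." In practice, though, it will be O(n).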

def find_subtree(text, max_skip=1):
    # tree maps a character to a nested dict of the characters that may follow it.
    tree = {}
    # Memo: position -> subtree of possible continuations.
    tree_at_position = {}

    def subtree_from_position(position):
        if position not in tree_at_position:
            this_tree = {}
            if position < len(text):
                char = text[position]
                # Make sure that we've populated the further tree.
                subtree_from_position(position + 1)

                # If this char appeared later, include those possible matches.
                # Caveat: the merged entries stay in this position's subtree,
                # so a parent that links here also sees continuations that
                # belong to a later occurrence of char.
                if char in tree:
                    for char2, subtree in tree[char].items():
                        this_tree[char2] = subtree

                # And now update the new choices. The nearest position is
                # assigned last, so it wins when the next letters are equal.
                for skip in range(max_skip + 1, 0, -1):
                    if position + skip < len(text):
                        this_tree[text[position + skip]] = subtree_from_position(position + skip)

                tree[char] = this_tree

            tree_at_position[position] = this_tree

        return tree_at_position[position]

    subtree_from_position(0)

    return tree


def find_longest_common_semistring(text1, text2):
    tree1 = find_subtree(text1)
    tree2 = find_subtree(text2)

    # Memo keyed by the identity of the two subtrees being intersected.
    answered = {}

    def find_intersection(subtree1, subtree2):
        unique = (id(subtree1), id(subtree2))
        if unique not in answered:
            answer = {}
            # Python 3: dict.items() replaces Python 2's iteritems().
            for k, v in subtree1.items():
                if k in subtree2:
                    answer[k] = find_intersection(v, subtree2[k])
            answered[unique] = answer
        return answered[unique]

    # Memo: id of a subtree -> longest string that can be read from it.
    found_longest = {}

    def find_longest(tree):
        if id(tree) not in found_longest:
            best_candidate = ''
            for char, subtree in tree.items():
                candidate = char + find_longest(subtree)
                if len(best_candidate) < len(candidate):
                    best_candidate = candidate
            found_longest[id(tree)] = best_candidate
        return found_longest[id(tree)]

    intersection_tree = find_intersection(tree1, tree2)
    return find_longest(intersection_tree)


print(find_longest_common_semistring("attgcgtagcaatg", "tctcaggtcgatagtgac"))
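
For cross-checking results on short inputs, the same three steps (enumerate the semi-substrings of each string, intersect, pick the longest) can also be sketched naively with plain sets. This is not the tree-based code above, just a brute-force stand-in: the function names are made up for illustration, the enumeration grows exponentially with the string length, and the alphabetical tie-break is an arbitrary choice rather than the rule from the question.

```python
def semi_substrings(text, max_skip=1):
    """Every non-empty string readable from text with at most max_skip
    letters skipped between consecutive kept letters."""
    found = set()

    def extend(position, prefix):
        prefix += text[position]
        found.add(prefix)
        # Continue with a letter 1..max_skip+1 positions further on.
        for step in range(1, max_skip + 2):
            if position + step < len(text):
                extend(position + step, prefix)

    for start in range(len(text)):
        extend(start, "")
    return found


def longest_common_brute_force(text1, text2, max_skip=1):
    common = semi_substrings(text1, max_skip) & semi_substrings(text2, max_skip)
    # Longest first; among equal lengths, alphabetical order (arbitrary).
    return min(common, key=lambda s: (-len(s), s), default="")


print(longest_common_brute_force("attgcgtagcaatg", "tctcaggtcgatagtgac"))
```

Because it enumerates everything, this version is only practical for strings of a few dozen characters at most, but that is enough to test a faster solution against the examples in the question.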

DNA subsequence dynamic programming problem
