Finding the most similar subsequence in another sequence

Date: 2014-11-11 17:55:24

Tags: algorithm search bioinformatics fuzzy-search string-algorithm

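I need to write an algorithm that finds the substring of S1 most similar to another string S2 (in other words, the substring of S1 with the minimum Hamming distance to S2) in N log(N) time, where N = len(S1) + len(S2) and len(S2) <= len(S1).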

For example:
S1 = AGTCAGTC
S2 = GTC
Answer: GTC (distance 0)

S1 = AAGGTTCC
S2 = TCAA
Answer: TTCC (distance 3)

The time complexity must not exceed O(N log(N)). Space complexity does not matter.
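To pin down the definition, here is a minimal brute-force sketch that slides S2 along S1 and keeps the window with the fewest mismatches. It runs in O(len(S1) * len(S2)) time, so it does not meet the required bound and is only meant as a reference for checking answers. The function names `hamming` and `best_window` are made up for this illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def best_window(s1, s2):
    """Return (substring, distance) for the window of s1 closest to s2.

    Brute force: every window of length len(s2) is compared with s2.
    Ties are resolved in favour of the leftmost window.
    """
    m = len(s2)
    best_start = min(range(len(s1) - m + 1),
                     key=lambda i: hamming(s1[i:i + m], s2))
    window = s1[best_start:best_start + m]
    return window, hamming(window, s2)


print(best_window("AGTCAGTC", "GTC"))   # ('GTC', 0)
print(best_window("AAGGTTCC", "TCAA"))  # ('TTCC', 3)
```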

LCS (longest common subsequence) does not work in my case. For example:


    S1 = GAATCCAGTCTGTCT
    S2 = AATATATAT

    LCS answer: AATCCAGTC
    GAATCCAGTCTGTCT
     |||
     AATATATAT

    right answer: AGTCTGTCT
    GAATCCAGTCTGTCT
          | | | | |
          AATATATAT
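The two alignments above can be scored directly by counting mismatches at a fixed offset. This is only a small check of the example; `mismatches_at` is a hypothetical helper name, and offsets 1 and 6 are the positions of the two windows shown above.

```python
S1 = "GAATCCAGTCTGTCT"
S2 = "AATATATAT"


def mismatches_at(s1, s2, offset):
    """Hamming distance between s2 and the window of s1 starting at offset."""
    window = s1[offset:offset + len(s2)]
    return sum(a != b for a, b in zip(window, s2))


for offset in (1, 6):
    print(offset, S1[offset:offset + len(S2)], mismatches_at(S1, S2, offset))
# 1 AATCCAGTC 5
# 6 AGTCTGTCT 4
```

The window at offset 6 has fewer mismatches even though its matching characters are not contiguous, which is why an approach based on a common run of characters picks the wrong window here.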

1 answer:

Answer 0 (score: 1)

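I think you are trying to solve the longest common subsequence problem. That problem is about finding the smallest number of modifications needed to turn one string into another.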

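You said you are trying to write an algorithm to do this. Look into the LCS problem and try searching for it on Google if you want to write the algorithm yourself, or you could take advantage of the command-line utility diff.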

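Just like the longest common subsequence problem, diff takes two files and finds a sequence of additions and deletions that transforms file1 into file2 with the fewest modifications. diff is very efficient, and I imagine it will be fast enough for your purposes. I am fairly sure most diff implementations have space and time complexity of O(N log(N)) or lower, but you may want to verify that yourself. More about diff: http://en.wikipedia.org/wiki/Diff_utility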

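I wrote a small Python program that uses diff to find the longest contiguous common subsequence. It works on Unix, but I am sure whatever platform you are using has a diff utility.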

The files are expected to have one character per line. You may have to write a program to convert your files into that form.

import sys
import subprocess
import os
import re

def getLinesFromShellCommand(command):
    # Run the command through the shell and capture stdout and stderr as text lines.
    p = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE,
                         stderr=subprocess.STDOUT, cwd=os.getcwd(),
                         universal_newlines=True)
    lines = p.stdout.readlines()
    retVal = p.wait()
    return (retVal, lines)

# We need to find the longest run of lines that start with a space (a space means the line is common to both files).
# We could use some kind of while loop to detect the start and end of each group of lines that start with spaces.
# However, there is a much simpler method: build a string from the first character of each line and use a regex
# to find all the runs of spaces. After that, just take the longest run.
def findLongestCommonSubsequence(diffOutput):
    outputFirstCharStr = "".join(line[:1] for line in diffOutput)
    commonSubsequences = [(m.start(0), m.end(0)) for m in re.finditer(" +", outputFirstCharStr)]
    # Each entry is a (start, end) pair of line indices; pick the pair that spans the most lines.
    longestCommonSubsequence = max(commonSubsequences, key=lambda span: span[1] - span[0])
    return longestCommonSubsequence

def main():
    if len(sys.argv) != 3:
        sys.stderr.write("usage: python LongestCommonSubsequence.py <file1> <file2>\n")
        sys.exit(1)
    commandOutput = getLinesFromShellCommand("diff -u {0} {1}".format(sys.argv[1], sys.argv[2]))
    if commandOutput[0] != 1: # diff exits with 1 when the files differ (0 means identical, 2 means an error)
        sys.stderr.write("diff failed with input files.\n")
        sys.exit(1)
    diffOutput = commandOutput[1]
    diffOutput = diffOutput[3:] # the first three lines (the two file headers and the hunk header) are not needed
    longestCommonSubsequence = findLongestCommonSubsequence(diffOutput)
    print("Indices:", longestCommonSubsequence)
    for i in range(longestCommonSubsequence[0], longestCommonSubsequence[1]):
        print(diffOutput[i], end="") # each line already ends with a newline

if __name__ == "__main__":
    main()

Usage:

python LongestCommonSubsequence.py f1.txt f2.txt

If f1.txt and f2.txt contain the second example you gave, the output is:

Indices: (5, 7)
 T
 C
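To get input files in the one-character-per-line form that the program expects, something like the following sketch could be used. The helper name `write_one_char_per_line` is made up for this illustration; the file names are the ones from the usage line above.

```python
def write_one_char_per_line(text, path):
    """Write each character of text on its own line in the file at path."""
    with open(path, "w") as handle:
        for ch in text:
            handle.write(ch + "\n")


if __name__ == "__main__":
    # Second example from the question.
    write_one_char_per_line("AAGGTTCC", "f1.txt")
    write_one_char_per_line("TCAA", "f2.txt")
```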

Edit: I saw your comment explaining why the above does not work. You might be interested in this post: https://cs.stackexchange.com/questions/2519/efficiently-calculating-minimum-edit-distance-of-a-smaller-string-at-each-positi
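Independently of the linked post, one standard technique for the Hamming-distance version of the problem is to count matches per character with a convolution. For each letter, mark its positions in S1 and in the reversed S2 with 1s, convolve the two indicator arrays with an FFT, and sum the results; the total at each offset is the number of matching positions for that window. The sketch below assumes a small fixed alphabet (such as A, C, G, T), which gives roughly O(k * N log N) time for k distinct letters. The pure-Python FFT and all function names here are illustrative choices, not something from the question or the answer above.

```python
import cmath


def fft(values, invert=False):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def convolve(a, b):
    """Linear convolution of two lists of small integers via FFT."""
    size = 1
    while size < len(a) + len(b) - 1:
        size *= 2
    fa = fft(list(a) + [0] * (size - len(a)))
    fb = fft(list(b) + [0] * (size - len(b)))
    product = [x * y for x, y in zip(fa, fb)]
    result = fft(product, invert=True)
    # The inverse transform needs a 1/size scale; round away float noise.
    return [round((v / size).real) for v in result[:len(a) + len(b) - 1]]


def match_counts(s1, s2):
    """matches[i] = number of positions where s1[i:i+len(s2)] agrees with s2."""
    m = len(s2)
    windows = len(s1) - m + 1
    matches = [0] * windows
    for ch in set(s2):
        a = [1 if c == ch else 0 for c in s1]
        b = [1 if c == ch else 0 for c in reversed(s2)]
        conv = convolve(a, b)
        # With s2 reversed, index i + m - 1 holds the count for window i.
        for i in range(windows):
            matches[i] += conv[i + m - 1]
    return matches


def best_window_fft(s1, s2):
    """Return (substring, Hamming distance) for the leftmost best window."""
    matches = match_counts(s1, s2)
    start = max(range(len(matches)), key=matches.__getitem__)
    return s1[start:start + len(s2)], len(s2) - matches[start]


print(best_window_fft("AAGGTTCC", "TCAA"))  # ('TTCC', 3)
```

For real inputs a library FFT would be the more practical choice; the hand-written transform is used here only to keep the sketch self-contained.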