Find the longest common substring of two genomic sequences

Time: 2017-08-28 13:07:06

Tags: python python-3.x pattern-matching substring

I have two sequences, AAAAAAAAAGAAAAGAAGAAG and AAAGAAG. The correct answer is AAGAAG.

But my code gives AA.

Sometimes the two strings will be in this order: AAAGAAG, AAAAAAAAAGAAAAGAAGAAG.

Here is my code:

`def longestSubstringFinder(string1, string2):
    string1=string1.strip()
    string2=string2.strip()
    answer = ""
    len1=len(string1)
    len2=len(string2)
    if int(len1)>1 and int(len2)>1:
        for i in range(1,len1,1):
            match = ""
            for j in range(len2):
                if len1>len2:
                    if i+j<len1 and (string1[i+j]==string2[i+j]):
                        match=str(match)+str(string2[i+j])
                        print(match)
                    else:
                        if len(match)>len(answer):
                            answer=match
                            match=""
                elif len2>len1:
                    if i+j<len2 and (string1[i+j]==string2[i+j]):
                        match=str(match)+str(string2[i+j])
                        print(match)
                    else:
                        if len(match)>len(answer):
                            answer=match
                            match=""
    return(answer)`

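2 answers:

Answer 0 (score: 4)

Get all the substrings of both strings, take the intersection of the two sets of substrings, and then pick the longest string in that intersection.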

def get_all_substrings(input_string):
  length = len(input_string)
  return [input_string[i:j+1] for i in range(length) for j in range(i,length)]

strA = 'AAAAAAAAAGAAAAGAAGAAG'
strB = 'AAAGAAG'

intersection = set(get_all_substrings(strA)).intersection(set(get_all_substrings(strB)))
print(max(intersection, key=len))
# AAAGAAG
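
The approach above builds every substring of both inputs, which is a quadratic number of strings per input and can get heavy on long sequences. As an alternative (not the answerer's method), here is a minimal sketch of the classic dynamic-programming longest-common-substring technique, which runs in time proportional to len(a) * len(b) and keeps only two rows of the table. The function name `longest_common_substring` is hypothetical and chosen for illustration.

```python
def longest_common_substring(a, b):
    """Return one longest substring shared by a and b (illustrative helper)."""
    # prev[j] = length of the longest common suffix of a[:i-1] and b[:j]
    prev = [0] * (len(b) + 1)
    best_len = 0
    best_end = 0  # exclusive end index of the best match inside a
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the match that ended at the previous characters
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return a[best_end - best_len:best_end]


print(longest_common_substring('AAAAAAAAAGAAAAGAAGAAG', 'AAAGAAG'))
# AAAGAAG
```

The result does not depend on the order of the arguments, and an empty string is returned when the inputs share no characters.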

Answer 1 (score: 1)

A few weeks ago I stumbled upon the difflib package in Python, which is well suited to this kind of job.

Here is a solution to your problem:

import difflib
matcher = difflib.SequenceMatcher()

str1 = 'AAAGAAG'
str2 = 'AAAAAAAAAGAAAAGAAGAAG'
matcher.set_seq2(str2)
matcher.set_seq1(str1)

m = matcher.find_longest_match(0, len(str1), 0, len(str2))
print("Longest sequence of {} found in {}: {}".format(str1, str2, str1[m.a: m.a+m.size]))
# Longest sequence of AAAGAAG found in AAAAAAAAAGAAAAGAAGAAG: AAAGAAG
print(str2[:m.b]+'|'+str2[m.b:m.b+m.size]+'|'+str2[m.b+m.size:])
# AAAAAAAAAGA|AAAGAAG|AAG

str1 = 'AGAG'

matcher.set_seq1(str1)

m = matcher.find_longest_match(0, len(str1), 0, len(str2))
print("Longest sequence of {} found in {}: {}".format(str1, str2, str1[m.a: m.a+m.size]))
# Longest sequence of AGAG found in AAAAAAAAAGAAAAGAAGAAG: AGA
print(str2[:m.b]+'|'+str2[m.b:m.b+m.size]+'|'+str2[m.b+m.size:])
# AAAAAAAA|AGA|AAAGAAGAAG

str1 = 'XXX'

matcher.set_seq1(str1)

m = matcher.find_longest_match(0, len(str1), 0, len(str2))
print("Longest sequence of {} found in {}: {}".format(str1, str2, str1[m.a: m.a+m.size]))
# Longest sequence of XXX found in AAAAAAAAAGAAAAGAAGAAG: 
print(str2[:m.b]+'|'+str2[m.b:m.b+m.size]+'|'+str2[m.b+m.size:])
# ||AAAAAAAAAGAAAAGAAGAAG

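The difflib documentation says:

"SequenceMatcher computes and caches detailed information about the second sequence, so if you want to compare one sequence against many sequences, use set_seq2() to set the commonly used sequence once and call set_seq1() repeatedly, once for each of the other sequences."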
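
The caching pattern described in that quote can be sketched as a loop: set the shared sequence once with set_seq2(), then call set_seq1() for each query. The variable names and the list of queries below are illustrative assumptions. Passing autojunk=False is also my addition: for sequences of 200 or more elements the default heuristic treats very frequent elements as junk, which is likely to matter for a four-letter DNA alphabet.

```python
import difflib

reference = 'AAAAAAAAAGAAAAGAAGAAG'   # the sequence reused in every comparison
queries = ['AAAGAAG', 'AGAG', 'XXX']  # illustrative query sequences

# autojunk=False is an assumption made here, see the note above
matcher = difflib.SequenceMatcher(autojunk=False)
matcher.set_seq2(reference)           # analysed once; the details are cached

longest = {}
for query in queries:
    matcher.set_seq1(query)           # only the first sequence changes
    m = matcher.find_longest_match(0, len(query), 0, len(reference))
    longest[query] = query[m.a:m.a + m.size]

print(longest)
# {'AAAGAAG': 'AAAGAAG', 'AGAG': 'AGA', 'XXX': ''}
```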

It is also very fast!

I timed @AK47's great solution and it came in at: 10000 loops, best of 3: 85.2 µs per loop

My solution came in at: 10000 loops, best of 3: 31.6 µs per loop
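
The answer does not say how these timings were taken (the output format looks like IPython's %timeit). A rough way to rerun the comparison with the standard timeit module might look like the sketch below. The wrapper names `via_substring_sets` and `via_difflib` are hypothetical, the difflib version here rebuilds the matcher on every call (which may differ from what was originally measured), and absolute numbers will vary by machine and Python version.

```python
import difflib
import timeit

str_a = 'AAAAAAAAAGAAAAGAAGAAG'
str_b = 'AAAGAAG'


def get_all_substrings(s):
    return [s[i:j + 1] for i in range(len(s)) for j in range(i, len(s))]


def via_substring_sets(a, b):
    # Intersection of all substrings, as in answer 0
    common = set(get_all_substrings(a)) & set(get_all_substrings(b))
    return max(common, key=len) if common else ''


def via_difflib(a, b):
    # SequenceMatcher-based lookup, as in answer 1 (matcher rebuilt per call)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]


loops = 1000
for func in (via_substring_sets, via_difflib):
    # Best of 3 runs, reported as microseconds per call
    seconds = min(timeit.repeat(lambda: func(str_a, str_b), number=loops, repeat=3))
    print('{}: {:.1f} µs per loop'.format(func.__name__, seconds / loops * 1e6))
```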