I am trying to write a script that performs two tasks when given two strings:
1. Find the longest sequence of characters, starting from pos[0], that is identical in both strings
2. Find the longest sequence of characters that occurs in both strings
Example for problem 1:
Seq1 = 'ATCCTTAGC'
Seq2 = 'ATCCAGCAATTC'
        ^^^^ Match from pos[0] to pos[3]
Pos: 0:3
Length: 4
Seq: ATCC
Example for problem 2:
Seq1 = 'TAGCTCCTTAGC' # Contains 'TCCTTA'
Seq2 = 'GCAGCCATCCTTA' # Contains 'TCCTTA'
        ^ No match at pos[0]
Pos1: 4:9
Pos2: 7:12
Length: 6
Seq: TCCTTA
To accomplish problem 1, I have the following:
#!/usr/bin/python
upstream_seq = 'ATACATTGGCCTTGGCTTAGACTTAGATCTAGACCTGAAAATAACCTGCCGAAAAGACCCGCCCGACTGTTAATACTTTACGCGAGGCTCACCTTTTTGTTGTGCTCCC'
downstream_seq = 'ATACACGAAAAGCGTTCTTTTTTTGCCACTTTTTTTTTATGTTTCAAAACGGAAAATGTCGCCGTCGTCGGGAGAGTGCCTCCTCTTAGTTTATCAAATAAAGCTTTCG'
# Apply the % formatting inside print() so this also runs on Python 3
print("Upstream: %s\nDownstream: %s\n" % (upstream_seq, downstream_seq))
mh = 0
pos_count = 0
seq = ""
position =""
longest_hom=""
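# Stop at the end of the shorter sequence to avoid an IndexError
for i in range(min(len(upstream_seq), len(downstream_seq))):
    pos_count += 1
    if upstream_seq[i] == downstream_seq[i]:
        mh += 1
        seq += upstream_seq[i]
        position = pos_count
        longest_hom = mh
    else:
        # First mismatch ends the common prefix
        mh = 0
        break
print("Pos: 0:%s\nLength: %s\nSeq: %s\n" % (position, longest_hom, seq))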
Output:
Upstream: ATACATTGGCCTTGGCTTAGACTTAGATCTAGACCTGAAAATAACCTGCCGAAAAGACCCGCCCGACTGTTAATACTTTACGCGAGGCTCACCTTTTTGTTGTGCTCCC
Downstream: ATACACGAAAAGCGTTCTTTTTTTGCCACTTTTTTTTTATGTTTCAAAACGGAAAATGTCGCCGTCGTCGGGAGAGTGCCTCCTCTTAGTTTATCAAATAAAGCTTTCG
Pos: 0:5
Length: 5
Seq: ATACA
I am stuck on problem 2. So far I have considered using BioPython's pairwise2 to align the two sequences. However, in this case I only want perfect matches (no gaps, no extensions), and I only want to see the longest matching sequence, not the consensus that I seem to be getting:
from Bio import pairwise2 as pw2
global_align = pw2.align.globalms(upstream_seq, downstream_seq, 3, -1, -.5, -.5)
print(global_align[0])
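Question: How can I find the longest sequence of characters that occurs in both strings?
Answer 0 (score: 8)
Here is a short piece of code for problem 1: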
upstream_seq = 'ATACATTGGCCTTGGCTTAGACTTAGATCTAGACCTGAAAATAACCTGCCGAAAAGACCCGCCCGACTGTTAATACTTTACGCGAGGCTCACCTTTTTGTTGTGCTCCC'
downstream_seq = 'ATACACGAAAAGCGTTCTTTTTTTGCCACTTTTTTTTTATGTTTCAAAACGGAAAATGTCGCCGTCGTCGGGAGAGTGCCTCCTCTTAGTTTATCAAATAAAGCTTTCG'
common_prefix = ''
for x, y in zip(upstream_seq, downstream_seq):
    if x == y:
        common_prefix += x
    else:
        break
print(common_prefix)
# ATACA
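As an alternative sketch for problem 1, the standard library's os.path.commonprefix compares its arguments character by character, so it also works on strings that are not file paths. The helper name common_prefix_report is hypothetical and only used for illustration.

```python
import os.path


def common_prefix_report(s1, s2):
    """Return (prefix, length) for the longest shared prefix of s1 and s2."""
    # commonprefix works character by character on any strings,
    # despite living in os.path.
    prefix = os.path.commonprefix([s1, s2])
    return prefix, len(prefix)


prefix, length = common_prefix_report('ATCCTTAGC', 'ATCCAGCAATTC')
print("Length: %s\nSeq: %s" % (length, prefix))
```

The match always starts at index 0, so the last matching index is length - 1 whenever length is greater than zero.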
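A naive approach to problem 2 is to generate the set of all substrings of each string, compute the intersection of the two sets, and take the longest element. Note that a string of length n has on the order of n² substrings, so this only scales to short inputs: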
upstream_seq = 'ATACATTGGCCTTGGCTTAGACTTAGATCTAGACCTGAAAATAACCTGCCGAAAAGACCCGCCCGACTGTTAATACTTTACGCGAGGCTCACCTTTTTGTTGTGCTCCC'
downstream_seq = 'ATACACGAAAAGCGTTCTTTTTTTGCCACTTTTTTTTTATGTTTCAAAACGGAAAATGTCGCCGTCGTCGGGAGAGTGCCTCCTCTTAGTTTATCAAATAAAGCTTTCG'
def all_substrings(string):
    n = len(string)
    # Every slice string[i:j+1] with i <= j, collected in a set
    return {string[i:j+1] for i in range(n) for j in range(i, n)}
print(all_substrings('ABCA'))
# {'CA', 'BC', 'ABC', 'C', 'BCA', 'AB', 'A', 'B', 'ABCA'}
print(all_substrings(upstream_seq) & all_substrings(downstream_seq))
# {'AAAG', 'CA', 'A', 'AAC', 'TGTT', 'ACT', 'CTTAG', 'GCT', 'ATAC', 'AAAA', 'TTTA', 'AAT', 'GTGC', 'CTT', 'AAAAG', 'TTTG', 'CGAA', 'AA', 'CGAAAAG', 'GCC', 'ACA', 'TGCC', 'AAATAA', 'CTCC', 'TTTTT', 'CGCC', 'CAC', 'GAG', 'CTC', 'CGAAAA', 'ATC', 'TCA', 'GA', 'CGC', 'TGT', 'GT', 'GC', 'GAAA', 'ACTTT', 'AAG', 'TTTT', 'CT', 'AATA', 'TCC', 'CGAAA', 'GAA', 'GAAAAG', 'GTT', 'AG', 'TC', 'AAAAT', 'CC', 'TTT', 'AATAA', 'CTTTT', 'ACTT', 'TTA', 'CTTT', 'GCTT', 'GCCG', 'GTG', 'TACA', 'TT', 'GCG', 'TTTTTG', 'TAG', 'TTG', 'TTAG', 'AAATA', 'CTTTTT', 'AAAT', 'TAA', 'ACG', 'TG', 'GCCT', 'G', 'TAC', 'CCT', 'TCT', 'ATA', 'CTTA', 'CCG', 'CG', 'ATAA', 'GG', 'ATACA', 'AGA', 'TGC', 'C', 'T', 'AT', 'GAAAA', 'CGA', 'GAAAAT', 'TA', 'AC', 'AAA', 'TTTTG'}
print(max(all_substrings(upstream_seq) & all_substrings(downstream_seq), key=len))
# CGAAAAG
If you want a more efficient approach, you should use a suffix tree.
If you do not want to reinvent the wheel, you can use difflib.SequenceMatcher.find_longest_match.
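A full suffix tree is too long to show here. As a rough sketch of the same idea, the code below swaps in a suffix automaton, which is a different but closely related structure, not a suffix tree. It is built over the first string, and the second string is then walked through it. The function name longest_common_substring_sam and all variable names are hypothetical. Run time is roughly linear in the combined length of the inputs for a small alphabet such as DNA.

```python
def longest_common_substring_sam(s1, s2):
    """Return one longest common substring of s1 and s2."""
    # Suffix automaton of s1: per state, outgoing transitions, suffix link,
    # and the length of the longest string that reaches the state.
    transitions = [{}]
    link = [-1]
    length = [0]
    last = 0
    for ch in s1:
        cur = len(transitions)
        transitions.append({})
        link.append(-1)
        length.append(length[last] + 1)
        p = last
        while p != -1 and ch not in transitions[p]:
            transitions[p][ch] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = transitions[p][ch]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                # Split state q by cloning it with a shorter length.
                clone = len(transitions)
                transitions.append(dict(transitions[q]))
                link.append(link[q])
                length.append(length[p] + 1)
                while p != -1 and transitions[p].get(ch) == q:
                    transitions[p][ch] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur

    # Walk s2 through the automaton, tracking the longest suffix of the
    # text read so far that is also a substring of s1.
    state = 0
    matched = 0
    best_len = 0
    best_end = 0  # index in s2 one past the end of the best match
    for i, ch in enumerate(s2):
        while state != 0 and ch not in transitions[state]:
            state = link[state]
            matched = length[state]
        if ch in transitions[state]:
            state = transitions[state][ch]
            matched += 1
        else:
            matched = 0
        if matched > best_len:
            best_len = matched
            best_end = i + 1
    return s2[best_end - best_len:best_end]


print(longest_common_substring_sam('TAGCTCCTTAGC', 'GCAGCCATCCTTA'))
```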
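Answer 1 (score: 2)
The longest common substring problem can be tackled in several ways, some more efficient than others. One very efficient solution uses dynamic programming; implementations for both Python 2 and 3 can be found on Wikibooks. A naive solution, which is simpler and easier to understand but less efficient, is this: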
def longest_common_substring(s1, s2):
    # Naive approach: try every pair of start positions (i in s1, j in s2)
    # and extend the match for as long as the characters agree.
    best_match_start = 0
    best_match_length = 0
    for i in range(len(s1)):
        for j in range(len(s2)):
            k = 0
            # Check the bounds before indexing to avoid an IndexError
            while i + k < len(s1) and j + k < len(s2) and s1[i + k] == s2[j + k]:
                k += 1
            if k > best_match_length:
                best_match_start = i
                best_match_length = k
    return s1[best_match_start:best_match_start + best_match_length]
upstream_seq = 'ATACATTGGCCTTGGCTTAGACTTAGATCTAGACCTGAAAATAACCTGCCGAAAAGACCCGCCCGACTGTTAATACTTTACGCGAGGCTCACCTTTTTGTTGTGCTCCC'
downstream_seq = 'ATACACGAAAAGCGTTCTTTTTTTGCCACTTTTTTTTTATGTTTCAAAACGGAAAATGTCGCCGTCGTCGGGAGAGTGCCTCCTCTTAGTTTATCAAATAAAGCTTTCG'
print(longest_common_substring(upstream_seq, downstream_seq))
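The dynamic-programming approach mentioned in this answer can be sketched as follows. This is a minimal illustration, not the Wikibooks implementation; the function name longest_common_substring_dp and its return format are assumptions. It fills a table of common-suffix lengths row by row and keeps only the previous row, so it needs time proportional to len(s1) * len(s2) and memory proportional to len(s2).

```python
def longest_common_substring_dp(s1, s2):
    """Return (start_in_s1, start_in_s2, length) of a longest common substring."""
    # prev[j] is the length of the longest common suffix of
    # s1[:i-1] and s2[:j]; cur[j] is the same for s1[:i].
    prev = [0] * (len(s2) + 1)
    best_len = 0
    best_end1 = 0
    best_end2 = 0
    for i in range(1, len(s1) + 1):
        cur = [0] * (len(s2) + 1)
        for j in range(1, len(s2) + 1):
            if s1[i - 1] == s2[j - 1]:
                # Extend the match that ended one character earlier in both
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len = cur[j]
                    best_end1 = i
                    best_end2 = j
        prev = cur
    return best_end1 - best_len, best_end2 - best_len, best_len


start1, start2, size = longest_common_substring_dp('TAGCTCCTTAGC', 'GCAGCCATCCTTA')
print("Pos1: %s:%s\nPos2: %s:%s\nLength: %s" % (
    start1, start1 + size - 1, start2, start2 + size - 1, size))
```

Returning the start positions as well as the length makes it possible to report where the match sits in each string, as the question asks.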
Answer 2 (score: 1)
As noted in Eric Duminil's answer, one way to solve this is to use difflib.SequenceMatcher.find_longest_match:
from difflib import SequenceMatcher
upstream_seq = 'ATACATTGGCCTTGGCTTAGACTTAGATCTAGACCTGAAAATAACCTGCCGAAAAGACCCGCCCGACTGTTAATACTTTACGCGAGGCTCACCTTTTTGTTGTGCTCCC'
downstream_seq = 'ATACACGAAAAGCGTTCTTTTTTTGCCACTTTTTTTTTATGTTTCAAAACGGAAAATGTCGCCGTCGTCGGGAGAGTGCCTCCTCTTAGTTTATCAAATAAAGCTTTCG'
s = SequenceMatcher(None, upstream_seq, downstream_seq)
match = s.find_longest_match(0, len(upstream_seq), 0, len(downstream_seq))
print(match)
upstream_start = match[0]
upstream_end = match[0]+match[2]
seq = upstream_seq[match[0]:(match[0]+match[2])]
downstream_start = match[1]
downstream_end = match[1]+match[2]
# Apply the % formatting inside print() so this also runs on Python 3
print("Upstream seq: %s\nstart-stop: %s-%s\n" % (seq, upstream_start, upstream_end))
print("Downstream seq: %s\nstart-stop: %s-%s\n" % (seq, downstream_start, downstream_end))
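One caveat for longer inputs: SequenceMatcher applies an automatic junk heuristic by default, which can ignore very frequent items once a sequence has 200 or more items. With a four-letter DNA alphabet, every base is frequent, so matches may be missed. The sketch below passes autojunk=False to turn the heuristic off; the wrapper name longest_match is hypothetical.

```python
from difflib import SequenceMatcher


def longest_match(s1, s2):
    """Return (start_in_s1, start_in_s2, length) of the longest exact match."""
    # autojunk=False disables the popularity heuristic, so frequent
    # characters such as A, C, G and T are never treated as junk.
    matcher = SequenceMatcher(None, s1, s2, autojunk=False)
    match = matcher.find_longest_match(0, len(s1), 0, len(s2))
    return match.a, match.b, match.size


start1, start2, size = longest_match('TAGCTCCTTAGC', 'GCAGCCATCCTTA')
print("Seq: %s" % 'TAGCTCCTTAGC'[start1:start1 + size])
```

The sequences in this thread are shorter than 200 characters, so the heuristic does not apply to them; the flag matters once real sequences grow beyond that size.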