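I am learning Python. I don't want to use Biopython, or any imported module other than regular expressions, so that I can understand what the code is doing.
From a genetic sequence alignment, I would like to find the start and end positions of the gaps/indels "-" that sit next to each other in my sequence, the number of gap regions, and the length of each gap region. For example: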
>Seq1
ATC----GCTGTA--A-----T
I would like an output that looks something like this:
Number of gaps = 3
Index Position of Gap region 1 = 3 to 6
Length of Gap region 1 = 4
Index Position of Gap region 2 = 13 to 14
Length of Gap region 2 = 2
Index Position of Gap region 3 = 16 to 20
Length of Gap region 3 = 5
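I have tried to work this out on larger sequence alignments, but I have not been able to get even remotely close to figuring out how to do it.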
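Answer 0 (score: 3)
What you want is to use a regular expression to find each gap (one or more dashes, which translates to '-+', where the plus sign means one or more):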
import re

seq = 'ATC----GCTGTA--A-----T'
matches = list(re.finditer('-+', seq))

print('Number of gaps =', len(matches))
print()

for region_number, match in enumerate(matches, 1):
    print('Index Position of Gap region {} = {} to {}'.format(
        region_number,
        match.start(),
        match.end() - 1))
    print('Length of Gap region {} = {}'.format(
        region_number,
        match.end() - match.start()))
    print()
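In the code above, matches is a list of match objects. You can look up how enumerate works. Each match object has .start(), which returns the start index, and .end(), which returns the end index. Note that the end index here is one past the one you want, so I subtracted 1 from it.

Answer 1 (score: 1)
Here is my suggestion for the code. It is quite straightforward, short and easy to understand, and it imports no package other than re: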
import re

def findGaps(aSeq):
    # Get and print the list of gaps present in the sequence
    gaps = re.findall('[-]+', aSeq)
    print('Number of gaps = {0} \n'.format(len(gaps)))
    # Get and print start index, end index and length for each gap
    for i, gap in enumerate(gaps, 1):
        startIndex = aSeq.index(gap)
        endIndex = startIndex + len(gap) - 1
        print('Index Position of Gap region {0} = {1} to {2}'.format(i, startIndex, endIndex))
        print('Length of Gap region {0} = {1} \n'.format(i, len(gap)))
        # Mask the gap just reported so the next index() call skips it
        aSeq = aSeq.replace(gap, '*' * len(gap), 1)

findGaps("ATC----GCTGTA--A-----T")
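The replace call at the end of the loop matters because str.index always returns the first occurrence of its argument. The following is only a sketch to illustrate that point: the helper name gap_starts, its mask flag, and the test sequence 'A--C--G' are made up for this example and are not part of the answer.

```python
import re

def gap_starts(seq, mask=True):
    """Return the start index of every gap, located with str.index."""
    starts = []
    for gap in re.findall('-+', seq):
        start = seq.index(gap)
        starts.append(start)
        if mask:
            # Overwrite the gap just found so the next search skips it.
            seq = seq.replace(gap, '*' * len(gap), 1)
    return starts

print(gap_starts('A--C--G', mask=False))  # [1, 1]
print(gap_starts('A--C--G', mask=True))   # [1, 4]
```

Without the masking step, two gaps of equal length are both reported at the position of the first one.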
Answer 2 (score: 0)
Here is my take on this problem:
import itertools

nucleotide = 'ATC----GCTGTA--A-----T'

# group the repeated positions
gaps = [(k, sum(1 for _ in vs)) for k, vs in itertools.groupby(nucleotide)]

# text formatting
summary_head = "Number of gaps = {0}"
summary_gap = """
Index Position of Gap region {0} = {2} to {3}
Length of Gap region {0} = {1}
"""

# Print output
print(summary_head.format(len([g for g in gaps if g[0] == "-"])))
gcount = 1  # this will count the gap number
position = 0  # this will make sure we know the position in the sequence
for i, g in enumerate(gaps):
    if g[0] == "-":
        gini = position  # start position current gap
        gend = position + g[1] - 1  # end position current gap
        print(summary_gap.format(gcount, g[1], gini, gend))
        gcount += 1
    position += g[1]
This produces your expected output:
# Number of gaps = 3
# Index Position of Gap region 1 = 3 to 6
# Length of Gap region 1 = 4
# Index Position of Gap region 2 = 13 to 14
# Length of Gap region 2 = 2
# Index Position of Gap region 3 = 16 to 20
# Length of Gap region 3 = 5
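To make the grouping step easier to reuse, the same idea can be wrapped in a function that returns the regions instead of printing them. This is a minimal sketch, assuming that an inclusive end index is wanted as in the output above; the name gap_regions and the gap_char parameter are hypothetical.

```python
import itertools

def gap_regions(seq, gap_char='-'):
    """Return (start, end, length) for each run of gap_char, end inclusive."""
    regions = []
    position = 0  # index in seq where the current run starts
    for char, run in itertools.groupby(seq):
        length = sum(1 for _ in run)
        if char == gap_char:
            regions.append((position, position + length - 1, length))
        position += length
    return regions

print(gap_regions('ATC----GCTGTA--A-----T'))
# [(3, 6, 4), (13, 14, 2), (16, 20, 5)]
```

itertools.groupby yields one group per run of identical consecutive characters, so every run of dashes arrives as a single group and the running position gives its start index.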
Edit: alternative using pandas
import itertools
import pandas as pd

nucleotide = 'ATC----GCTGTA--A-----T'

# group the repeated positions
gaps = pd.DataFrame([(k, sum(1 for _ in vs)) for k, vs in itertools.groupby(nucleotide)])
gaps.columns = ["type", "length"]
gaps["ini"] = gaps["length"].cumsum() - gaps["length"]
gaps["end"] = gaps["ini"] + gaps["length"] - 1
gaps = gaps[gaps["type"] == "-"]
gaps.index = range(1, gaps.shape[0] + 1)

summary_head = "Number of gaps = {0}"
summary_gap = """
Index Position of Gap region {0} = {1[ini]} to {1[end]}
Length of Gap region {0} = {1[length]}
"""

print(summary_head.format(gaps.shape[0]))
for index, row in gaps.iterrows():
    print(summary_gap.format(index, row))
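The benefit of this alternative is that, if you are analyzing multiple sequences, you can add the sequence identifier as an extra column and keep the data from all your sequences in a single data structure; something like this: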
import itertools
import pandas as pd

nucleotides = ['>Seq1\nATC----GCTGTA--A-----T',
               '>Seq2\nATCTCC---TG--TCGGATG-T']

all_gaps = []
for nucleoseq in nucleotides:
    seqid, nucleotide = nucleoseq[1:].split("\n")
    gaps = pd.DataFrame([(k, sum(1 for _ in vs)) for k, vs in itertools.groupby(nucleotide)])
    gaps.columns = ["type", "length"]
    gaps["ini"] = gaps["length"].cumsum() - gaps["length"]
    gaps["end"] = gaps["ini"] + gaps["length"] - 1
    gaps = gaps[gaps["type"] == "-"]
    gaps.index = range(1, gaps.shape[0] + 1)
    gaps["seqid"] = seqid
    all_gaps.append(gaps)
all_gaps = pd.concat(all_gaps)
print(all_gaps)
This will generate a data container like:
type length ini end seqid
1 - 4 3 6 Seq1
2 - 2 13 14 Seq1
3 - 5 16 20 Seq1
1 - 3 6 8 Seq2
2 - 2 11 12 Seq2
3 - 1 20 20 Seq2
which you can format afterwards:
for k in all_gaps["seqid"].unique():
    seqg = all_gaps[all_gaps["seqid"] == k]
    print(">{}".format(k))
    print(summary_head.format(seqg.shape[0]))
    for index, row in seqg.iterrows():
        print(summary_gap.format(index, row))
which looks like:
>Seq1
Number of gaps = 3
Index Position of Gap region 1 = 3 to 6
Length of Gap region 1 = 4
Index Position of Gap region 2 = 13 to 14
Length of Gap region 2 = 2
Index Position of Gap region 3 = 16 to 20
Length of Gap region 3 = 5
>Seq2
Number of gaps = 3
Index Position of Gap region 1 = 6 to 8
Length of Gap region 1 = 3
Index Position of Gap region 2 = 11 to 12
Length of Gap region 2 = 2
Index Position of Gap region 3 = 20 to 20
Length of Gap region 3 = 1
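For readers who, like the asker, would rather stay with re alone, the same multi-sequence collection can be sketched without pandas. This is an illustration rather than part of the answer: the function names parse_records and collect_gaps and the dictionary keys are hypothetical, and the input is assumed to be strings made of a ">id" header line followed by the sequence.

```python
import re

def parse_records(records):
    """Yield (seqid, sequence) pairs from '>id' header plus sequence strings."""
    for record in records:
        header, sequence = record.split('\n', 1)
        # Join the sequence if it was wrapped over several lines.
        yield header.lstrip('>'), sequence.replace('\n', '')

def collect_gaps(records):
    """Return one dict per gap region across all records."""
    rows = []
    for seqid, sequence in parse_records(records):
        for number, match in enumerate(re.finditer('-+', sequence), 1):
            rows.append({
                'seqid': seqid,
                'region': number,
                'ini': match.start(),
                'end': match.end() - 1,  # match.end() is exclusive
                'length': match.end() - match.start(),
            })
    return rows

records = ['>Seq1\nATC----GCTGTA--A-----T',
           '>Seq2\nATCTCC---TG--TCGGATG-T']
for row in collect_gaps(records):
    print(row)
```

Each dictionary plays the role of one row of the DataFrame, so the list can be filtered by 'seqid' with an ordinary list comprehension.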
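Answer 3 (score: 0)
This is a somewhat more long-winded way of going about it than using a regular expression, but you could find the indices of the hyphens and group them using first differences: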
>>> import numpy as np
>>> def get_seq_gaps(seq):
...     gaps = np.array([i for i, el in enumerate(seq) if el == '-'])
...     diff = np.cumsum(np.append([False], np.diff(gaps) != 1))
...     un = np.unique(diff)
...     yield len(un)
...     for i in un:
...         subseq = gaps[diff == i]
...         yield i + 1, len(subseq), subseq.min(), subseq.max()
...
>>> def report_gaps(seq):
...     gaps = get_seq_gaps(seq)
...     print('Number of gaps = %s\n' % next(gaps), sep='')
...     for (i, l, mn, mx) in gaps:
...         print('Index Position of Gap region %s = %s to %s' % (i, mn, mx))
...         print('Length of Gap Region %s = %s\n' % (i, l), sep='')
...
>>> seq = 'ATC----GCTGTA--A-----T'
>>> report_gaps(seq)
Number of gaps = 3
Index Position of Gap region 1 = 3 to 6
Length of Gap Region 1 = 4
Index Position of Gap region 2 = 13 to 14
Length of Gap Region 2 = 2
Index Position of Gap region 3 = 16 to 20
Length of Gap Region 3 = 5
First, this builds an array of the indices at which there is a hyphen:
>>> gaps
array([ 3, 4, 5, 6, 13, 14, 16, 17, 18, 19, 20])
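Places where the first difference is not 1 indicate a break between gap regions. One extra False is prepended so that the result keeps the same length as gaps, and the cumulative sum then turns the breaks into a group label for each index: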
>>> diff
array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
Now take the unique elements of these groups, restrict gaps to the corresponding indices, and find its min/max.
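The same first-difference grouping can be followed step by step without NumPy. The sketch below is an illustration under that assumption, not the answer's code: the function name group_consecutive is hypothetical, and it starts a new run whenever the difference to the previous index is not 1.

```python
def group_consecutive(indices):
    """Split a sorted list of integers into runs of consecutive values."""
    runs = []
    for index in indices:
        if runs and index - runs[-1][-1] == 1:
            runs[-1].append(index)  # difference of 1: same gap region
        else:
            runs.append([index])    # any other difference: new gap region
    return runs

seq = 'ATC----GCTGTA--A-----T'
gap_indices = [i for i, char in enumerate(seq) if char == '-']
for number, run in enumerate(group_consecutive(gap_indices), 1):
    # region number, length, first index, last index
    print(number, len(run), run[0], run[-1])
```

Because each run is sorted, its first and last elements correspond to the min and max taken in the NumPy version.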