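I have a set of text files, each containing two very large character sequences of equal length. The sequences are DNA, so I will call them seq_1 and seq_2, and together they form an alignment. The files look like this: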
>HEADER1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>HEADER2
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
The possible characters in sequence 1, under >HEADER1, are ACGTN-, and the possible characters in sequence 2, under >HEADER2, are ACGTN-*.
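I want to analyze the sequences and return two lists of indices, which I will call valid and mismatch.
valid contains all (1-based) indices where the position in both sequences (the "alignment") is in the set ACGT; mismatch contains all (1-based) indices where both positions are in the set ACGT but do not match each other. mismatch is therefore a subset of valid.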
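A final condition is that I do NOT increment the index counter at positions where sequence 1 is "-", because these are treated as "gaps" in the coordinate system I am using.
This example shows an alignment together with my expected output: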
seq_1 = 'CT-A-CG' # total length = 5 because of the two gaps
seq_2 = 'CNCTA*G'
valid = [1,3,5] # all indices where both sequences are in 'ACGT' without counting gaps
mismatch = [3] # subset of valid where alignment does not match
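For reference, the rules above can be restated as a plain loop. This is only a sketch in Python 3 (the code elsewhere in this thread targets Python 2.7), and the name reference_divergence is made up for illustration. It is not meant to be fast, only to pin down the expected output:

```python
def reference_divergence(seq_1, seq_2):
    """Return (valid, mismatch) as 1-based, gap-free coordinates of seq_1."""
    bases = set("ACGT")
    valid, mismatch = [], []
    pos = 0  # counts only the columns where seq_1 is not a gap
    for a, b in zip(seq_1, seq_2):
        if a == "-":
            continue  # gap in seq_1: the coordinate does not advance
        pos += 1
        if a in bases and b in bases:
            valid.append(pos)
            if a != b:
                mismatch.append(pos)
    return valid, mismatch


print(reference_divergence("CT-A-CG", "CNCTA*G"))  # ([1, 3, 5], [3])
```

Any faster variant can be checked against this function on short inputs.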
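I would like to improve my current code (below). It extracts the sequences with regular expressions, zips and enumerates the non-gap sites into a generator object, and then, in the main time-consuming step, loops through that generator and fills in the two lists. I feel there must be an array-based or itertools solution that is much more efficient than a for loop over the sequences zipped together with their index, and I am looking for suggestions.
Code: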
import re
import itertools as it

# Python 2 code: relies on the built-in cmp and on itertools.izip.
def seq_divergence(infile):

    re_1 = re.compile('>HEADER1\n([^>]+)', re.MULTILINE)
    re_2 = re.compile('>HEADER2\n([^>]+)', re.MULTILINE)
    valid = []
    mismatch = []
    mycmp = cmp  # create local cmp for faster calling in the loop

    with open(infile, 'rb') as f:
        alignment = f.read()
        # get sequences and remove header and newlines
        seq_1 = iter(re_1.search(alignment).group(1).replace('\n',''))
        seq_2 = iter(re_2.search(alignment).group(1).replace('\n',''))
        # GENERATOR BLOCK:
        rm_gaps = ((i, j) for (i, j) in it.izip(seq_1, seq_2) if i != '-')  # remove gaps
        all_sites = enumerate(rm_gaps, 1)  # enumerate (1-based)
        valid_sites = ((pos, i, j) for (pos, (i, j)) in all_sites if set(i+j).issubset('ACGT'))  # all sites where both sequences are valid
        for (pos, i, j) in valid_sites:
            valid += [pos]
            if mycmp(i,j):  # cmp returns nonzero when i and j differ
                mismatch += [pos]

    return valid, mismatch
Edit: by popular demand, here is a link to one of the files, for anyone who wants to test their code: https://www.dropbox.com/s/i904fil7cvv1vco/chr1_homo_maca_100Multiz.fa?dl=0
Answer 0 (score: 1)
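Reading your code, I can tell you're a smart guy, so I'm going to give you something completely untested, and let you figure out how to make it work, and whether any of it is faster than what you already have :-)
(Hey, it's not like you gave a real dataset in your question...)
Edit: use hexadecimal numbers to calculate the mismatches.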
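#! /usr/bin/env python2.7

# Text is much faster in 2.7 than 3...

def rm_gaps(seq1, seq2):
    ''' Given a first sequence with gaps removed
        (as the list of substrings between the gaps),
        do the same operation to a second sequence.
    '''
    startpos = 0
    for substring in seq1:
        length = len(substring)
        # The slice end is relative to startpos, not to the string start.
        yield seq2[startpos:startpos + length]
        # Skip the substring plus the one gap character that follows it.
        startpos += length + 1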
def seq_divergence(infile):
    # Build a character translation map with ones
    # in the positions of valid bases, and
    # another one with hex numbers for each base.
    transmap_v = ['0'] * 256
    transmap_m = ['0'] * 256
    for ch, num in zip('ACGT', '1248'):
        transmap_v[ord(ch)] = '1'
        transmap_m[ord(ch)] = num
    transmap_v = ''.join(transmap_v)
    transmap_m = ''.join(transmap_m)

    # No need to do everything inside open -- you are
    # done with the file once you have read it in.
    with open(infile, 'rb') as f:
        alignment = f.read()

    # For this case, using regular string stuff might be faster than re
    h1 = '>HEADER1\n'
    h2 = h1.replace('1', '2')
    h1loc = alignment.find(h1)
    h2loc = alignment.find(h2)

    # This assumes header 1 comes before header 2.  If that is
    # not invariant, you will need more code here.
    seq1 = alignment[h1loc + len(h1):h2loc].replace('\n','')
    seq2 = alignment[h2loc + len(h2):].replace('\n','')

    # Remove the gaps
    seq1 = seq1.split('-')
    seq2 = rm_gaps(seq1, seq2)
    seq1 = ''.join(seq1)
    seq2 = ''.join(seq2)
    assert len(seq1) == len(seq2)

    # Let's use Python's long integer capability to
    # find locations where both sequences are valid.
    # Convert each sequence into a binary number,
    # and AND them together.
    num1 = int(seq1.translate(transmap_v), 2)
    num2 = int(seq2.translate(transmap_v), 2)
    valid_bits = ('{0:0%db}' % len(seq1)).format(num1 & num2)
    assert len(valid_bits) == len(seq1)

    # Now for the mismatch -- use hexadecimal instead
    # of binary here.  The 4 bits per character position
    # nicely matches our 4 possible bases.
    num1 = int(seq1.translate(transmap_m), 16)
    num2 = int(seq2.translate(transmap_m), 16)
    match_digits = ('{0:0%dx}' % len(seq1)).format(num1 & num2)
    assert len(match_digits) == len(seq1)

    # This could possibly use some work.  For example, if
    # you expect very few invalid and/or mismatches, you
    # could split on '0' in both these cases, and then
    # use the length of the strings between the zeros
    # and concatenate ranges for valid, or use them as
    # skip distances for the mismatches.
    valid = [x for x, y in enumerate(valid_bits, 1) if y == '1']
    # A zero hex digit means "no shared base", which also happens at
    # invalid positions, so only valid positions count as mismatches.
    mismatch = [x for x in valid if match_digits[x - 1] == '0']

    return valid, mismatch
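The translate-and-AND idea can also be tried on plain strings, without a file. The following is a minimal Python 3 sketch of that idea, not the answer's original code: the names VALID, CODES and bitmask_divergence are invented here, and it assumes both sequences are ASCII and of equal length.

```python
from itertools import compress

BASES = b"ACGT"
# VALID maps A/C/G/T to "1" and every other byte to "0".
VALID = bytes(ord("1") if b in BASES else ord("0") for b in range(256))
# CODES gives each base its own bit (one hex digit), every other byte "0".
_codes = bytearray(b"0" * 256)
for base, digit in zip(BASES, b"1248"):
    _codes[base] = digit
CODES = bytes(_codes)


def bitmask_divergence(seq_1, seq_2):
    assert len(seq_1) == len(seq_2)
    # Drop the columns where seq_1 has a gap, in both sequences.
    keep = [a != "-" for a in seq_1]
    s1 = "".join(compress(seq_1, keep)).encode("ascii")
    s2 = "".join(compress(seq_2, keep)).encode("ascii")
    n = len(s1)
    if n == 0:
        return [], []  # int() cannot parse an empty string
    # One bit per position: set where both sequences hold a valid base.
    both_valid = int(s1.translate(VALID), 2) & int(s2.translate(VALID), 2)
    # One hex digit per position: nonzero only where the bases are equal.
    same_base = int(s1.translate(CODES), 16) & int(s2.translate(CODES), 16)
    valid_bits = format(both_valid, "0%db" % n)
    same_digits = format(same_base, "0%dx" % n)
    valid = [i for i, bit in enumerate(valid_bits, 1) if bit == "1"]
    mismatch = [i for i in valid if same_digits[i - 1] == "0"]
    return valid, mismatch


print(bitmask_divergence("CT-A-CG", "CNCTA*G"))  # ([1, 3, 5], [3])
```

Whether this beats a simple loop on real data is untested here. The final list comprehensions still visit every position, so any gain would have to come from the translate and integer steps.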
Answer 1 (score: 1)
Here is my version (though you may start throwing stones at me for obfuscation). Please post some longer test data, I would like to test it...
import numpy as np

seq1 = 'CT-A-CG'
seq2 = 'CNCTA*G'

def is_not_gap(a): return 0 if a == '-' else 1
def is_valid(a): return 1 if a in ('A', 'C', 'G', 'T') else 0
def is_mm(t): return 0 if t[0] == t[1] else 1

# find the indexes to be retained; only gaps in seq1 are dropped,
# because the question counts positions in seq1's coordinate system
retainx = np.where([is_not_gap(a) for a in seq1])[0].tolist()

# find the valid ones (list comprehensions instead of map, so this
# also runs on Python 3, where map returns an iterator)
valid0 = np.where(np.multiply([is_valid(a) for a in seq1],
                              [is_valid(b) for b in seq2])[retainx])[0].tolist()

# find the mismatches
mm = np.array([is_mm(t) for t in zip(seq1, seq2)])
mismatch0 = set(valid0) & set(np.where(mm[retainx])[0].tolist())
Result (zero-based indexing):
valid0
[0, 2, 4]
mismatch0
{2}
(If you like, I can post a longer, more detailed version.)
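The per-character Python calls above can in principle be replaced by whole-array operations. The sketch below is one possible way to do that in Python 3 with NumPy, under the assumption that both sequences are ASCII strings of equal length. The function name numpy_divergence is hypothetical, and no timing claims are made for it. It returns 1-based indices, as the question asks.

```python
import numpy as np


def numpy_divergence(seq_1, seq_2):
    # View each sequence as an array of byte values (no per-character Python calls).
    s1 = np.frombuffer(seq_1.encode("ascii"), dtype=np.uint8)
    s2 = np.frombuffer(seq_2.encode("ascii"), dtype=np.uint8)
    bases = np.frombuffer(b"ACGT", dtype=np.uint8)
    # Running count of non-gap characters in seq_1: this is the 1-based,
    # gap-free coordinate of every column that is not itself a gap.
    coords = np.cumsum(s1 != ord("-"))
    # Columns where both sequences hold one of A, C, G, T.
    both_valid = np.isin(s1, bases) & np.isin(s2, bases)
    valid = coords[both_valid]
    mismatch = coords[both_valid & (s1 != s2)]
    return valid.tolist(), mismatch.tolist()


print(numpy_divergence("CT-A-CG", "CNCTA*G"))  # ([1, 3, 5], [3])
```

Gap columns in seq_1 never pass the both_valid mask, so the repeated coordinate they receive from the cumulative sum is never selected.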