How to zip two very large strings and return the indices of matches and mismatches?

Time: 2015-07-29 01:11:54

Tags: python arrays regex iterator enumerate

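I have a set of text files, each containing two very large character sequences of the same length. The sequences are DNA, so I will call them seq_1 and seq_2, and together they form the alignment. The files look like this: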

>HEADER1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>HEADER2
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

The possible characters in sequence 1 (under >HEADER1) are ACGTN-, and the possible characters in sequence 2 (under >HEADER2) are ACGTN-*.

I want to analyze the sequences and return two lists of indices, which I will call valid and mismatch.

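valid contains all (1-based) indices where the position in both sequences (the "alignment") is in the set ACGT; mismatch contains all (1-based) indices where both positions in the alignment are in the set ACGT but do not match each other. mismatch is therefore a subset of valid.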

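One final condition is that I do NOT increment the index counter at positions where sequence 1 is "-", because these are treated as "gaps" in the coordinate system I am using.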

This example shows an alignment together with my expected output:

seq_1      = 'CT-A-CG'  # total length = 5 because of the two gaps
seq_2      = 'CNCTA*G'
valid      = [1,3,5]    # all indices where both sequences are in 'ACGT' without counting gaps
mismatch   = [3]        # subset of valid where alignment does not match
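
The rules above can be written down as a plain loop. This is only a minimal Python 3 sketch to check faster versions against, not the code being optimized; the function name `reference_divergence` and the constant `VALID_BASES` are hypothetical, and it assumes both sequences are already stripped of headers and newlines.

```python
VALID_BASES = frozenset("ACGT")


def reference_divergence(seq_1, seq_2):
    """Return (valid, mismatch) as lists of 1-based indices.

    The index counter only advances at positions where seq_1 is not a gap.
    """
    valid, mismatch = [], []
    pos = 0
    for a, b in zip(seq_1, seq_2):
        if a == "-":
            continue  # gap in sequence 1: position is not counted
        pos += 1
        if a in VALID_BASES and b in VALID_BASES:
            valid.append(pos)
            if a != b:
                mismatch.append(pos)
    return valid, mismatch


print(reference_divergence("CT-A-CG", "CNCTA*G"))  # ([1, 3, 5], [3])
```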

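I am hoping to improve my current code (below). It extracts the sequences with regular expressions, zips and enumerates the non-gap sites into a generator object, and then (the main time-consuming step) loops over this generator and fills in the two lists. I feel there must be an array-based or itertools solution to this problem that is more efficient than a for loop over the sequences zipped together with their index, and I am looking for suggestions.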

Code:

import re
import itertools as it

def seq_divergence(infile):

    re_1 = re.compile('>HEADER1\n([^>]+)', re.MULTILINE)
    re_2 = re.compile('>HEADER2\n([^>]+)', re.MULTILINE)    

    valid     = []
    mismatch  = []

    mycmp     = cmp  # create local cmp for faster calling in the loop

    with open(infile, 'rb') as f:

        alignment = f.read()

    # get sequences and remove header and newlines
    seq_1       = iter(re_1.search(alignment).group(1).replace('\n',''))
    seq_2       = iter(re_2.search(alignment).group(1).replace('\n',''))

    # GENERATOR BLOCK:
    rm_gaps     = ((i, j) for (i, j) in it.izip(seq_1, seq_2) if i != '-')  # remove gaps
    all_sites   = enumerate(rm_gaps, 1)  # enumerate (1-based)
    valid_sites = ((pos, i, j) for (pos, (i, j)) in all_sites if set(i+j).issubset('ACGT'))  # all sites where both sequences are valid

    for (pos, i, j) in valid_sites:

        valid += [pos]
        if mycmp(i,j):
            mismatch += [pos]

    return valid, mismatch

Edit: by popular demand, here is a link to one of the files for anyone who wants to test the code: https://www.dropbox.com/s/i904fil7cvv1vco/chr1_homo_maca_100Multiz.fa?dl=0

2 Answers:

Answer 0 (score: 1)

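Reading your code, I can tell you're a smart guy, so I'll give you something completely untested and let you figure out how to make it work, and whether any of it is faster than what you already have :-)

(Hey, it's not like you gave a real data set in your question...)

Edit: use hexadecimal digits to compute the mismatches.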

#! /usr/bin/env python2.7

# Text is much faster in 2.7 than 3...

def rm_gaps(seq1, seq2):
    ''' Given a first sequence with gaps removed,
        do the same operation to a second sequence.
    '''
    startpos = 0
    for substring in seq1:
        length = len(substring)
        yield seq2[startpos:startpos + length]  # slice end is relative to startpos
        startpos += length + 1

def seq_divergence(infile):

    # Build a character translation map with ones
    # in the positions of valid bases, and
    # another one with hex numbers for each base.

    transmap_v = ['0'] * 256
    transmap_m = ['0'] * 256
    for ch, num in zip('ACGT', '1248'):
        transmap_v[ord(ch)] = '1'
        transmap_m[ord(ch)] = num
    transmap_v = ''.join(transmap_v)
    transmap_m = ''.join(transmap_m)


    # No need to do everything inside open -- you are
    # done with the file once you have read it in.
    with open(infile, 'rb') as f:
        alignment = f.read()

    # For this case, using regular string stuff might be faster than re

    h1 = '>HEADER1\n'
    h2 = h1.replace('1', '2')

    h1loc = alignment.find(h1)
    h2loc = alignment.find(h2)

    # This assumes header 1 comes before header 2.  If that is
    # not invariant, you will need more code here.

    seq1 = alignment[h1loc + len(h1):h2loc].replace('\n','')
    seq2 = alignment[h2loc + len(h2):].replace('\n','')

    # Remove the gaps
    seq1 = seq1.split('-')
    seq2 = rm_gaps(seq1, seq2)
    seq1 = ''.join(seq1)
    seq2 = ''.join(seq2)

    assert len(seq1) == len(seq2)

    # Let's use Python's long integer capability to
    # find locations where both sequences are valid.
    # Convert each sequence into a binary number,
    # and AND them together.
    num1 = int(seq1.translate(transmap_v), 2)
    num2 = int(seq2.translate(transmap_v), 2)
    valid = ('{0:0%db}' % len(seq1)).format(num1 & num2)
    assert len(valid) == len(seq1)


    # Now for the mismatch -- use hexadecimal instead
    # of binary here.  The 4 bits per character position
    # nicely matches our 4 possible bases.
    num1 = int(seq1.translate(transmap_m), 16)
    num2 = int(seq2.translate(transmap_m), 16)
    mismatch = ('{0:0%dx}' % len(seq1)).format(num1 & num2)
    assert len(mismatch) == len(seq1)

    # This could possibly use some work.  For example, if
    # you expect very few invalid and/or mismatches, you
    # could split on '0' in both these cases, and then
    # use the length of the strings between the zeros
    # and concatenate ranges for valid, or use them as
    # skip distances for the mismatches.

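    # A zero hex digit means "different bases" only where both sites are
    # valid, so filter on the valid bit string before overwriting it.
    mismatch = [x for x, (v, m) in enumerate(zip(valid, mismatch), 1)
                if v == '1' and m == '0']
    valid = [x for x, y in enumerate(valid, 1) if y == '1']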

    return valid, mismatch
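
For readers on Python 3, the translate-and-AND idea above might look roughly like the sketch below. It is an illustration under stated assumptions, not a tested port of the answer: the names `_table`, `VALID_TABLE`, `BASE_TABLE` and `bitmask_divergence` are hypothetical, and the function assumes the gaps have already been removed from both sequences, so they are ASCII strings of equal length.

```python
def _table(mapping):
    """Build a 256-byte translation table mapping every byte to b'0',
    except the characters listed in `mapping`."""
    table = bytearray(b"0" * 256)
    for ch, digit in mapping.items():
        table[ord(ch)] = ord(digit)
    return bytes(table)


# '1' for a valid base, '0' for anything else (binary digits).
VALID_TABLE = _table({"A": "1", "C": "1", "G": "1", "T": "1"})
# One distinct bit per base, '0' for anything else (hex digits).
BASE_TABLE = _table({"A": "1", "C": "2", "G": "4", "T": "8"})


def bitmask_divergence(seq_1, seq_2):
    """Assumes gap-free, equal-length ASCII strings (an assumption of this sketch)."""
    n = len(seq_1)
    if n == 0:
        return [], []
    b1, b2 = seq_1.encode("ascii"), seq_2.encode("ascii")

    def to_int(raw, table, base):
        return int(raw.translate(table).decode("ascii"), base)

    # Bit is 1 only where both sequences hold a valid base.
    both_valid = to_int(b1, VALID_TABLE, 2) & to_int(b2, VALID_TABLE, 2)
    # Hex digit is non-zero only where both sequences hold the same base.
    same_base = to_int(b1, BASE_TABLE, 16) & to_int(b2, BASE_TABLE, 16)

    valid_bits = format(both_valid, "0%db" % n)
    same_digits = format(same_base, "0%dx" % n)

    valid = [i for i, v in enumerate(valid_bits, 1) if v == "1"]
    mismatch = [i for i, (v, s) in enumerate(zip(valid_bits, same_digits), 1)
                if v == "1" and s == "0"]
    return valid, mismatch


print(bitmask_divergence("CTACG", "CNT*G"))  # ([1, 3, 5], [3])
```

The final list comprehensions still touch every position, so whether this beats a plain loop on real data would need to be measured.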

Answer 1 (score: 1)

Here's my version (though you may start throwing stones at me for the obfuscation). Please post some longer test data, I'd like to test it...

seq1='CT-A-CG'
seq2='CNCTA*G'

import numpy as np
def is_not_gap(a): return 0 if (a=='-') else 1
def is_valid(a): return 1 if (a=='A' or a=='C' or a=='G' or a=='T' ) else 0
def is_mm(t): return 0 if t[0]==t[1] else 1

# find the indexes to be retained (only gaps in seq1 are skipped)
retainx=np.where( map(is_not_gap, seq1) )[0].tolist()

# find the valid ones
valid0=np.where(np.multiply( map( is_valid, seq1),map( is_valid, seq2))[retainx])[0].tolist()

# find the mismatches
mm=np.array(map( is_mm, zip( seq1,seq2)))
mismatch0=set(valid0) & set(np.where(mm[retainx])[0])

Result (zero-based indexing):

 valid0
 [0, 2, 4]

 mismatch0
 {2}

(If you'd like, I can post a longer, more detailed version)
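
For an array-based take on the same idea, here is a minimal NumPy sketch in Python 3 that avoids per-character Python function calls. It assumes both sequences are plain ASCII strings of equal length with headers and newlines already stripped; the name `seq_divergence_np` is hypothetical, and unlike the result above it returns 1-based indices.

```python
import numpy as np


def seq_divergence_np(seq_1, seq_2):
    """Return (valid, mismatch) as lists of 1-based indices."""
    # View each sequence as an array of byte codes (no per-character loop).
    a = np.frombuffer(seq_1.encode("ascii"), dtype=np.uint8)
    b = np.frombuffer(seq_2.encode("ascii"), dtype=np.uint8)

    # Drop the columns where sequence 1 has a gap; they are not counted.
    keep = a != ord("-")
    a, b = a[keep], b[keep]

    # A site is valid when both sequences hold one of A, C, G, T.
    bases = np.frombuffer(b"ACGT", dtype=np.uint8)
    ok = np.isin(a, bases) & np.isin(b, bases)

    pos = np.arange(1, a.size + 1)   # 1-based coordinates after gap removal
    valid = pos[ok]
    mismatch = pos[ok & (a != b)]    # valid sites whose bases differ
    return valid.tolist(), mismatch.tolist()


print(seq_divergence_np("CT-A-CG", "CNCTA*G"))  # ([1, 3, 5], [3])
```

Keeping the results as arrays instead of calling `tolist()` may be preferable for very long sequences, since building large Python lists can dominate the run time.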