Question

I wrote some Python code that reads DNA sequences (so that I can run motif alignments on them later), but I'm looking for a more efficient way to do it.

In case it helps, here is the code:

handle = open("a.fas.txt", "r")
a = handle.readlines()[1:]
a = ''.join([x.strip() for x in a])
with open("Output.txt", "w") as text_file:
    text_file.write(a)

f = 0
z = 100
b = ''
while f < len(a):
    b += a[f:z]+'\n'
    f += 1
    z += 1
with open("2.txt", "w") as runner_mtfs:
   runner_mtfs.write(b)

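In short, I want to run a number of analyses on each line of b, but I don't know a more efficient way to do that than splitting out every 100 base pairs. The output file is over 500 megabytes. Any suggestions?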

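The first part of the code just reads a DNA sequence: I join all of its lines together, and then I split the result into pieces of 100 base pairs.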

Answer 1

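The main problem I see here is that you are writing everything out to a file. There is no point in doing that. The large output file you create is highly redundant, and loading it back in when you do your analysis does not help.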

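Once you have loaded the file, every window you are interested in is a[x:x+100] for some x. You should not need to generate these windows explicitly at all: there is no benefit in doing so. Walk through the sequence and generate your matrices directly from each window.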
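
As a minimal sketch of that idea, the generator below hands out each window on demand instead of writing them to a file. The names iter_windows and gc_fraction are hypothetical, and gc_fraction is only a stand-in for whatever analysis you actually need.

```python
def iter_windows(seq, size=100):
    """Yield every full window seq[x:x+size], one at a time."""
    for start in range(len(seq) - size + 1):
        yield seq[start:start + size]


def gc_fraction(window):
    """Placeholder analysis (an assumption): fraction of G and C bases."""
    return (window.count("G") + window.count("C")) / len(window)


# Tiny stand-in for the sequence read from the FASTA file.
seq = "ACGTGGCCAT"
for window in iter_windows(seq, size=4):
    print(window, gc_fraction(window))
```

Only one window exists in memory at any moment, so nothing close to the 500 MB intermediate file is ever built.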

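If you really do need the whole thing, generate it as a numpy array. Additionally, if you are not using any degenerate base codes, convert the sequence to uint8 values, using 0, 1, 2, 3 for A, C, G, T. This can help speed things up, especially if you need to take complements at any point, which then becomes simple bit fiddling.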
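
To illustrate the bit fiddling: with A=0, C=1, G=2, T=3, the complement of a base is its code XOR 3. The sketch below assumes that encoding; the helper names (LOOKUP, encode, complement, reverse_complement) are made up for this example, and the lookup table rejects anything other than A, C, G, T.

```python
import numpy as np

# Lookup table from ASCII code to base code: A=0, C=1, G=2, T=3.
# 255 marks every other character (for example N or other degenerate codes).
LOOKUP = np.full(256, 255, dtype=np.uint8)
for code, base in enumerate("ACGT"):
    LOOKUP[ord(base)] = code
    LOOKUP[ord(base.lower())] = code


def encode(seq):
    """Convert a DNA string to a uint8 array of codes 0..3."""
    raw = np.frombuffer(seq.encode("ascii"), dtype=np.uint8)
    codes = LOOKUP[raw]
    if (codes == 255).any():
        raise ValueError("sequence contains characters other than A, C, G, T")
    return codes


def complement(codes):
    """Swap A<->T and C<->G, which is XOR with 3 under this encoding."""
    return np.bitwise_xor(codes, np.uint8(3))


def reverse_complement(codes):
    """Complement the sequence and reverse its order."""
    return complement(codes)[::-1]
```

The table lookup also replaces the per-character Python loop, which tends to matter for a genome-sized string.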

NumPy can generate the array very efficiently using stride_tricks, as described in this blog post:

import numpy

def rolling_window(a, window):
    # Build a 2-D view in which row i is a[i:i+window]; no data is copied.
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return numpy.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

# Skip the FASTA header line and join the remaining lines into one string.
with open("U00096.2.fas.txt", "r") as handle:
    a = ''.join(x.strip() for x in handle.readlines()[1:])
# One single-byte character per base.
b = numpy.array(list(a), dtype='S1')
windows = rolling_window(b, 100)

Or, converting to integers:

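import numpy

def rolling_window(a, window):
    # Build a 2-D view in which row i is a[i:i+window]; no data is copied.
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return numpy.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

# Skip the FASTA header line and join the remaining lines into one string.
with open("U00096.2.fas.txt", "r") as handle:
    a = ''.join(x.strip() for x in handle.readlines()[1:])
# Map each base to a small integer. lower() accepts upper- or lower-case files;
# any other character (for example N) raises KeyError.
conv = {'a': 0, 'c': 1, 'g': 2, 't': 3}
b = numpy.array([conv[x] for x in a.lower()], dtype=numpy.uint8)
windows = rolling_window(b, 100)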

On my machine, this code is about ten times faster than yours.
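
For reference, newer NumPy releases (1.20 and later) ship numpy.lib.stride_tricks.sliding_window_view, which builds the same kind of view without hand-written strides. This is a different helper from the as_strided approach above. The sketch below uses it to count G and C bases per window; the function name gc_counts and the toy input are assumptions for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def gc_counts(codes, window):
    """Count C (1) and G (2) codes in every window of the encoded sequence."""
    windows = sliding_window_view(codes, window)  # read-only view, no copy
    is_gc = (windows == 1) | (windows == 2)
    return is_gc.sum(axis=1)


# Toy input using the 0, 1, 2, 3 = A, C, G, T encoding: A C G T G G A
codes = np.array([0, 1, 2, 3, 2, 2, 0], dtype=np.uint8)
print(gc_counts(codes, 4))
```

The view itself costs no extra memory, but the comparison allocates a full boolean array with one row per window, so for a whole genome it may be worth processing the view in slices.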

Answer 2

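If it is a .fasta-like file, it may well contain more than one sequence.
There are many examples on Stack Overflow of reading large files in Python, and some useful approaches are given here. I usually solve this with the recipe from the top answer (the with open(...) pattern). It is fast and uses little memory.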

It looks like you want to process the data with a fixed-size sliding window. I would do it like this:

def load_fasta(fasta_file_name, sliding_window_size = 100):
  buffer = ''
  with open(fasta_file_name) as f:
    for line in f:
      if line.startswith('>'):
        #skip or get some info from comment line
        buffer = ''
      else:
        #read next line
        buffer += line.strip('\r\n')
        offset = 0 # zero-based offset for current string
        while (offset + sliding_window_size <= len(buffer)):
          next_sliding_window = buffer[offset : offset + sliding_window_size]
          yield(next_sliding_window)
          offset += 1
        buffer = buffer[offset : ]

for window in load_fasta("a.fas.txt", 100):
  #do some processing with sliding window data
  print(window)

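If you do want to process portions of the data that are shorter than 100 (or, in my example, shorter than sliding_window_size), you will have to modify the function slightly (at the point where a new comment line appears and at the end of processing).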
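
One possible shape for that modification is sketched below. It assumes that a record shorter than the window should be yielded once, whole, and that each window should come with the header of its record. The function name and the (header, window) return format are assumptions, not part of the original function.

```python
def load_fasta_with_short_records(fasta_file_name, sliding_window_size=100):
    """Yield (header, window) pairs; a record too short for a full window
    is yielded once as a single, shorter piece."""
    header, buffer, produced = None, '', False
    with open(fasta_file_name) as f:
        for line in f:
            if line.startswith('>'):
                # New record: emit the previous one if it never filled a window.
                if buffer and not produced:
                    yield header, buffer
                header, buffer, produced = line[1:].strip(), '', False
            else:
                buffer += line.strip()
                offset = 0
                while offset + sliding_window_size <= len(buffer):
                    yield header, buffer[offset:offset + sliding_window_size]
                    produced = True
                    offset += 1
                # Keep only the tail that can still start a future window.
                buffer = buffer[offset:]
    # End of file: apply the same check to the last record.
    if buffer and not produced:
        yield header, buffer
```

When a record did produce full windows, the leftover tail is already contained in the last window, so it is not emitted again.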

You could also use Biopython.

Answer 3

Here is a class that does what you want.

"""
Read in genome of E. Coli (or whatever) from given input file,
process it in segments of 100 basepairs at a time.

Usage: 100pairs [-n <pairs>] [-p] <file>

<file>                 Input file.
-n,--numpairs <pairs>  Use <pairs> per iteration. [default: 100]
-p,--partial           Allow partial sequences at end of genome.
"""
import docopt

class GeneBuffer:
    def __init__(self, path, bases=100, partial=True):
        self._buf = None
        self.bases = int(bases)
        self.partial = partial
        self.path = path

    def __enter__(self):
        self._file = open(self.path, 'r')
        self._header = next(self._file)
        return self

    def __exit__(self, *args):
        if self._file:
            self._file.close()

    def __iter__(self):
        return self

    def __next__(self):
        if self._buf is None:
            self._buf = ''

        while self._file and len(self._buf) < self.bases:
            try:
                self._buf += next(self._file).strip()
            except StopIteration:
                self._file.close()
                self._file = None
                break

        if len(self._buf) < self.bases:
            if len(self._buf) == 0 or not self.partial:
                raise StopIteration

        result = self._buf[:self.bases]
        self._buf = self._buf[1:]

        return result

def analyze(basepairs):
    """
    Dammit, Jim! I'm a computer programmer, not a geneticist!
    """
    print(basepairs)

def main(args):
    numpairs = args['--numpairs']
    partial = args['--partial']
    with GeneBuffer(args['<file>'], bases=numpairs, partial=partial) as genome:
        print("Header: ", genome._header)
        count = 0
        for basepairs in genome:
            count += 1
            print(count, end=' ')
            analyze(basepairs)

if __name__ == '__main__':
    args = docopt.docopt(__doc__)
    main(args)

How can I read DNA sequences more efficiently?
