I wrote some code in Python to read DNA sequences (to do motif alignment on them later), but I'm looking for a more efficient way to do this.
Please see below if you can help:
handle = open("a.fas.txt", "r")
a = handle.readlines()[1:]
a = ''.join([x.strip() for x in a])
with open("Output.txt", "w") as text_file:
    text_file.write(a)

f = 0
z = 100
b = ''
while f < len(a):
    b += a[f:z] + '\n'
    f += 1
    z += 1

with open("2.txt", "w") as runner_mtfs:
    runner_mtfs.write(b)
In short, I want to do a bunch of analyses on each line of b, but I don't know a more efficient way to do this than splitting the sequence into every 100-base-pair window. The output file is over 500 megabytes. Any suggestions?
The first part of the code just reads in a DNA sequence: I join all the lines together, and then split the result into 100-base-pair windows.
Answer 0 (score: 1)
The main problem I see here is that you are writing everything out to files. There is no point in doing that: the large output files you create are highly redundant, and loading them back in for analysis doesn't help.
After you have loaded the file initially, every window you are interested in looking at is a[x:x+100] for some x. You don't need to generate those windows explicitly at all; there is no benefit to doing so. Go through and generate these matrices directly from each window.
If you really do need the whole thing, generate it as a numpy array. Additionally, assuming you are not using any degenerate base codes, convert the sequence to uint8s, using 0, 1, 2, 3 for A, C, G, T. This can help speed things up, especially if you need to take complements at any point, which becomes simple bit twiddling.
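For instance (a minimal sketch, assuming the 0/1/2/3 encoding just described), complementing becomes a single XOR with 3, because 0^3 = 3 pairs A with T and 1^3 = 2 pairs C with G:

import numpy

seq = numpy.array([0, 1, 2, 3], dtype=numpy.uint8)   # encodes "ACGT"
comp = seq ^ 3           # -> [3, 2, 1, 0], i.e. "TGCA": A<->T, C<->G
rev_comp = comp[::-1]    # reverse complement is just a reversed view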
Numpy can generate this array of windows very efficiently using stride_tricks, as described in this blog post:
import numpy

def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return numpy.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

handle = open("U00096.2.fas.txt", "r")
a = handle.readlines()[1:]
a = ''.join([x.strip() for x in a])

b = numpy.array(list(a), dtype='c')  # 'c': one single-character byte per base
rolling_window(b, 100)
Or, converting to integers:
def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return numpy.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

handle = open("U00096.2.fas.txt", "r")
a = handle.readlines()[1:]
a = ''.join([x.strip() for x in a])

conv = {'a': 0, 'c': 1, 'g': 2, 't': 3}
b = numpy.array([conv[x] for x in a], dtype=numpy.uint8)
rolling_window(b, 100)
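As one example of what this buys you (a sketch, assuming the conv mapping above, where C is 1 and G is 2), the GC content of every window comes out of a single vectorized expression over the strided view, with no copies of the sequence made:

windows = rolling_window(b, 100)   # shape (len(b) - 99, 100), a view into b
gc = ((windows == 1) | (windows == 2)).mean(axis=1)   # fraction of C/G per window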
This code is about ten times faster than yours on my machine.
Answer 1 (score: 1)
I would read the file lazily, line by line (using with open(...)). It's fast and consumes little memory, too. You seem to want to process the data with a fixed-size sliding window. I would do it like this:
def load_fasta(fasta_file_name, sliding_window_size=100):
    buffer = ''
    with open(fasta_file_name) as f:
        for line in f:
            if line.startswith('>'):
                # skip the comment line, or extract some info from it
                buffer = ''
            else:
                # accumulate the next sequence line
                buffer += line.strip('\r\n')
                offset = 0  # zero-based offset into the current buffer
                while offset + sliding_window_size <= len(buffer):
                    next_sliding_window = buffer[offset : offset + sliding_window_size]
                    yield next_sliding_window
                    offset += 1
                buffer = buffer[offset:]

for window in load_fasta("a.fas.txt", 100):
    # do some processing with the sliding-window data
    print(window)
If you do want to process the trailing portions of data shorter than 100 bases (or, in my example, shorter than sliding_window_size), you will have to modify the function slightly (where a new comment line appears, and at the end of processing); a sketch of that modification follows.
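Here is one possible sketch of that modification (load_fasta_with_tail is just an illustrative name); it flushes the ever-shorter tail windows whenever a new '>' header begins and again at end of file:

def load_fasta_with_tail(fasta_file_name, sliding_window_size=100):
    def flush(buffer):
        # emit the remaining (shorter) windows left in the buffer
        for offset in range(len(buffer)):
            yield buffer[offset : offset + sliding_window_size]
    buffer = ''
    with open(fasta_file_name) as f:
        for line in f:
            if line.startswith('>'):
                yield from flush(buffer)
                buffer = ''
            else:
                buffer += line.strip('\r\n')
                offset = 0
                while offset + sliding_window_size <= len(buffer):
                    yield buffer[offset : offset + sliding_window_size]
                    offset += 1
                buffer = buffer[offset:]
    yield from flush(buffer)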
You could also take a look at biopython.
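For instance, a minimal sketch with biopython's SeqIO module (this assumes biopython is installed; the windowing itself is plain string slicing):

from Bio import SeqIO

for record in SeqIO.parse("a.fas.txt", "fasta"):
    seq = str(record.seq)                 # full sequence, header handled for you
    for i in range(len(seq) - 100 + 1):
        window = seq[i : i + 100]
        # ... analyze the window here ...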
Answer 2 (score: 1)
Here is a class that does what you want.
"""
Read in genome of E. Coli (or whatever) from given input file,
process it in segments of 100 basepairs at a time.
Usage: 100pairs [-n <pairs>] [-p] <file>
<file> Input file.
-n,--numpairs <pairs> Use <pairs> per iteration. [default: 100]
-p,--partial Allow partial sequences at end of genome.
"""
import docopt
class GeneBuffer:
def __init__(self, path, bases=100, partial=True):
self._buf = None
self.bases = int(bases)
self.partial = partial
self.path = path
def __enter__(self):
self._file = open(self.path, 'r')
self._header = next(self._file)
return self
def __exit__(self, *args):
if self._file:
self._file.close()
def __iter__(self):
return self
def __next__(self):
if self._buf is None:
self._buf = ''
while self._file and len(self._buf) < self.bases:
try:
self._buf += next(self._file).strip()
except StopIteration:
self._file.close()
self._file = None
break
if len(self._buf) < self.bases:
if len(self._buf) == 0 or not self.partial:
raise StopIteration
result = self._buf[:self.bases]
self._buf = self._buf[1:]
return result
def analyze(basepairs):
"""
Dammit, Jim! I'm a computer programmer, not a geneticist!
"""
print(basepairs)
def main(args):
numpairs = args['--numpairs']
partial = args['--partial']
with GeneBuffer(args['<file>'], bases=numpairs, partial=partial) as genome:
print("Header: ", genome._header)
count = 0
for basepairs in genome:
count += 1
print(count, end=' ')
analyze(basepairs)
if __name__ == '__main__':
args = docopt.docopt(__doc__)
main(args)
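Assuming you save this as 100pairs.py and have docopt installed (pip install docopt), an invocation might look like: python 100pairs.py -n 100 -p U00096.2.fas.txt. docopt builds the args dictionary that main consumes directly from the usage string at the top of the module.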