I have a system with 32 GB of memory, most of which is available for the work I'm attempting:
$ more /proc/meminfo
MemFree: 29535136 kB
MemAvailable: 30789956 kB
...
I have some code that encodes the letters of a string as vectors:
#!/usr/bin/env python
import os
import sys
import numpy as np
from Bio import SeqIO
import errno
import gzip
import shutil

seq_encoding = {'A' : [1, 0, 0, 0],
                'C' : [0, 1, 0, 0],
                'G' : [0, 0, 1, 0],
                'T' : [0, 0, 0, 1],
                'N' : [0, 0, 0, 0]}

sequence_chunk_length = 200

def sequence_split_by_length(seq, n):
    """
    A generator to divide a sequence into chunks of n characters and return
    the base array.
    """
    while seq:
        yield [seq_encoding[base] for base in seq[:n].upper()]
        seq = seq[n:]

def encode_chromosome(name, length):
    enc_records = []
    fasta_fn = os.path.join(fasta_directory, name + '.fa')
    fasta_fh = open(fasta_fn, "rU")
    for record in SeqIO.parse(fasta_fh, "fasta"):
        for chunk in sequence_split_by_length(str(record.seq), sequence_chunk_length):
            enc_records.extend(np.asarray(chunk))
    fasta_fh.close()
    enc_arr = np.asarray(enc_records)
    # ... some more code not relevant to exception ...
The encoding fails at this line:
enc_arr = np.asarray(enc_records)
Here is the relevant part of the exception that is thrown:
Traceback (most recent call last):
  File "./encode_sequences.py", line 95, in <module>
    res = encode_chromosome(chromosome_name, sequence_chunk_length)
  File "./encode_sequences.py", line 78, in encode_chromosome
    enc_arr = np.asarray(enc_records)
...
MemoryError
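The data structure to be encoded is about 1 GB in size, which seems like it should fit into the free memory available on this system.
Is there an alternative method or procedure for converting a Python list to a NumPy array that avoids the MemoryError exception raised by NumPy methods such as asarray()?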
Answer 0 (score: 0)
This should be a quick fix:
enc_records = []
for record in SeqIO.parse(fasta_fh, "fasta"):
    for chunk in sequence_split_by_length(str(record.seq), sequence_chunk_length):
        enc_records.append(np.asarray(chunk, dtype=np.int8))
enc_arr = np.vstack(enc_records)
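I changed two things:
int8 is used instead of the default integer dtype. NumPy's default integer is 32 or 64 bits, depending on the platform, whereas int8 is only 8 bits.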
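To make the dtype point concrete, here is a minimal sketch that compares the memory footprint of one encoded chunk under the default integer dtype and under int8, then stacks two chunks with np.vstack. The toy sequence and the figure of 10**9 bases (reading "about 1 GB" as roughly one byte per base) are assumptions for illustration, not values taken from the question's data.

```python
import numpy as np

# Same encoding table as in the question.
seq_encoding = {'A': [1, 0, 0, 0],
                'C': [0, 1, 0, 0],
                'G': [0, 0, 1, 0],
                'T': [0, 0, 0, 1],
                'N': [0, 0, 0, 0]}

# Hypothetical toy input: one 200-base chunk (real data would come from FASTA).
seq = "ACGTN" * 40
chunk = [seq_encoding[base] for base in seq]

default_arr = np.asarray(chunk)               # platform default integer dtype
small_arr = np.asarray(chunk, dtype=np.int8)  # one byte per element

print(default_arr.dtype, default_arr.nbytes)
print(small_arr.dtype, small_arr.nbytes)      # int8 800 (200 rows * 4 columns * 1 byte)

# Stacking per-chunk int8 arrays yields a single 2-D int8 array.
enc_arr = np.vstack([small_arr, small_arr])
print(enc_arr.shape, enc_arr.dtype)           # (400, 4) int8

# Rough extrapolation, ASSUMING about 10**9 bases in total.
bases = 10**9
print(bases * 4 * default_arr.itemsize / 1e9, "GB with the default dtype")
print(bases * 4 * small_arr.itemsize / 1e9, "GB with int8")
```

Under that assumption, a 64-bit default integer would need about 32 GB for the final array alone, which would plausibly explain the MemoryError on a 32 GB machine, while int8 would need about 4 GB.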