I need to slice a very long string (a DNA sequence) in Python. Currently I'm doing it like this:
new_seq = clean_seq[start:end]
Each slice is around 20,000 characters long, and I take roughly 1,000 slices in total.
The input is a 250 MB FASTA file containing several sequences, each identified by an ID, and this approach takes far too long. The sequence strings come from the Biopython module:
import pandas as pd
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

def fasta_from_ann(annotation, sequence, feature, windows, output_fasta):
    df_gff = pd.read_csv(annotation, index_col=False, sep='\t', header=None)
    df_gff.columns = ['seqname', 'source', 'feature', 'start', 'end',
                      'score', 'strand', 'frame', 'attribute']
    fasta_seq = SeqIO.parse(sequence, 'fasta')
    buffer = []
    for record in fasta_seq:
        # keep only the GFF rows for this record and the requested feature
        df_extract = df_gff[(df_gff.seqname == record.id) & (df_gff.feature == feature)]
        for k, v in df_extract.iterrows():
            clean_seq = ''.join(str(record.seq).splitlines())
            # clamp the window to the sequence boundaries
            if int(v.start) - windows < 0:
                start = 0
            else:
                start = int(v.start) - windows
            if int(v.end) + windows > len(clean_seq):
                end = len(clean_seq)
            else:
                end = int(v.end) + windows
            new_seq = clean_seq[start:end]
            new_id = record.id + "_from_" + str(v.start) + "_to_" + str(v.end) + "_feature_" + v.feature
            desc = "attribute: " + v.attribute + " strand: " + v.strand
            seq = SeqRecord(Seq(new_seq), id=new_id, description=desc)
            buffer.append(seq)
        print(record.id)
    SeqIO.write(buffer, output_fasta, "fasta")
Perhaps there is a faster or more memory-efficient way to achieve this?