I need to slice a very long string (a DNA sequence) in Python. Currently I'm doing it like this:
new_seq = clean_seq[start:end]
Each slice is around 20,000 characters long, and I take roughly 1,000 slices in total.
The input is a 250 MB FASTA file containing several sequences, each identified by an ID, and this approach takes far too long. The sequence strings come from the Biopython module:
import pandas as pd
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

def fasta_from_ann(annotation, sequence, feature, windows, output_fasta):
    df_gff = pd.read_csv(annotation, index_col=False, sep='\t', header=None)
    df_gff.columns = ['seqname', 'source', 'feature', 'start', 'end',
                      'score', 'strand', 'frame', 'attribute']
    fasta_seq = SeqIO.parse(sequence, 'fasta')
    buffer = []
    for record in fasta_seq:
        # keep only the GFF rows for this record and the requested feature
        df_extract = df_gff[(df_gff.seqname == record.id) & (df_gff.feature == feature)]
        for k, v in df_extract.iterrows():
            clean_seq = ''.join(str(record.seq).splitlines())
            # clamp the window to the sequence boundaries
            if int(v.start) - windows < 0:
                start = 0
            else:
                start = int(v.start) - windows
            if int(v.end) + windows > len(clean_seq):
                end = len(clean_seq)
            else:
                end = int(v.end) + windows
            new_seq = clean_seq[start:end]
            new_id = record.id + "_from_" + str(v.start) + "_to_" + str(v.end) + "_feature_" + v.feature
            desc = "attribute: " + v.attribute + " strand: " + v.strand
            seq = SeqRecord(Seq(new_seq), id=new_id, description=desc)
            buffer.append(seq)
        print(record.id)
    SeqIO.write(buffer, output_fasta, "fasta")
Perhaps there is a faster or more memory-efficient way to achieve this?