在python中编辑文件并创建一个新文件

时间:2018-06-11 13:19:56

标签: python bioinformatics biopython fasta

我有一个大文本文件(“|”分隔),就像这个小例子:

>ENST00000511961.1|ENSG00000013561.13|OTTHUMG00000129660.5|OTTHUMT00000370661.3|RNF14-003|RNF14|278
MSSEDREAQEDELLALASIYDGDEFRKAESVQGGETRIYLDLPQNFKIFVSGNSNECLQNSGFEYTICFLPPLVLNFELPPDYPSSSPPSFTLSGKWLSPTQLSALCKHLDNLWEEHRGSVVLFAWMQFLKEETLAYLNIVSPFELKIGSQKKVQRRTAQASPNTELDFGGAAGSDVDQEEIVDERAVQDVESLSNLIQEILDFDQAQQIKCFNSKLFLCSICFCEKLGSECMYFLECRHVYCKACLKDYFEIQIRDGQVQCLNCPEPKCPSVATPGQ
>ENST00000506822.1|ENSG00000013561.13|OTTHUMG00000129660.5|OTTHUMT00000370662.1|RNF14-004|RNF14|132
MSSEDREAQEDELLALASIYDGDEFRKAESVQGGETRIYLDLPQNFKIFVSGNSNECLQNSGFEYTICFLPPLVLNFELPPDYPSSSPPSFTLSGKWLSPTQLSALCKHLDNLWEEHRGSVVLFAWMQFLKE
>ENST00000513019.1|ENSG00000013561.13|OTTHUMG00000129660.5|OTTHUMT00000370663.1|HAS-0|HAS|99
MSSEDREAQEDELLALASIYDGDEFRKAESVQGGETRIYLDLPQNFKIFVSGNSNECLQNSGFEYTICFLPPLVLNFELPPDYPSSSPPSFTLSGKWLS
>ENST00000356143.1|ENSG00000013561.13|OTTHUMG00000129660.5|-|HAS-202|HAS|474
MSSEDREAQEDELLALASIYDGDEFRKAESVQGGETRIYLDLPQNFKIFVSGNSNECLQNSGFEYTICFLPPLVLNFELPPDYPSSSPPSFTLSGKWLSPTQLSALCKHLDNLWEEHRGSVVLFAWMQFLKEETLAYLNIVSPFELKIGSQKKVQRRTAQASPNTELDFGGAAGSDVDQEEIVDERAVQDVESLSNLIQEILDFDQAQQIKCFNSKLFLCSICFCEKLGSECMYFLECRHVYCKACLKDYFEIQIRDGQVQCLNCPEPKCPSVATPGQVKELVEAELFARYDRLLLQSSLDLMADVVYCPRPCCQLPVMQEPGCTMGICSSCNFAFCTLCRLTYHGVSPCKVTAEKLMDLRNEYLQADEANKRLLDQRYGKRVIQKAL

第一行是以"<"开头的ID行,第二行是属于上述ID行的字符序列。看第6列有重复的名称,第7个长度是ID后面的行长度(字符序列)。我想根据第7列选择每个ID行的一个重复,这意味着具有最长长度的ID。小例子的预期输出是:

>ENST00000511961.1|ENSG00000013561.13|OTTHUMG00000129660.5|OTTHUMT00000370661.3|RNF14-003|RNF14|278
MSSEDREAQEDELLALASIYDGDEFRKAESVQGGETRIYLDLPQNFKIFVSGNSNECLQNSGFEYTICFLPPLVLNFELPPDYPSSSPPSFTLSGKWLSPTQLSALCKHLDNLWEEHRGSVVLFAWMQFLKEETLAYLNIVSPFELKIGSQKKVQRRTAQASPNTELDFGGAAGSDVDQEEIVDERAVQDVESLSNLIQEILDFDQAQQIKCFNSKLFLCSICFCEKLGSECMYFLECRHVYCKACLKDYFEIQIRDGQVQCLNCPEPKCPSVATPGQ
>ENST00000356143.1|ENSG00000013561.13|OTTHUMG00000129660.5|-|HAS-202|HAS|474
MSSEDREAQEDELLALASIYDGDEFRKAESVQGGETRIYLDLPQNFKIFVSGNSNECLQNSGFEYTICFLPPLVLNFELPPDYPSSSPPSFTLSGKWLSPTQLSALCKHLDNLWEEHRGSVVLFAWMQFLKEETLAYLNIVSPFELKIGSQKKVQRRTAQASPNTELDFGGAAGSDVDQEEIVDERAVQDVESLSNLIQEILDFDQAQQIKCFNSKLFLCSICFCEKLGSECMYFLECRHVYCKACLKDYFEIQIRDGQVQCLNCPEPKCPSVATPGQVKELVEAELFARYDRLLLQSSLDLMADVVYCPRPCCQLPVMQEPGCTMGICSSCNFAFCTLCRLTYHGVSPCKVTAEKLMDLRNEYLQADEANKRLLDQRYGKRVIQKAL

因此根据ID的长度,每条column 6行重复一次(查看column 7.) 我在python中尝试了以下代码,但它不起作用。你知道怎么解决吗?

from __future__ import print_function
import sys

def parse_fasta(data):
    name, seq = None, []
    for line in data:
        line = line.rstrip()
        if line.startswith('>'):
            if name:
                yield (name, ''.join(seq))
            name, seq = line, []
        else:
            seq.append(line)
    if name:
        yield (name, ''.join(seq))

isoforms = dict()
for defline, sequence in parse_fasta(sys.stdin):
    geneid = '.'.join(defline[1:].split('.')[:-1])
    if geneid in isoforms:
        otherdefline, othersequence = isoforms[geneid]
        if len(sequence) > len(othersequence):
            isoforms[geneid] = (defline, sequence)
    else:
        isoforms[geneid] = (defline, sequence)

for defline, sequence in isoforms.values():
    print(defline, sequence, sep='\n')

1 个答案:

答案 0 :(得分:1)

我建议您使用Biopython,而不是构建自己的解析器。注意我还添加了一个完整性检查(在您的情况下,结尾474的FASTA标题行实际上只有388的长度序列:

from Bio import SeqIO

def yield_records():
    seen = set()
    for record in SeqIO.parse('in.fa', 'fasta'):
        header_seq_len = int(record.description.split('|')[-1])
        seq_len = len(record)
        if header_seq_len != seq_len:
            print('Warning: the seq length {} != that stated in the header {}'
                   .format(seq_len, header_seq_len))
        if header_seq_len not in seen:
            yield record
            seen.add(header_seq_len)

SeqIO.write(yield_records(), 'out.fa', 'fasta')