Sequence alignment

Date: 2013-02-08 10:21:21

Tags: r biopython bioperl

I want to do pairwise alignments between UniProt and PDB sequences. I have an input file containing the PDB and UniProt IDs.

pdb id  uniprot id

1dbh    Q07889
1e43    P00692
1f1s    Q53591

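I need to:

1) read each line of the input file
2) retrieve the PDB and UniProt sequences from the pdb.fasta and uniprot.fasta files
3) do the alignment and calculate the sequence identity.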

Normally I use the following program for pairwise alignment and sequence identity calculation.

library("seqinr")
seq1 <- "MDEKRRAQHNEVERRRRDKINNWIVQLSKIIPDSSMESTKSGQSKGGILSKASDYIQELRQSNHR"
seq2<- "MKGQQKTAETEEGTVQIQEGAVATGEDPTSVAIASIQSAATFPDPNVKYVFRTENGGQVM"
library(Biostrings)
globalAlign<- pairwiseAlignment(seq1, seq2)
pid(globalAlign, type = "PID3")

I need the output printed like this:

pdbid   uniprotid  seq.identity
1dbh    Q07889      99
1e43    P00692      80
1f1s    Q53591      56

How should I change the code above? Your help would be much appreciated!

2 answers:

Answer 0 (score: 1)

Hopefully this code does what you are after:

class test():

    def get_seq(self, pdb,fasta_file): # Get sequences
        from Bio.PDB.PDBParser import PDBParser
        from Bio import SeqIO
        aa = {'ARG':'R','HIS':'H','LYS':'K','ASP':'D','GLU':'E','SER':'S','THR':'T','ASN':'N','GLN':'Q','CYS':'C','SEC':'U','GLY':'G','PRO':'P','ALA':'A','ILE':'I','LEU':'L','MET':'M','PHE':'F','TRP':'W','TYR':'Y','VAL':'V'}
        p=PDBParser(PERMISSIVE=1)
        structure_id="%s" % pdb[:-4]
        structure=p.get_structure(structure_id, pdb)
        residues = structure.get_residues()
        seq_pdb = ''
        for res in residues:
            res = res.get_resname() 
            if res in aa:
                seq_pdb = seq_pdb+aa[res]           

        with open(fasta_file, "r") as handle:
            for record in SeqIO.parse(handle, "fasta"):
                seq_fasta = record.seq # if the file holds several records, the last one is kept
        self.seq_aln(seq_pdb,seq_fasta)

    def seq_aln(self,seq1,seq2): # Align the sequences
        from Bio import pairwise2
        from Bio.SubsMat import MatrixInfo as matlist

        matrix = matlist.blosum62
        gap_open = -10
        gap_extend = -0.5

        alns = pairwise2.align.globalds(seq1, seq2, matrix, gap_open, gap_extend)
        top_aln = alns[0]
        aln_seq1, aln_seq2, score, begin, end = top_aln
        with open('aln.fasta', 'w') as outfile:
            outfile.write('> PDB_seq\n'+str(aln_seq1)+'\n> Uniprot_seq\n'+str(aln_seq2))
        print(aln_seq1+'\n'+aln_seq2)
        self.seq_id('aln.fasta')

    def seq_id(self,aln_fasta): # Get sequence identity
        from Bio import AlignIO

        # Read the alignment file passed in by seq_aln()
        with open(aln_fasta, "r") as input_handle:
            alignment = AlignIO.read(input_handle, "fasta")
        reference = alignment[0].seq # first row (PDB_seq); every row is compared to it
        for record in alignment:
            i = 0 # counts identity hits
            # j is the alignment column
            for j, amino_acid in enumerate(record.seq):
                if amino_acid != '-' and amino_acid == reference[j]:
                    i += 1
            gap_strip = str(record.seq).replace('-', '')

            # integer percentage relative to this record's ungapped length
            percent = 100*i//len(gap_strip)
            print(record.id+' '+str(percent))


a = test()
a.get_seq('1DBH.pdb','Q07889.fasta')

Output:

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------EQTYYDLVKAF-AEIRQYIRELNLIIKVFREPFVSNSKLFSANDVENIFSRIVDIHELSVKLLGHIEDTVE-TDEGSPHPLVGSCFEDLAEELAFDPYESYARDILRPGFHDRFLSQLSKPGAALYLQSIGEGFKEAVQYVLPRLLLAPVYHCLHYFELLKQLEEKSEDQEDKECLKQAITALLNVQSG-EKICSKSLAKRRLSESA-------------AIKK-NEIQKNIDGWEGKDIGQCCNEFI-EGTLTRVGAKHERHIFLFDGL-ICCKSNHGQPRLPGASNAEYRLKEKFF-RKVQINDKDDTNEYKHAFEIILKDENSVIFSAKSAEEKNNW-AALISLQYRSTL---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
MQAQQLPYEFFSEENAPKWRGLLVPALKKVQGQVHPTLESNDDALQYVEELILQLLNMLCQAQPRSASDVEERVQKSFPHPIDKWAIADAQSAIEKRKRRNPLSLPVEKIHPLLKEVLGYKIDHQVSVYIVAVLEYISADILKLVGNYVRNIRHYEITKQDIKVAMCADKVLMDMFHQDVEDINILSLTDEEPSTSGEQTYYDLVKAFMAEIRQYIRELNLIIKVFREPFVSNSKLFSANDVENIFSRIVDIHELSVKLLGHIEDTVEMTDEGSPHPLVGSCFEDLAEELAFDPYESYARDILRPGFHDRFLSQLSKPGAALYLQSIGEGFKEAVQYVLPRLLLAPVYHCLHYFELLKQLEEKSEDQEDKECLKQAITALLNVQSGMEKICSKSLAKRRLSESACRFYSQQMKGKQLAIKKMNEIQKNIDGWEGKDIGQCCNEFIMEGTLTRVGAKHERHIFLFDGLMICCKSNHGQPRLPGASNAEYRLKEKFFMRKVQINDKDDTNEYKHAFEIILKDENSVIFSAKSAEEKNNWMAALISLQYRSTLERMLDVTMLQEEKEEQMRLPSADVYRFAEPDSEENIIFEENMQPKAGIPIIKAGTVIKLIERLTYHMYADPNFVRTFLTTYRSFCKPQELLSLIIERFEIPEPEPTEADRIAIENGDQPLSAELKRFRKEYIQPVQLRVLNVCRHWVEHHFYDFERDAYLLQRMEEFIGTVRGKAMKKWVESITKIIQRKKIARDNGPGHNITFQSSPPTVEWHISRPGHIETFDLLTLHPIEIARQLTLLESDLYRAVQPSELVGSVWTKEDKEINSPNLLKMIRHTTNLTLWFEKCIVETENLEERVAVVSRIIEILQVFQELNNFNGVLEVVSAMNSSPVYRLDHTFEQIPSRQKKILEEAHELSEDHYKKYLAKLRSINPPCVPFFGIYLTNILKTEEGNPEVLKRHGKELINFSKRRKVAEITGEIQQYQNQPYCLRVESDIKRFFENLNPMGNSMEKEFTDYLFNKSLEIEPRNPKPLPRFPKKYSYPLKSPGVRPSNPRPGTMRHPTPLQQEPRKISYSRIPESETESTASAPNSPRTPLTPPPASGASSTTDVCSVFDSDHSSPFHSSNDTVFIQVTLPHGPRSASVSSISLTKGTDEVPVPPPVPPRRRPESAPAESSPSKIMSKHLDSPPAIPPRQPTSKAYSPRYSISDRTSISDPPESPPLLPPREPVRTPDVFSSSPLHLQPPPLGKKSDHGNAFFPNSPSPFTPPPPQTPSPHGTRRHLPSPPLTQEVDLHSIAGPPVPPRQSTSQHIPKLPPKTYKREHTHPSMHRDGPPLLENAHSS
PDB_seq 100 # pdb to itself would obviously have 100% identity
Uniprot_seq 24 # pdb sequence has 24% identity to the uniprot sequence

To process your input file, you need to put a.get_seq() inside a for loop over the lines of the text file.

Edit

Replace the seq_id function with this:

def seq_id(self,aln_fasta):
    from Bio import SeqIO

    record_iterator = SeqIO.parse(aln_fasta, "fasta")
    first_record = next(record_iterator)
    second_record = next(record_iterator)

    # Ungapped lengths of the two aligned sequences
    lengths = [len(str(first_record.seq).replace('-','')), len(str(second_record.seq).replace('-',''))]
    print('%s has a length of %d' % (first_record.id, lengths[0]))
    print('%s has a length of %d' % (second_record.id, lengths[1]))

    if lengths.index(min(lengths)) == 0: # If both sequences have the same length the PDB sequence will be taken as the shortest
        print('PDB sequence has the shortest length')
    else:
        print('UniProt sequence has the shortest length')

    identities = 0
    for i,v in enumerate(first_record.seq):
        if v == '-':
            continue # a gap column is never an identity
        if v == second_record.seq[i]:
            identities += 1

    # Multiply by 100.0 first so the division is not truncated to an integer
    print('Sequence Identity = %.2f percent' % (100.0*identities/min(lengths)))
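As a side note, the identity count does not strictly need the round trip through aln.fasta. The following is a minimal sketch, not the answerer's code, that works directly on the two aligned strings returned by the aligner. The function name `percent_identity` is made up for illustration. It divides by the length of the shorter ungapped sequence, the same denominator as `min(lengths)` above and as `type = "PID3"` in the Biostrings `pid` call from the question.

```python
def percent_identity(aln1, aln2, gap="-"):
    """Percent identity of two aligned, equal-length strings.

    The denominator is the length of the shorter ungapped sequence.
    `percent_identity` is a hypothetical helper, not part of Biopython.
    """
    if len(aln1) != len(aln2):
        raise ValueError("aligned sequences must have the same length")
    # Count columns where both rows hold the same residue; gaps never count.
    matches = sum(1 for a, b in zip(aln1, aln2) if a == b and a != gap)
    shorter = min(len(aln1.replace(gap, "")), len(aln2.replace(gap, "")))
    if shorter == 0:
        raise ValueError("at least one sequence is empty")
    return 100.0 * matches / shorter


if __name__ == "__main__":
    # Toy alignment: 3 identical columns, shorter sequence has 4 residues.
    print("%.2f" % percent_identity("AC-GT", "ACTGA"))
```

Inside seq_aln this could be called as `percent_identity(str(aln_seq1), str(aln_seq2))`, assuming both are plain aligned strings of equal length.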

To pass the arguments to the class, use:

with open('input_file.txt', 'r') as infile:
    next(infile)
    next(infile) # Going by your input file
    for line in infile:
        segs = line.split()
        if len(segs) < 2: # skip blank lines
            continue
        a.get_seq(segs[0]+'.pdb',segs[1]+'.fasta')
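The question asks for a three-column table (pdbid, uniprotid, seq.identity), which the code in this answer prints only piecemeal. Below is a rough sketch of the bookkeeping, with no Biopython involved. The helper names `read_id_pairs` and `format_identity_table` are hypothetical, and the sketch assumes every data line holds exactly two whitespace-separated fields, as in the example input.

```python
def read_id_pairs(lines):
    """Yield (pdb_id, uniprot_id) tuples from the lines of the input file.

    Assumption: data lines have exactly two fields. The header
    "pdb id  uniprot id" splits into four tokens and blank lines into
    none, so both are skipped without counting lines.
    """
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue
        yield fields[0], fields[1]


def format_identity_table(rows):
    """Format (pdb_id, uniprot_id, identity) rows as a tab-separated table."""
    out = ["pdbid\tuniprotid\tseq.identity"]
    for pdb_id, uniprot_id, identity in rows:
        out.append("%s\t%s\t%.0f" % (pdb_id, uniprot_id, identity))
    return "\n".join(out)


if __name__ == "__main__":
    example = ["pdb id  uniprot id", "", "1dbh    Q07889", "1e43    P00692"]
    # Placeholder identities taken from the question's desired output;
    # in real use they would come from the alignment step.
    placeholder = {"1dbh": 99.0, "1e43": 80.0}
    rows = [(p, u, placeholder[p]) for p, u in read_id_pairs(example)]
    print(format_identity_table(rows))
```

For this to plug into the class above, get_seq, seq_aln and seq_id would have to return the identity value instead of printing it.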

Answer 1 (score: 0)

Maybe something like this; a reproducible example (e.g., short files posted online) would help...

library(Biostrings)
pdb = readAAStringSet("pdb.fasta")
uniprot = readAAStringSet("uniprot.fasta")

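This reads all the sequences into two objects. pairwiseAlignment accepts a vector as its first (query) argument, so if you want to align every pdb sequence against every uniprot sequence, pre-allocate a result matrix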

pids = matrix(numeric(), length(uniprot), length(pdb),
              dimnames=list(names(uniprot), names(pdb)))

and then do the calculations

for (i in seq_along(uniprot)) {
    globalAlignment = pairwiseAlignment(pdb, uniprot[i])
    pids[i,] = pid(globalAlignment)
}
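For anyone following the Python answer instead, the same idea of loading each FASTA file once and then looking sequences up by ID can be sketched without any library. This is only an illustration: `read_fasta` is a made-up helper, and it assumes the record ID is the first whitespace-delimited token of each header line. The header layout of the asker's pdb.fasta and uniprot.fasta is not shown in the question, so the keys may need further trimming (for example, splitting headers of the form `sp|Q07889|...` on `|`).

```python
import io


def read_fasta(handle):
    """Return {record_id: sequence} for a FASTA file-like object.

    Assumption: the record ID is the first whitespace-delimited token
    after '>'. Sequence lines are concatenated without separators.
    """
    records = {}
    current_id = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Store the previous record before starting a new one.
            if current_id is not None:
                records[current_id] = "".join(chunks)
            header = line[1:].split()
            current_id = header[0] if header else ""
            chunks = []
        elif current_id is not None:
            chunks.append(line)
    if current_id is not None:
        records[current_id] = "".join(chunks)
    return records


if __name__ == "__main__":
    # Toy in-memory FASTA; real use would pass open("pdb.fasta").
    demo = io.StringIO(">id1 some description\nMKV\nLLA\n>id2\nGGG\n")
    print(read_fasta(demo))
```

With two such dictionaries, each line of the ID file becomes a pair of lookups, and only the listed pairs are aligned rather than the full all-against-all matrix.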