Question

I want to do pairwise alignments between UniProt and PDB sequences. I have an input file containing the UniProt and PDB IDs.

pdb id  uniprot id

1dbh    Q07889
1e43    P00692
1f1s    Q53591

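I need to 1) read each line of the input file, 2) retrieve the PDB and UniProt sequences from the pdb.fasta and uniprot.fasta files, and 3) align them and calculate the sequence identity.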
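Steps 1 and 2 can be sketched in plain Python without any bioinformatics library. This is only a minimal sketch under stated assumptions: the helper names `read_id_pairs` and `read_fasta` are made up for illustration, data lines are assumed to have exactly two whitespace-separated columns, and each FASTA identifier is assumed to be the first word of its header line. Real UniProt headers often look like `>sp|Q07889|...`, so the identifier extraction may need adjusting for your files.

```python
import io

def read_id_pairs(handle):
    """Yield (pdb_id, uniprot_id) tuples from the ID table."""
    for line in handle:
        fields = line.split()
        # Assumption: data lines have exactly two columns. The header
        # "pdb id  uniprot id" splits into four fields and blank lines
        # into none, so both are skipped.
        if len(fields) == 2:
            yield fields[0], fields[1]

def read_fasta(handle):
    """Return {identifier: sequence} for every record in a FASTA file."""
    chunks = {}
    current = None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            words = line[1:].split()
            # Assumption: the identifier is the first word of the header
            current = words[0] if words else ""
            chunks[current] = []
        elif line and current is not None:
            chunks[current].append(line)
    return {name: "".join(parts) for name, parts in chunks.items()}

# Tiny made-up inputs, for illustration only
id_table = io.StringIO("pdb id  uniprot id\n\n1dbh    Q07889\n1e43    P00692\n")
pdb_fasta = io.StringIO(">1dbh\nEQTYYDLV\nKAF\n>1e43\nMKGQ\n")

pairs = list(read_id_pairs(id_table))
pdb_seqs = read_fasta(pdb_fasta)
```

With real files, the `io.StringIO` objects would be replaced by `open("input_file.txt")`, `open("pdb.fasta")` and `open("uniprot.fasta")`.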

Usually I use the following code for pairwise alignment and sequence identity calculation.

library("seqinr")
seq1 <- "MDEKRRAQHNEVERRRRDKINNWIVQLSKIIPDSSMESTKSGQSKGGILSKASDYIQELRQSNHR"
seq2<- "MKGQQKTAETEEGTVQIQEGAVATGEDPTSVAIASIQSAATFPDPNVKYVFRTENGGQVM"
library(Biostrings)
globalAlign<- pairwiseAlignment(seq1, seq2)
pid(globalAlign, type = "PID3")

I need to print the output like this:

pdbid   uniprotid  seq.identity
1dbh    Q07889      99
1e43    P00692      80
1f1s    Q53591      56

How should I change the code above? Your help would be greatly appreciated!

Answer 1

This code should hopefully do what you are after:

class test():

    def get_seq(self, pdb,fasta_file): # Get sequences
        from Bio.PDB.PDBParser import PDBParser
        from Bio import SeqIO
        aa = {'ARG':'R','HIS':'H','LYS':'K','ASP':'D','GLU':'E','SER':'S','THR':'T','ASN':'N','GLN':'Q','CYS':'C','SEC':'U','GLY':'G','PRO':'P','ALA':'A','ILE':'I','LEU':'L','MET':'M','PHE':'F','TRP':'W','TYR':'Y','VAL':'V'}
        p=PDBParser(PERMISSIVE=1)
        structure_id="%s" % pdb[:-4]
        structure=p.get_structure(structure_id, pdb)
        residues = structure.get_residues()
        seq_pdb = ''
        for res in residues:
            res = res.get_resname() 
            if res in aa:
                seq_pdb = seq_pdb+aa[res]           

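        handle = open(fasta_file, "r")  # the "U" flag is no longer accepted by recent Python 3 releases
        for record in SeqIO.parse(handle, "fasta"):
            seq_fasta = record.seq  # if the file holds several records, the last one is kept
        handle.close()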
        self.seq_aln(seq_pdb,seq_fasta)

    def seq_aln(self,seq1,seq2): # Align the sequences
        from Bio import pairwise2
        from Bio.SubsMat import MatrixInfo as matlist

        matrix = matlist.blosum62
        gap_open = -10
        gap_extend = -0.5

        alns = pairwise2.align.globalds(seq1, seq2, matrix, gap_open, gap_extend)
        top_aln = alns[0]
        aln_seq1, aln_seq2, score, begin, end = top_aln
        with open('aln.fasta', 'w') as outfile:
            outfile.write('> PDB_seq\n'+str(aln_seq1)+'\n> Uniprot_seq\n'+str(aln_seq2))
        print(aln_seq1 + '\n' + aln_seq2)
        self.seq_id('aln.fasta')

    def seq_id(self, aln_fasta): # Get sequence identity
        from Bio import AlignIO

        # Read back the alignment file written by seq_aln
        with open(aln_fasta) as input_handle:
            alignment = AlignIO.read(input_handle, "fasta")
        reference = alignment[0].seq  # first record, i.e. the PDB sequence
        for record in alignment:
            hits = 0  # counts identity hits against the first sequence
            for j, amino_acid in enumerate(record.seq):
                if amino_acid != '-' and amino_acid == reference[j]:
                    hits += 1
            gap_strip = str(record.seq).replace('-', '')

            # Identity relative to this record's own ungapped length
            percent = 100 * hits // len(gap_strip)
            print(record.id + ' ' + str(percent))


a = test()
a.get_seq('1DBH.pdb','Q07889.fasta')

Output:

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------EQTYYDLVKAF-AEIRQYIRELNLIIKVFREPFVSNSKLFSANDVENIFSRIVDIHELSVKLLGHIEDTVE-TDEGSPHPLVGSCFEDLAEELAFDPYESYARDILRPGFHDRFLSQLSKPGAALYLQSIGEGFKEAVQYVLPRLLLAPVYHCLHYFELLKQLEEKSEDQEDKECLKQAITALLNVQSG-EKICSKSLAKRRLSESA-------------AIKK-NEIQKNIDGWEGKDIGQCCNEFI-EGTLTRVGAKHERHIFLFDGL-ICCKSNHGQPRLPGASNAEYRLKEKFF-RKVQINDKDDTNEYKHAFEIILKDENSVIFSAKSAEEKNNW-AALISLQYRSTL---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
MQAQQLPYEFFSEENAPKWRGLLVPALKKVQGQVHPTLESNDDALQYVEELILQLLNMLCQAQPRSASDVEERVQKSFPHPIDKWAIADAQSAIEKRKRRNPLSLPVEKIHPLLKEVLGYKIDHQVSVYIVAVLEYISADILKLVGNYVRNIRHYEITKQDIKVAMCADKVLMDMFHQDVEDINILSLTDEEPSTSGEQTYYDLVKAFMAEIRQYIRELNLIIKVFREPFVSNSKLFSANDVENIFSRIVDIHELSVKLLGHIEDTVEMTDEGSPHPLVGSCFEDLAEELAFDPYESYARDILRPGFHDRFLSQLSKPGAALYLQSIGEGFKEAVQYVLPRLLLAPVYHCLHYFELLKQLEEKSEDQEDKECLKQAITALLNVQSGMEKICSKSLAKRRLSESACRFYSQQMKGKQLAIKKMNEIQKNIDGWEGKDIGQCCNEFIMEGTLTRVGAKHERHIFLFDGLMICCKSNHGQPRLPGASNAEYRLKEKFFMRKVQINDKDDTNEYKHAFEIILKDENSVIFSAKSAEEKNNWMAALISLQYRSTLERMLDVTMLQEEKEEQMRLPSADVYRFAEPDSEENIIFEENMQPKAGIPIIKAGTVIKLIERLTYHMYADPNFVRTFLTTYRSFCKPQELLSLIIERFEIPEPEPTEADRIAIENGDQPLSAELKRFRKEYIQPVQLRVLNVCRHWVEHHFYDFERDAYLLQRMEEFIGTVRGKAMKKWVESITKIIQRKKIARDNGPGHNITFQSSPPTVEWHISRPGHIETFDLLTLHPIEIARQLTLLESDLYRAVQPSELVGSVWTKEDKEINSPNLLKMIRHTTNLTLWFEKCIVETENLEERVAVVSRIIEILQVFQELNNFNGVLEVVSAMNSSPVYRLDHTFEQIPSRQKKILEEAHELSEDHYKKYLAKLRSINPPCVPFFGIYLTNILKTEEGNPEVLKRHGKELINFSKRRKVAEITGEIQQYQNQPYCLRVESDIKRFFENLNPMGNSMEKEFTDYLFNKSLEIEPRNPKPLPRFPKKYSYPLKSPGVRPSNPRPGTMRHPTPLQQEPRKISYSRIPESETESTASAPNSPRTPLTPPPASGASSTTDVCSVFDSDHSSPFHSSNDTVFIQVTLPHGPRSASVSSISLTKGTDEVPVPPPVPPRRRPESAPAESSPSKIMSKHLDSPPAIPPRQPTSKAYSPRYSISDRTSISDPPESPPLLPPREPVRTPDVFSSSPLHLQPPPLGKKSDHGNAFFPNSPSPFTPPPPQTPSPHGTRRHLPSPPLTQEVDLHSIAGPPVPPRQSTSQHIPKLPPKTYKREHTHPSMHRDGPPLLENAHSS
PDB_seq 100 # pdb to itself would obviously have 100% identity
Uniprot_seq 24 # pdb sequence has 24% identity to the uniprot sequence
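The two numbers above differ only in their denominator: the same count of identical positions is divided by each record's own ungapped length, so the long UniProt sequence yields 24 while the short PDB sequence yields 100. A minimal sketch of that calculation with a selectable denominator follows; the function name `percent_identity` and the short aligned strings are made up for illustration.

```python
def percent_identity(aligned_a, aligned_b, denominator="shorter"):
    """Percent identity of two gapped, equal-length aligned sequences.

    denominator: "first", "second" or "shorter" selects which ungapped
    sequence length the match count is divided by.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # A position counts as identical only if both residues match
    # and neither of them is a gap.
    matches = sum(1 for x, y in zip(aligned_a, aligned_b)
                  if x == y and x != "-")
    len_a = len(aligned_a.replace("-", ""))
    len_b = len(aligned_b.replace("-", ""))
    lengths = {"first": len_a, "second": len_b, "shorter": min(len_a, len_b)}
    return 100.0 * matches / lengths[denominator]

# Made-up toy alignment: a short fragment against a longer sequence
aln_pdb = "--EQTY-DLV"
aln_uni = "MQEQTYYDLV"

by_shorter = percent_identity(aln_pdb, aln_uni, "shorter")  # 7 matches / 7 residues
by_second = percent_identity(aln_pdb, aln_uni, "second")    # 7 matches / 10 residues
```

Dividing by the shorter length is usually the more meaningful choice when a PDB entry covers only a fragment of the UniProt protein; to my knowledge this is also the denominator used by `pid(..., type = "PID3")` in Biostrings.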

To process your input file, put the a.get_seq() call inside a for loop that reads the text file line by line.

Edit

Replace the seq_id function with this:

    def seq_id(self, aln_fasta):
        from Bio import SeqIO

        record_iterator = SeqIO.parse(aln_fasta, "fasta")
        first_record = next(record_iterator)
        print('%s has a length of %d' % (first_record.id, len(str(first_record.seq).replace('-', ''))))
        second_record = next(record_iterator)
        print('%s has a length of %d' % (second_record.id, len(str(second_record.seq).replace('-', ''))))
        lengths = [len(str(first_record.seq).replace('-', '')), len(str(second_record.seq).replace('-', ''))]
        if lengths.index(min(lengths)) == 0:
            # If both sequences have the same length, the PDB sequence is taken as the shortest
            print('PDB sequence has the shortest length')
        else:
            print('UniProt sequence has the shortest length')
        identities = 0
        for i, v in enumerate(first_record.seq):
            if v == '-':
                continue  # gap positions never count as identities
            if v == second_record.seq[i]:
                identities += 1
        # Multiply by 100.0 before dividing so the result is not truncated by integer division
        print('Sequence identity = %.2f percent' % (100.0 * identities / min(lengths)))

To pass the arguments to the class, use:

with open('input_file.txt', 'r') as infile:
    next(infile)
    next(infile)  # Going by your input file: skip the header line and the blank line after it
    for line in infile:
        segs = line.split()
        a.get_seq(segs[0] + '.pdb', segs[1] + '.fasta')
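Note that the methods above print their results rather than return them, so producing the three-column table from the question requires a function that returns the identity value. A rough sketch of that idea is below; `identity_table`, `format_table` and the stand-in `fake_identity` are hypothetical names, and the identity values are simply the illustrative numbers from the question, not computed results.

```python
def identity_table(pairs, identity_fn):
    """Build (pdb_id, uniprot_id, identity) rows.

    identity_fn is any callable taking (pdb_id, uniprot_id) and
    returning a percent identity; it is a placeholder for a version
    of the alignment code that returns its result instead of printing.
    """
    return [(pdb_id, uniprot_id, identity_fn(pdb_id, uniprot_id))
            for pdb_id, uniprot_id in pairs]

def format_table(rows):
    """Format the rows as a tab-separated table with a header line."""
    lines = ["pdbid\tuniprotid\tseq.identity"]
    for pdb_id, uniprot_id, identity in rows:
        lines.append("%s\t%s\t%.0f" % (pdb_id, uniprot_id, identity))
    return "\n".join(lines)

# Stand-in for the real alignment step, using the illustrative
# values from the question (not real results)
def fake_identity(pdb_id, uniprot_id):
    return {"1dbh": 99.0, "1e43": 80.0}[pdb_id]

rows = identity_table([("1dbh", "Q07889"), ("1e43", "P00692")], fake_identity)
table = format_table(rows)
print(table)
```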

Answer 2

Maybe something like this; a reproducible example (e.g., short files posted online) would help...

library(Biostrings)
pdb = readAAStringSet("pdb.fasta")
uniprot = readAAStringSet("uniprot.fasta")

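This reads all the sequences into two objects. pairwiseAlignment accepts a vector as its first (query) argument, so if you want to align every pdb sequence against every uniprot sequence, pre-allocate a result matrix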

pids = matrix(numeric(), length(uniprot), length(pdb),
              dimnames=list(names(uniprot), names(pdb)))

and then do the calculation

for (i in seq_along(uniprot)) {
    globalAlignment = pairwiseAlignment(pdb, uniprot[i])
    pids[i,] = pid(globalAlignment)
}
