使用biopython对应于蛋白质序列中缺失残基的缺口字符

时间:2017-05-24 06:54:39

标签: biopython pdb

我从pdb中提取序列而没有遗漏残基,但我需要一个具有缺口(短划线字符)替换缺失残基的序列。我怎么能这样做?

谢谢

1 个答案:

答案 0 :(得分:0)

我是通过解析pdb的ATOM记录来获取现有残留物并通过解析REMARK 465以获取缺失残留物来手动完成的:

pname = '4g5j.pdb' #downloaded file from PDB

fin = open(pname,'r')
content = fin.readlines()
fin.close()

res = []
mis_res = []

print('CHAIN A will be used for ALIGNMENT')
het_chain = 'A'

for i,line in enumerate(content):

  if line[0:4] == 'ATOM':

    split = [line[:6], line[6:11], line[12:16], line[17:20], line[21], line[22:26], line[30:38], line[38:46], line[46:54]]

    if split[4] != het_chain:
      continue
    res.append(int(split[5]))

for i,line in enumerate(content):

  if line[0:10] == 'REMARK 465':

    split = [line[:10], line[19], line[21:26]]

    if split[1] == het_chain:

      mis_res.append(int(split[2]))

resindexes = sorted(list(set(sorted(res))))

missed_resindexes = sorted(list(set(mis_res)))
missed_resindexes = [el for el in missed_resindexes if el not in resindexes]

all_indexes = sorted(resindexes+missed_resindexes)
print(len(all_indexes))

#here you should have your real sequence!
real_seq = 'GSMGEAPNQALLRILKETEFKKIKVLGSGAFGTVYKGLWIPEGEKVKIPVAIKELREATSPKANKEILDEAYVMASVDNPHVCRLLGICLTSTVQLITQLMPFGCLLDYVREHKDNIGSQYLLNWCVQIAKGMNYLEDRRLVHRDLAARNVLVKTPQHVKITDFGLAKLLGAEEKEYHAEGGKVPIKWMALESILHRIYTHQSDVWSYGVTVWELMTFGSKPYDGIPASEISSILEKGERLPQPPICTIDVYMIMVKCWMIDADSRPKFRELIIEFSKMARDPQRYLVIQGDERMHLPSPTDSNFYRALMDEEDMDDVVDADEYLIPQQG'

missed_seq = real_seq[:]

for i,el in enumerate(real_seq):
  if all_indexes[i] in missed_resindexes:
    print(i)
    missed_seq = missed_seq[:i]+'-'+missed_seq[i+1:]

print(missed_seq)

输出: --- GEAPNQALLRILKETEFKKIKVLGS ---- TVYKGLWIPEGEKVKIPVAIKE ---------- ------ KEILDEAYVMASVDNPHVCRLLGICLTSTVQLITQLMPFGCLLDYVREHKDNIGSQYLLNWCVQIAKGMNYLEDRRLVHRDLAARNVLVKTPQHVKITDFGLAKLLGAEEKEYHAEGGKVPIKWMALESILHRIYTHQSDVWSYGVTVWELMTFGSKPYDGIPASEISSILEKGERLPQPPICTIDVYMIMVKCWMIDADSRPKFRELIIEFSKMARDPQRYLVIQGDERMHLPSPTDSNFYRALMDEEDMDDVVDADEY