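I am trying to merge two text (PDB) files. One (the larger one) contains the complete data set describing a protein; the second (much smaller) contains only a small set of records in which just a small part (the coordinates) has changed.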
Example:
Base file (excerpt):
ATOM 605 CD2 LEU A 92 11.727 14.051 55.011 1.00 75.51 4pxz C
ATOM 606 N ARG A 93 10.555 10.636 58.260 1.00 62.79 4pxz N
ATOM 607 CA ARG A 93 11.357 9.429 58.493 1.00 59.89 4pxz C
ATOM 608 C ARG A 93 10.429 8.207 58.562 1.00 62.83 4pxz C
ATOM 609 O ARG A 93 10.760 7.168 57.994 1.00 61.39 4pxz O
ATOM 610 CB ARG A 93 12.236 9.564 59.757 1.00 58.23 4pxz C
ATOM 611 CG ARG A 93 13.088 8.333 60.120 1.00 60.51 4pxz C
ATOM 612 CD ARG A 93 13.985 7.822 58.995 1.00 61.21 4pxz C
ATOM 613 NE ARG A 93 14.503 6.485 59.295 1.00 60.36 4pxz N
ATOM 614 CZ ARG A 93 15.012 5.642 58.400 1.00 66.21 4pxz C
ATOM 615 NH1 ARG A 93 15.074 5.979 57.116 1.00 52.54 4pxz N
ATOM 616 NH2 ARG A 93 15.455 4.453 58.780 1.00 48.93 4pxz N
ATOM 617 N THR A 94 9.247 8.357 59.192 1.00 60.68 4pxz N
ATOM 618 CA THR A 94 8.227 7.305 59.271 1.00 59.92 4pxz C
Secondary file (supplies one set of replacement coordinates):
ATOM 39 CA ARG A 93 11.357 9.429 58.493 1.00 59.89 hatp C
ATOM 40 CB ARG A 93 12.236 9.564 59.757 1.00 58.23 hatp C
ATOM 41 CG ARG A 93 11.569 9.166 61.087 1.00 60.51 hatp C
ATOM 42 CD ARG A 93 12.319 8.102 61.886 1.00 61.21 hatp C
ATOM 43 NE ARG A 93 11.978 6.754 61.425 1.00 60.36 hatp N
ATOM 44 CZ ARG A 93 11.731 5.714 62.217 1.00 66.21 hatp C
ATOM 45 NH2 ARG A 93 11.430 4.535 61.694 1.00 48.93 hatp N
ATOM 46 NH1 ARG A 93 11.793 5.843 63.538 1.00 52.54 hatp N
Expected result (changed coordinates are marked with -> <-):
ATOM 604 CD1 LEU A 92 9.685 13.033 54.000 1.00 73.10 4pxz C
ATOM 605 CD2 LEU A 92 11.727 14.051 55.011 1.00 75.51 4pxz C
ATOM 606 N ARG A 93 10.555 10.636 58.260 1.00 62.79 4pxz N
ATOM 607 CA ARG A 93 -> 11.357 9.429 58.493<- 1.00 59.89 4pxz C
ATOM 608 C ARG A 93 10.429 8.207 58.562 1.00 62.83 4pxz C
ATOM 609 O ARG A 93 10.760 7.168 57.994 1.00 61.39 4pxz O
ATOM 610 CB ARG A 93 -> 12.236 9.564 59.757<- 1.00 58.23 4pxz C
ATOM 611 CG ARG A 93 -> 11.569 9.166 61.087<- 1.00 60.51 4pxz C
ATOM 612 CD ARG A 93 -> 12.319 8.102 61.886<- 1.00 61.21 4pxz C
ATOM 613 NE ARG A 93 -> 11.978 6.754 61.425<- 1.00 60.36 4pxz N
ATOM 614 CZ ARG A 93 -> 11.731 5.714 62.217<- 1.00 66.21 4pxz C
ATOM 615 NH1 ARG A 93 -> 11.793 5.843 63.538<- 1.00 52.54 4pxz N
ATOM 616 NH2 ARG A 93 -> 11.430 4.535 61.694<- 1.00 48.93 4pxz N
ATOM 617 N THR A 94 9.247 8.357 59.192 1.00 60.68 4pxz N
ATOM 618 CA THR A 94 8.227 7.305 59.271 1.00 59.92 4pxz C
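I tried to do it as follows:
1. Build a list for each of the two files, appending every line as a single entry.
2. Extract the atom name, residue name, chain and residue number (for example CD1 LEU A 92) from both files and append them to another list.
3. Compare the extracted lists.
4. Write a file containing the mixed list built from points 1 to 3.
Code: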
import re

aminoacid_pattern = re.compile(r"\w.{2,3}.\b(\w[A-Z]\w*)\b\s.\s\d+")
coords_pattern = re.compile(r"\w.{2,3}.\b(\w[A-Z]\w*)\b\s.\s\d+")


class fileSaver:
    protein = "4pxzclean.pdb"
    flexres = "ARGA93.pdb.tmp"

    def __init__(self):
        pass

    def aminoacid_to_substitute(self, flexres, data = []):
        with open(flexres, 'r') as flex:
            for line in flex:
                if aminoacid_pattern != None:
                    data.append(line)
        return data

    def parse_rigid(self, rigidprot, test = []):
        with open(rigidprot, 'r') as rigid:
            for line in rigid:
                if aminoacid_pattern != None:
                    test.append(line)
        return test


class fileComparer:
    def __init__(self):
        pass

    def compare_data(self, data_flex, data_rigid, cleanflex = [], cleanrigid = []):
        for el in data_flex:
            if aminoacid_pattern != None:
                cleanflex.append(re.findall(r".\w\s+\w.{2,3}\s\w\s*\d{2,3}",str(el)))
        for el in data_rigid:
            if aminoacid_pattern != None:
                cleanrigid.append(re.findall(r".\w\s+\w.{2,3}\s\w\s*\d{2,3}",str(el)))
        with open("test.txt", 'a+') as test:
            for rig_el in data_rigid:
                for flex_el in data_flex:
                    for rg_el in cleanrigid:
                        if rg_el not in cleanflex:
                            test.write(rig_el)
                        if rg_el in cleanflex:
                            test.write(flex_el)


if __name__ == '__main__':
    initialize = fileSaver()
    flex = initialize.aminoacid_to_substitute("ARGA93.pdb.tmp")
    rigid = initialize.parse_rigid("4pxzclean.pdb")
    comparer = fileComparer()
    comparer.compare_data(flex,rigid)
Unfortunately, this produces an endlessly long file without any of the changes applied. Can you tell me what went wrong?
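For reference, the four steps listed above (read both files, extract atom name, residue name, chain and residue number, compare, write the mixed result) can be sketched with a dictionary lookup instead of nested loops. This is only a minimal sketch under stated assumptions, not a definitive implementation: the function names `atom_key` and `merge_coordinates` are hypothetical, and the code assumes the records are whitespace-separated exactly as in the excerpts shown here. Real PDB files use fixed-width columns in which neighbouring fields can run together, so column slicing may be needed there instead of `split()`.

```python
def atom_key(line):
    """Return (atom name, residue name, chain, residue number), or None.

    Assumption: fields are separated by whitespace, as in the excerpts above.
    """
    fields = line.split()
    if len(fields) < 9 or fields[0] not in ("ATOM", "HETATM"):
        return None
    return tuple(fields[2:6])


def merge_coordinates(base_lines, replacement_lines):
    """Copy x, y, z from replacement records into the matching base records."""
    # Map each identifying key to the three coordinate fields of the replacement.
    new_coords = {}
    for line in replacement_lines:
        key = atom_key(line)
        if key is not None:
            new_coords[key] = line.split()[6:9]

    merged = []
    for line in base_lines:
        key = atom_key(line)
        if key in new_coords:
            fields = line.split()
            fields[6:9] = new_coords[key]
            # Rejoining with single spaces discards any column alignment.
            line = " ".join(fields) + "\n"
        merged.append(line)
    return merged


if __name__ == "__main__":
    # Two base records and one replacement record taken from the excerpts above.
    base = [
        "ATOM 610 CB ARG A 93 12.236 9.564 59.757 1.00 58.23 4pxz C\n",
        "ATOM 611 CG ARG A 93 13.088 8.333 60.120 1.00 60.51 4pxz C\n",
    ]
    replacement = [
        "ATOM 41 CG ARG A 93 11.569 9.166 61.087 1.00 60.51 hatp C\n",
    ]
    print("".join(merge_coordinates(base, replacement)), end="")
```

With the sample lines above, the CB record is printed unchanged and the CG record keeps its serial number 611 and the 4pxz tag but carries the coordinates 11.569 9.166 61.087. Each base line is written exactly once, so the output has the same number of lines as the base file.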