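I would like to match PDB files from the Protein Data Bank to the canonical amino-acid (AA) sequence of a protein as shown in COSMIC or UniProt. Specifically, I need to pull the backbone alpha-carbon (CA) atoms and their xyz positions out of the PDB file, and I also need their actual positions in the protein sequence. For structure 3GFT (KRAS, UniProt accession P01116) this is easy: I can simply take the resSeq number. For some other proteins, however, I cannot figure out how to do this.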
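For example, in structure 2ZHQ (protein F2, UniProt accession P00734), the resSeq numbers "1" and "14" are repeated and the residues differ only in the iCode (insertion code) entry. In addition, the iCode entries are not in lexical order, so it is hard to tell in which order the residues should be extracted.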
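One way to keep such residues apart is to read the iCode column (column 27 of an ATOM record) together with resSeq and to rely on file order rather than on sorting. The following is a minimal base-R sketch of that idea, not a tested solution for 2ZHQ. The helper names `make_atom` and `parse_ca` and all residue names and coordinates are made up for illustration, altLoc handling is left out, and it assumes that ATOM records within a chain are listed from N- to C-terminus, as is usual.

```r
# Hypothetical helper: builds fixed-column PDB ATOM records for illustration only.
make_atom <- function(serial, res_name, chain, res_seq, i_code, x, y, z) {
  sprintf("ATOM  %5d %-4s%1s%3s %1s%4d%1s   %8.3f%8.3f%8.3f",
          as.integer(serial), " CA ", " ", res_name, chain,
          as.integer(res_seq), i_code, x, y, z)
}

# Made-up records: resSeq 1 occurs four times and differs only in iCode.
records <- make_atom(
  serial   = 1:5,
  res_name = c("THR", "PHE", "GLY", "SER", "GLY"),
  chain    = "L",
  res_seq  = c(1, 1, 1, 1, 2),
  i_code   = c("H", "G", "F", " ", " "),
  x = c(10.0, 11.5, 13.0, 14.5, 16.0),
  y = c(0, 1, 2, 3, 4),
  z = c(5, 5, 5, 5, 5)
)

# Hypothetical parser: one row per C-alpha atom of the requested chain, in file order.
parse_ca <- function(lines, chain_required = "A") {
  atoms <- lines[substr(lines, 1, 6) == "ATOM  " & substr(lines, 13, 16) == " CA "]
  atoms <- atoms[substr(atoms, 22, 22) == chain_required]
  ca <- data.frame(
    FileOrder = seq_along(atoms),
    Residue   = substr(atoms, 18, 20),
    ResSeq    = as.integer(substr(atoms, 23, 26)),
    ICode     = trimws(substr(atoms, 27, 27)),
    X         = as.numeric(substr(atoms, 31, 38)),
    Y         = as.numeric(substr(atoms, 39, 46)),
    Z         = as.numeric(substr(atoms, 47, 54)),
    stringsAsFactors = FALSE
  )
  # resSeq plus iCode identifies a residue within a chain
  ca$ResidueKey <- paste0(ca$ResSeq, ca$ICode)
  ca
}

ca <- parse_ca(records, "L")
print(ca[, c("FileOrder", "Residue", "ResidueKey", "X", "Y", "Z")])
```

The `FileOrder` column preserves the order in which the residues appear, so no assumption about how iCode letters sort is needed.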
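It gets even worse with structure 3V5Q (UniProt accession Q16288). For most of the protein, the resSeq numbers match the actual amino-acid positions given by a source such as COSMIC or UniProt. After position 711, however, the numbering jumps to position 730. REMARK 465 (which lists the missing residues) shows that 726-729 are missing for chain A. Yet after matching the structure to the protein sequence, those amino acids are actually 712-715.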
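Jumps like this can at least be located automatically before deciding how to handle them. The sketch below uses base R only; the resSeq values are illustrative and merely echo the 711 to 730 jump described above, they are not read from 3V5Q.

```r
# Illustrative resSeq values for consecutive C-alpha atoms (not read from a file).
res_seq <- c(709L, 710L, 711L, 730L, 731L)

# A difference greater than 1 between neighbours marks a numbering jump.
jump <- diff(res_seq)
is_gap <- jump > 1

gaps <- data.frame(
  After   = res_seq[-length(res_seq)][is_gap],  # last residue number before the jump
  Before  = res_seq[-1][is_gap],                # first residue number after the jump
  Skipped = jump[is_gap] - 1L                   # how many numbers the jump skips
)
print(gaps)
```

As the 3V5Q case shows, the size of a numbering jump does not have to equal the number of residues that are really missing, so `Skipped` only flags where to look.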
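The code I have attached works for the simple 3GFT example, but if someone is an expert on PDB files and can help me figure out the rest, I would be much obliged.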
library(gdata)

# Reads a PDB file and returns one row per C-alpha atom of the requested chain.
# Note: the AtomCount column holds the resSeq number and SideChain holds the chain ID.
get.positions <- function(sourcefile, chain_required = "A") {
  N <- 10^5
  AACount <- 0
  positions <- data.frame(Residue = rep(NA, N), AtomCount = rep(0, N), SideChain = rep(NA, N),
                          XCoord = rep(0, N), YCoord = rep(0, N), ZCoord = rep(0, N),
                          stringsAsFactors = FALSE)
  visited <- integer()
  filedata <- readLines(sourcefile, n = -1)
  for (i in seq_along(filedata)) {
    input <- filedata[i]
    id <- substr(input, 1, 4)
    if (id == "ATOM") {
      type <- substr(input, 14, 15)
      if (type == "CA") {
        resSerial <- strtoi(substr(input, 7, 11))
        residue <- substr(input, 18, 20)
        type_of_chain <- substr(input, 22, 22)
        resSeq <- strtoi(substr(input, 23, 26))
        altLoc <- substr(input, 17, 17)
        # Skip residues numbered zero or below (and unparsable numbers)
        if (!is.na(resSeq) && resSeq >= 1) {
          # Keep the requested chain only, and only the first alternate location
          if (type_of_chain == chain_required && !(resSerial %in% visited) &&
              (altLoc == " " || altLoc == "A")) {
            visited <- c(visited, resSerial)
            AACount <- AACount + 1
            x <- as.numeric(substr(input, 31, 38))
            y <- as.numeric(substr(input, 39, 46))
            z <- as.numeric(substr(input, 47, 54))
            # c() coerces everything to character; numeric columns are restored below
            positions[AACount, ] <- c(residue, resSeq, type_of_chain, x, y, z)
          }
        }
      }
    }
  }
  # seq_len() returns an empty data frame when no atom matched
  positions <- positions[seq_len(AACount), ]
  positions[, 2] <- as.numeric(positions[, 2])
  positions[, 4] <- as.numeric(positions[, 4])
  positions[, 5] <- as.numeric(positions[, 5])
  positions[, 6] <- as.numeric(positions[, 6])
  return(positions)
}

# The function has to be defined before it is called
#answer <- get.positions("http://www.pdb.org/pdb/files/2ZHQ.pdb", "L")
answer <- get.positions("http://www.pdb.org/pdb/files/3GFT.pdb", "A")
Answer 0 (score: 1)
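You may want to move this question to www.biostars.org and also write to help@uniprot.org (you do know that these sequences are already linked at the database level?). In any case, when writing to help@uniprot.org, ask for Jules Jacobsen, as he is the resident UniProt expert on linking PDB structures to the canonical uniprot.org sequences.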
Answer 1 (score: 1)
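Here is one approach. It requires the bio3d R package (http://thegrantlab.org/bio3d/) and the MUSCLE alignment executable, which I installed by following the "additional utilities" instructions at http://thegrantlab.org/bio3d/tutorials/installing-bio3d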
The following example code seems to work for the three cases you listed.
library(bio3d)
## Download PDB file with given 'id' (can also read from online)
id <- "3GFT" #"3V5Q"
file.name <- get.pdb(id)
pdb <- read.pdb(file.name)
pdb.seq <- pdbseq(pdb, atom.select(pdb, chain="A", elety="CA"))
## Get UniProt identifier and sequence
pdb.ano <- pdb.annotate(id, "db_id")
uni.seq <- get.seq(pdb.ano)
## Align sequences to define correspondences
aln <- seqaln( seqbind( pdb.seq, uni.seq), id=c(file.name, unlist(pdb.ano)) )
## Read aligned coordinate data (all the info you want is in here)
pdbs <- read.fasta.pdb(aln)
## Columns: alignment column, PDB residue, UniProt residue, PDB resno, x, y, z
answer2 <- cbind( 1:ncol(pdbs$ali), t(pdbs$ali),
                  pdbs$resno[1,], matrix(pdbs$xyz[1,], ncol=3, byrow=TRUE) )
head(answer2)
[1,] "1" "M" "M" "1" "62.935" "97.579" "30.223"
[2,] "2" "T" "T" "2" "63.155" "95.525" "27.079"
[3,] "3" "E" "E" "3" "65.289" "96.895" "24.308"
[4,] "4" "Y" "Y" "4" "64.899" "96.22" "20.615"
[5,] "5" "K" "K" "5" "67.593" "96.715" "18.023"
[6,] "6" "L" "L" "6" "65.898" "97.863" "14.816"
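Note that the first column above is the alignment column index. It coincides with the UniProt position only when the aligned UniProt sequence contains no gaps, for instance when the structure carries no extra residues such as an expression tag. A minimal base-R sketch of deriving the UniProt position from two aligned sequences follows; the two aligned strings are hypothetical and are not taken from a real entry.

```r
# Hypothetical aligned sequences ("-" marks a gap); not taken from a real entry.
pdb_aln <- strsplit("GSMTEY--KL", "")[[1]]  # structure has a 2-residue tag and an unresolved loop
uni_aln <- strsplit("--MTEYAAKL", "")[[1]]  # canonical sequence

# Count non-gap characters up to each alignment column; gap columns get NA.
uni_pos <- ifelse(uni_aln == "-", NA, cumsum(uni_aln != "-"))
pdb_idx <- ifelse(pdb_aln == "-", NA, cumsum(pdb_aln != "-"))

mapping <- data.frame(
  AlnColumn  = seq_along(pdb_aln),
  PdbIndex   = pdb_idx,   # index into the C-alpha atoms read from the PDB file
  PdbAA      = pdb_aln,
  UniProtAA  = uni_aln,
  UniProtPos = uni_pos,   # position in the canonical sequence
  stringsAsFactors = FALSE
)

# Keep only columns where a resolved residue pairs with a canonical position.
observed <- mapping[!is.na(mapping$PdbIndex) & !is.na(mapping$UniProtPos), ]
print(observed)
```

In this toy case the seventh resolved residue maps to canonical position 7 even though it sits in alignment column 9, which is the kind of offset the question describes.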
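If you want your amino acids listed in three-letter code, bio3d provides the aa123() function for that; aa321() performs the reverse conversion from three-letter to one-letter codes.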
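Because cbind() on mixed input yields a character matrix, it may be convenient to convert the result into a data frame with typed columns before further analysis. The sketch below uses base R only and hard-codes the first three rows of the output shown above so that it runs on its own. The column names and the deliberately partial lookup vector `three_letter` are illustrative choices, not part of bio3d.

```r
# First three rows of the output shown above, hard-coded as a character matrix.
answer2 <- rbind(
  c("1", "M", "M", "1", "62.935", "97.579", "30.223"),
  c("2", "T", "T", "2", "63.155", "95.525", "27.079"),
  c("3", "E", "E", "3", "65.289", "96.895", "24.308")
)

# Illustrative column names; numeric columns are converted explicitly.
ca_table <- data.frame(
  AlnColumn = as.integer(answer2[, 1]),
  PdbAA     = answer2[, 2],
  UniProtAA = answer2[, 3],
  ResNo     = as.integer(answer2[, 4]),
  X         = as.numeric(answer2[, 5]),
  Y         = as.numeric(answer2[, 6]),
  Z         = as.numeric(answer2[, 7]),
  stringsAsFactors = FALSE
)

# Drop alignment columns without a resolved residue (resno is NA there).
ca_table <- ca_table[!is.na(ca_table$ResNo), ]

# Partial one-letter to three-letter lookup, covering only the residues in this sample.
three_letter <- c(M = "MET", T = "THR", E = "GLU")
ca_table$PdbAA3 <- unname(three_letter[ca_table$PdbAA])

print(ca_table)
```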