Using a kd-tree

Date: 2018-03-22 00:31:20

Tags: python scipy bioinformatics biopython kdtree

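I am trying to use scipy's kd-tree to query the nearest neighbors of models from a pdb file. I currently have a brute-force implementation that compares the rmsd value of each model against every other model. I would like to use a kd-tree to reduce the time it takes to find each model's nearest neighbors.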

For reference, here is a sample of the pdb file I am working with, which contains multiple models in a single file:

MODEL        5                                                                  
HETATM    1  C1  SIN A   0      13.542  -2.290   0.745  1.00  0.00           C  
HETATM    2  O1  SIN A   0      14.446  -2.652   0.010  1.00  0.00           O  
HETATM    3  O2  SIN A   0      12.378  -2.189   0.395  1.00  0.00           O  
...
TER     627      NH2 A  39                                                      
ENDMDL                                                                          
MODEL        6                                                                  
HETATM    1  C1  SIN A   0      11.762   2.281  -7.835  1.00  0.00           C  
ATOM     26  C   TRP A   2      11.341   6.316  -0.847  1.00  0.00           C  
ATOM     27  O   TRP A   2      11.074   6.179   0.330  1.00  0.00           O  
ATOM     28  CB  TRP A   2      13.182   7.844  -1.538  1.00  0.00           C  
ATOM     29  CG  TRP A   2      12.069   8.524  -2.259  1.00  0.00           C  
...
HETATM  626  HN2 NH2 A  39       3.093   9.404  -6.782  1.00  0.00           H  
TER     627      NH2 A  39                                                      
ENDMDL                                                                          
MODEL        7                                                                  
HETATM    1  C1  SIN A   0     -16.074  -1.515  -4.262  1.00  0.00           C  
HETATM    2  O1  SIN A   0     -16.968  -1.910  -4.992  1.00  0.00           O  
...
ATOM     18  OD1 ASP A   1     -12.877   3.426  -8.525  1.00  0.00           O  
ATOM     19  OD2 ASP A   1     -13.484   1.785  -9.782  1.00  0.00           O  
TER     627      NH2 A  39                                                      
ENDMDL

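My initial attempt was to represent the models as a list in which each model is a list of atom coordinates, and each 3D atom coordinate is itself a list: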

print(model_coord)

[
 [[1.4579, 0.0, 0.0],... ,[-5.5, 21.5529, 23.7390]],
 [[16.5450, 3.3699, 10.1888], ... ,[-0.0963, 24.510883331298828, 20.2952]], 
 [[17.6256, 2.5858, 12.4808],... ,[-11.6052, 13.1031, 23.8958]]
]
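
For context, a nested list of this shape could be built from a multi-model file roughly as follows. This is a minimal sketch, not the parser actually used for the output above: the function name `read_model_coords` is hypothetical, and it assumes the standard fixed-column PDB layout, with x, y and z in columns 31-38, 39-46 and 47-54 of each ATOM/HETATM record.

```python
def read_model_coords(lines):
    """Return one list of [x, y, z] coordinates per MODEL ... ENDMDL block.

    Hypothetical helper, for illustration only.
    """
    models = []
    current = None
    for line in lines:
        record = line[:6].strip()
        if record == "MODEL":
            current = []  # start collecting atoms for a new model
        elif record in ("ATOM", "HETATM") and current is not None:
            # Fixed-width coordinate columns of the PDB format (0-based slices).
            x = float(line[30:38])
            y = float(line[38:46])
            z = float(line[46:54])
            current.append([x, y, z])
        elif record == "ENDMDL" and current is not None:
            models.append(current)  # model finished, store it
            current = None
    return models
```

It could be called on an open file handle, for example `model_coord = read_model_coords(open("models.pdb"))`, where the file name is a placeholder.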

I receive the following error when creating the kdtree object:

kdtree = scipy.spatial.KDTree(model_coord)
  File "/Library/Python/2.7/site-packages/scipy/spatial/kdtree.py", line 235, in __init__
     self.n, self.m = np.shape(self.data)
ValueError: too many values to unpack
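
The traceback points at `self.n, self.m = np.shape(self.data)`, which unpacks exactly two values, so the constructor appears to expect a 2-D array of n points by m dimensions. The nested list has three levels (models, atoms, xyz), which would explain the error. A small sketch with made-up coordinates, assuming every model has the same number of atoms:

```python
import numpy as np

# Made-up stand-in for model_coord: 3 models, 2 atoms each, xyz per atom.
model_coord = [
    [[1.0, 0.0, 0.0], [2.0, 1.0, 0.0]],
    [[1.5, 0.0, 0.0], [2.5, 1.0, 0.0]],
    [[9.0, 9.0, 9.0], [8.0, 8.0, 8.0]],
]

coords = np.asarray(model_coord, dtype=float)
print(coords.shape)  # three values: (n_models, n_atoms, 3)

# Flatten each model into one row so the array becomes n-by-m.
flat = coords.reshape(len(coords), -1)
print(flat.shape)    # two values: (n_models, n_atoms * 3)
```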

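However, converting model_coord to a pandas DataFrame gives me the n-by-m shape required to create the kdtree object, where each row represents a model and each column holds a 3D atom coordinate: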

model_df = pd.DataFrame(model_coord)
print(model_df.to_string())

   0                           1                           2                           ...
0  [1.45799, 0.0, 0.0]         [3.9140, 2.8670, 0.4530]    [7.590, 3.7990, 0.1850]     ...
1  [16.5450, 3.3699, 10.1888]  [15.9148, 1.9402, 13.6552]  [14.4702, 2.6485, 17.0995]  ...
2  [17.6256, 2.5858, 12.4808]  [16.4266, 2.2781, 16.0749]  [12.6480, 2.6846, 16.0066]  ...

This is my attempt to query each model's nearest neighbors within a radius, where epsilon is the radius:

kdtree = scipy.spatial.KDTree(model_df)
for index, model in model_df.iterrows():
    model_nn_dist, model_nn_ids = kdtree.query(model,distance_upper_bound=epsilon)

Because the coordinates are list objects, I receive the following error:

  model_nn_dist, model_nn_ids=kdtree.query(model,distance_upper_bound=epsilon)
File "/Library/Python/2.7/site-packages/scipy/spatial/kdtree.py", line 521, in query
  hits = self.__query(x, k=k, eps=eps, p=p,distance_upper_bound=distance_upper_bound)
File "/Library/Python/2.7/site-packages/scipy/spatial/kdtree.py", line 320, in __query
  side_distances = np.maximum(0,np.maximum(x-self.maxes,self.mins-x))
TypeError: unsupported operand type(s) for -: 'list' and 'list'

I tried to fix this by converting the atom coordinates to numpy arrays, but then I receive this error instead:

  model_nn_dist, model_nn_ids = kdtree.query(model,distance_upper_bound=epsilon)
File "/Library/Python/2.7/site-packages/scipy/spatial/kdtree.py", line 521, in query
  hits = self.__query(x, k=k, eps=eps, p=p, distance_upper_bound=distance_upper_bound)
File "/Library/Python/2.7/site-packages/scipy/spatial/kdtree.py", line 320, in __query
  side_distances = np.maximum(0,np.maximum(x-self.maxes,self.mins-x))
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

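I would like to know whether there is a better approach, or a more suitable data structure, for using a kd-tree to query the nearest neighbors of models or sets of coordinates.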
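
One possible direction, offered as a sketch under assumptions rather than a confirmed solution: if every model has the same N atoms in the same order, each model can be flattened into a single point with 3N numeric coordinates. The Euclidean distance between two such points equals sqrt(N) times the RMSD computed without superposition, so an RMSD radius of epsilon corresponds to a tree radius of epsilon * sqrt(N). The coordinates and the epsilon value below are made up, and the sketch ignores any structural alignment that a brute-force RMSD calculation might perform.

```python
import numpy as np
from scipy.spatial import cKDTree

# Made-up data: 4 models, 3 atoms each (stand-in for model_coord).
model_coord = np.array([
    [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.1, 0.0, 0.0], [1.1, 0.0, 0.0], [0.1, 1.0, 0.0]],
    [[0.0, 0.2, 0.0], [1.0, 0.2, 0.0], [0.0, 1.2, 0.0]],
    [[5.0, 5.0, 5.0], [6.0, 5.0, 5.0], [5.0, 6.0, 5.0]],
])
n_models, n_atoms, _ = model_coord.shape
epsilon = 0.5  # assumed RMSD cutoff, in the same units as the coordinates

# One row per model: a plain float array of shape (n_models, 3 * n_atoms).
points = model_coord.reshape(n_models, -1)
tree = cKDTree(points)

# Convert the RMSD cutoff into a Euclidean radius in the flattened space.
radius = epsilon * np.sqrt(n_atoms)

# All neighbours within the radius; each result also contains the model itself.
neighbours = tree.query_ball_point(points, r=radius)
for i, ids in enumerate(neighbours):
    others = sorted(j for j in ids if j != i)
    print(i, others)

# Nearest other model: k=2 because the closest hit is the query point itself.
dist, idx = tree.query(points, k=2, distance_upper_bound=radius)
nearest_rmsd = dist[:, 1] / np.sqrt(n_atoms)  # inf where nothing lies within the radius
```

With several hundred atoms per model, each row has well over a thousand dimensions, and kd-trees tend to lose their advantage over brute force in high-dimensional spaces, so the speed-up may be small and is worth measuring before relying on it.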

0 answers:

No answers