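I am trying to use scipy's kd-tree to query the nearest neighbors of models from a pdb file. I currently have a brute-force implementation in which I compare each model's RMSD value against every other model. I would like to use a kd-tree to speed up finding each model's nearest neighbors.
For reference, here is a sample of the pdb file I am working with; a single file contains multiple models: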
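For concreteness, the brute-force comparison could look roughly like the sketch below. It assumes every model has the same atoms in the same order and that RMSD is computed directly on the stored coordinates, without superposition; `pairwise_rmsd` and the toy coordinates are hypothetical names and data for illustration only.

```python
import numpy as np

def pairwise_rmsd(model_coord):
    """Brute-force all-vs-all RMSD (no superposition; an assumption)."""
    coords = np.asarray(model_coord, dtype=float)  # (n_models, n_atoms, 3)
    n_models = coords.shape[0]
    rmsd = np.zeros((n_models, n_models))
    for i in range(n_models):
        for j in range(i + 1, n_models):
            diff = coords[i] - coords[j]
            # Mean squared per-atom displacement, then square root.
            value = np.sqrt((diff ** 2).sum(axis=1).mean())
            rmsd[i, j] = rmsd[j, i] = value
    return rmsd

# Made-up toy data: 3 models with 2 atoms each.
toy_models = [
    [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0]],
    [[1.0, 1.0, 0.0], [2.0, 1.0, 0.0]],
    [[5.0, 5.0, 5.0], [6.0, 5.0, 5.0]],
]
toy_rmsd = pairwise_rmsd(toy_models)
```

This does on the order of n² comparisons for n models, which is the cost I would like to reduce.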
MODEL 5
HETATM 1 C1 SIN A 0 13.542 -2.290 0.745 1.00 0.00 C
HETATM 2 O1 SIN A 0 14.446 -2.652 0.010 1.00 0.00 O
HETATM 3 O2 SIN A 0 12.378 -2.189 0.395 1.00 0.00 O
...
TER 627 NH2 A 39
ENDMDL
MODEL 6
HETATM 1 C1 SIN A 0 11.762 2.281 -7.835 1.00 0.00 C
ATOM 26 C TRP A 2 11.341 6.316 -0.847 1.00 0.00 C
ATOM 27 O TRP A 2 11.074 6.179 0.330 1.00 0.00 O
ATOM 28 CB TRP A 2 13.182 7.844 -1.538 1.00 0.00 C
ATOM 29 CG TRP A 2 12.069 8.524 -2.259 1.00 0.00 C
...
HETATM 626 HN2 NH2 A 39 3.093 9.404 -6.782 1.00 0.00 H
TER 627 NH2 A 39
ENDMDL
MODEL 7
HETATM 1 C1 SIN A 0 -16.074 -1.515 -4.262 1.00 0.00 C
HETATM 2 O1 SIN A 0 -16.968 -1.910 -4.992 1.00 0.00 O
...
ATOM 18 OD1 ASP A 1 -12.877 3.426 -8.525 1.00 0.00 O
ATOM 19 OD2 ASP A 1 -13.484 1.785 -9.782 1.00 0.00 O
TER 627 NH2 A 39
ENDMDL
My initial attempt was to represent each model as a list of atom coordinates, with each 3D atom coordinate itself represented as a list:
print(model_coord)
[
[[1.4579, 0.0, 0.0],... ,[-5.5, 21.5529, 23.7390]],
[[16.5450, 3.3699, 10.1888], ... ,[-0.0963, 24.510883331298828, 20.2952]],
[[17.6256, 2.5858, 12.4808],... ,[-11.6052, 13.1031, 23.8958]]
]
I get the following error when creating the kdtree object:
kdtree = scipy.spatial.KDTree(model_coord)
File "/Library/Python/2.7/site-packages/scipy/spatial/kdtree.py", line 235, in __init__
self.n, self.m = np.shape(self.data)
ValueError: too many values to unpack
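The unpacking failure is consistent with model_coord being a three-level nested list: np.shape reports three dimensions, while the line shown in the traceback unpacks exactly two. A small check with made-up toy coordinates (not taken from the pdb file) illustrates this:

```python
import numpy as np

# Hypothetical toy data: 3 models, 2 atoms per model, 3 coordinates per atom.
toy_coord = [
    [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0]],
    [[1.0, 1.0, 0.0], [2.0, 1.0, 0.0]],
    [[5.0, 5.0, 5.0], [6.0, 5.0, 5.0]],
]

shape = np.shape(toy_coord)
print(shape)  # (3, 2, 3)

# Unpacking three values into two names raises ValueError,
# mirroring `self.n, self.m = np.shape(self.data)` in the traceback.
unpack_failed = False
try:
    n, m = shape
except ValueError as exc:
    unpack_failed = True
    print("unpack failed:", exc)
```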
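However, converting model_coord to a pandas DataFrame gives me the n-by-m shape required to create the kdtree object, where each row represents a model and each column holds a 3D atom coordinate: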
model_df = pd.DataFrame(model_coord)
print(model_df.to_string())
0 1 2 ...
0 [1.45799, 0.0, 0.0] [3.9140, 2.8670, 0.4530] [7.590, 3.7990, 0.1850] ...
1 [16.5450, 3.3699, 10.1888] [15.9148, 1.9402, 13.6552] [14.4702, 2.6485, 17.0995] ...
2 [17.6256, 2.5858, 12.4808] [16.4266, 2.2781, 16.0749] [12.6480, 2.6846, 16.0066] ...
Here is my attempt to query each model's nearest neighbors within a radius, where epsilon is the radius:
kdtree = scipy.spatial.KDTree(model_df)
for index, model in model_df.iterrows():
    model_nn_dist, model_nn_ids = kdtree.query(model, distance_upper_bound=epsilon)
Because the coordinates are list objects, I get the following error:
model_nn_dist, model_nn_ids=kdtree.query(model,distance_upper_bound=epsilon)
File "/Library/Python/2.7/site-packages/scipy/spatial/kdtree.py", line 521, in query
hits = self.__query(x, k=k, eps=eps, p=p,distance_upper_bound=distance_upper_bound)
File "/Library/Python/2.7/site-packages/scipy/spatial/kdtree.py", line 320, in __query
side_distances = np.maximum(0,np.maximum(x-self.maxes,self.mins-x))
TypeError: unsupported operand type(s) for -: 'list' and 'list'
I tried to fix this by converting the atom coordinates to numpy arrays, but then I get this error instead:
model_nn_dist, model_nn_ids = kdtree.query(model,distance_upper_bound=epsilon)
File "/Library/Python/2.7/site-packages/scipy/spatial/kdtree.py", line 521, in query
hits = self.__query(x, k=k, eps=eps, p=p, distance_upper_bound=distance_upper_bound)
File "/Library/Python/2.7/site-packages/scipy/spatial/kdtree.py", line 320, in __query
side_distances = np.maximum(0,np.maximum(x-self.maxes,self.mins-x))
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
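I am wondering whether there is a better approach, or a more suitable data structure, for using a kd-tree to query the nearest neighbors of models, i.e. of sets of coordinates.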
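One direction that might work, sketched here under assumptions rather than as a confirmed solution: flatten each model into a single numeric vector of length 3 × n_atoms, so the tree receives a plain 2D float array. This assumes every model has the same atoms in the same order, and that RMSD is computed without superposition; in that case the Euclidean distance between two flattened vectors equals RMSD × sqrt(n_atoms), so a radius query can be expressed in RMSD units. If the RMSD involves optimal superposition, this equivalence does not hold. `neighbors_within_rmsd` and the toy data are hypothetical.

```python
import numpy as np
from scipy.spatial import KDTree

def neighbors_within_rmsd(model_coord, epsilon):
    """Return, for each model, the indices of other models whose
    non-superposed RMSD is within epsilon (an assumption of this sketch)."""
    coords = np.asarray(model_coord, dtype=float)
    if coords.ndim != 3 or coords.shape[2] != 3:
        raise ValueError("expected shape (n_models, n_atoms, 3)")
    n_models, n_atoms, _ = coords.shape

    # One row per model: (n_models, 3 * n_atoms), purely numeric.
    points = coords.reshape(n_models, -1)
    tree = KDTree(points)

    # ||a - b|| = RMSD(a, b) * sqrt(n_atoms) when no superposition is done.
    radius = epsilon * np.sqrt(n_atoms)
    hits = tree.query_ball_point(points, r=radius)

    # Drop each model's own index from its neighbor list.
    return [sorted(j for j in row if j != i) for i, row in enumerate(hits)]

# Made-up toy data: models 0 and 1 are 1.0 apart in RMSD, model 2 is far away.
toy_models = [
    [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0]],
    [[1.0, 1.0, 0.0], [2.0, 1.0, 0.0]],
    [[5.0, 5.0, 5.0], [6.0, 5.0, 5.0]],
]
toy_neighbors = neighbors_within_rmsd(toy_models, epsilon=1.5)
print(toy_neighbors)
```

Note that with hundreds of atoms per model the flattened vectors are very high-dimensional, and kd-trees tend to lose their advantage over brute force as dimensionality grows, so any speed-up would need to be measured.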