Question

蛋白质contact order（CO）是残基间接触的位置。 CO也与蛋白质的折叠速度相关。较高的接触顺序表明较长的折叠时间，并且低接触顺序被建议作为潜在的下坡折叠或没有自由能量障碍的蛋白质折叠的预测因子。

有一个内联网络服务器和一个perl脚本来计算CO here。

有没有办法在python中计算CO？

Answer 1

确实有！使用优秀的MDTraj这不是问题：

import numpy as np
import mdtraj as md

@jit
def abs_contact_order(xyz, atoms_residue, cutoff_nm=.6):
    """Return the absolute contact order."""
    contact_count = 0
    seq_distance_sum = 0
    cutoff_2 = cutoff_nm*cutoff_nm
    N = len(atoms_residue)
    for i in xrange(N):
        for j in xrange(N):
            seq_dist = atoms_residue[j] - atoms_residue[i]
            if seq_dist > 0:
                d = xyz[j] - xyz[i]
                if np.dot(d, d) < cutoff_2:
                    seq_distance_sum += seq_dist 
                    contact_count += 1


    if contact_count==0.:
        print("Warning, no contacts found!")
        return 0.

    print("contact_count: ", contact_count)
    print("seq_distance_sum: ", seq_distance_sum)
    return seq_distance_sum/float(contact_count)

使用：

运行

traj = md.load("test.pdb")
print("Atoms: ", traj.n_atoms)
seq_atoms = np.array([a.residue.resSeq for a in traj.top.atoms], dtype=int)

abs_co = abs_contact_order(traj.xyz[0], seq_atoms, cutoff_nm=0.60)

print("Absolute Contact Order: ", abs_co)
print("Relative Contact Order: ", abs_co/traj.n_residues)

如果没有numba，则需要16秒，只需添加一个@jit，时间就减少到~1s。

原始的perl脚本需要大约8秒

perl contactOrder.pl -r -a -c 6 test.pdb

测试文件和jupyter笔记本是here。

Answer 2

你可以利用这样一个事实，即你知道你的坐标数组在形状上总是(N,3)，所以你可以摆脱数组创建并调用np.dot，这对于真正的效率来说效率较低像这里的小阵列。这样做，我已将您的功能重新编写为abs_contact_order2：

from __future__ import print_function, division

import mdtraj as md
import numpy as np
import numba as nb

@nb.njit
def abs_contact_order(xyz, atoms_residue, cutoff_nm=.6):
    """Return the absolute contact order."""
    contact_count = 0
    seq_distance_sum = 0
    cutoff_2 = cutoff_nm*cutoff_nm
    N = len(atoms_residue)
    for i in xrange(N):
        for j in xrange(N):
            seq_dist = atoms_residue[j] - atoms_residue[i]
            if seq_dist > 0:
                d = xyz[j] - xyz[i]
                if np.dot(d, d) < cutoff_2:
                    seq_distance_sum += seq_dist 
                    contact_count += 1


    if contact_count==0.:
        return 0.

    return seq_distance_sum/float(contact_count)

@nb.njit
def abs_contact_order2(xyz, atoms_residue, cutoff_nm=.6):
    """Return the absolute contact order."""
    contact_count = 0
    seq_distance_sum = 0
    cutoff_2 = cutoff_nm*cutoff_nm
    N = len(atoms_residue)
    for i in xrange(N):
        for j in xrange(N):
            seq_dist = atoms_residue[j] - atoms_residue[i]
            if seq_dist > 0:
                d = 0.0
                for k in xrange(3):
                    d += (xyz[j,k] - xyz[i,k])**2
                if d < cutoff_2:
                    seq_distance_sum += seq_dist 
                    contact_count += 1


    if contact_count==0.:
        return 0.

    return seq_distance_sum/float(contact_count)

然后时间：

%timeit abs_co = abs_contact_order(traj.xyz[0], seq_atoms, cutoff_nm=0.60)
1 loop, best of 3: 723 ms per loop

%timeit abs_co = abs_contact_order2(traj.xyz[0], seq_atoms, cutoff_nm=0.60)
10 loops, best of 3: 28.2 ms per loop

我从函数中取出了print语句，允许您以nopython模式运行Numba jit。如果您真的需要这些信息，我将返回所有必要的数据，并编写一个薄的包装器来检查返回值并根据需要打印任何调试信息。

另请注意，在对Numba函数进行计时时，您应首先通过调用定时循环外的函数来“预热”jit。通常，如果您要多次调用该函数，Numba jit函数所需的时间大于单次调用的时间，因此您无法很好地了解代码的速度。但是，如果您只打算调用该函数一次并且启动时间很重要，请继续按原样计时。

<强>更新

你可以进一步提高速度，因为你循环（i，j）和（j，i）对并且它们是对称的：

@nb.njit
def abs_contact_order3(xyz, atoms_residue, cutoff_nm=.6):
    """Return the absolute contact order."""
    contact_count = 0
    seq_distance_sum = 0
    cutoff_2 = cutoff_nm*cutoff_nm
    N = len(atoms_residue)
    for i in xrange(N):
        for j in xrange(i+1,N):
            seq_dist = atoms_residue[j] - atoms_residue[i]
            if seq_dist > 0:
                d = 0.0
                for k in xrange(3):
                    d += (xyz[j,k] - xyz[i,k])**2
                if d < cutoff_2:
                    seq_distance_sum += 2*seq_dist 
                    contact_count += 2


    if contact_count==0.:
        return 0.

    return seq_distance_sum/float(contact_count)

和时间：

%timeit abs_co = abs_contact_order3(traj.xyz[0], seq_atoms, cutoff_nm=0.60)
100 loops, best of 3: 15.7 ms per loop

在Python中计算蛋白质接触顺序

2 个答案: