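I want to compute the mean squared displacement (MSD) of N particles, given each particle's position trajectory over time. My code uses three nested for loops, which makes it very slow. Could you help me replace the loops with some kind of vectorized numpy or pandas operation?
Here is my code: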
import numpy as np
import pandas as pd

ntime = 10 # number of times represented in data
atom_count = 3 # number of particles
norigin = 5 # number of origins is half number of time steps
nmin = 2 # minimum number of intervals to contribute to diffusivity
nmax = norigin # maximum number of intervals to contribute to diffusivity
dt = 1.0 # timestep
# creating sample trajectory of particles
traj = pd.DataFrame(np.random.rand(ntime*atom_count,3), columns=['x', 'y', 'z'])
traj['frame_id'] = np.repeat(np.arange(ntime)+1, atom_count)
traj['particle_id'] = np.tile(np.arange(atom_count)+1, ntime)
traj = traj[['frame_id', 'particle_id', 'x', 'y', 'z']]
print(traj.head(6))
ndata = traj.shape[0] # number of rows of data
# store mean square displacements in msd
time_vec= np.arange(dt, norigin*dt+1, dt)
msd_xyz = np.zeros((norigin,3))
# loop over all particles
for i in range(atom_count):
    # loop over all time origins
    for j in range(norigin):
        jstart = j*atom_count + i
        # loop over all time windows
        for k in range(nmin, nmax):
            kend = jstart + k*atom_count
            # .ix was removed in pandas 1.0; .loc is equivalent here (default integer index)
            msd_xyz[k, :] += (traj.loc[kend, ['x', 'y', 'z']].values -
                              traj.loc[jstart, ['x', 'y', 'z']].values)**2
msd_xyz = msd_xyz / (atom_count * norigin)
msd = np.mean(msd_xyz, axis=1) # total MSD averaged over x, y, z directions
print()
print("MSD (averaged over all particles and time origins):")
print(msd)
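Answer 0: (score: 1)
With numpy's indexing, all three nested loops can be vectorized with the help of meshgrid.
The key is that numpy arrays can be indexed with a list or array of integers of any shape, and the result takes the shape of the index: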
a = np.arange(5,10)
b = np.array([[0,2,4],[3,3,0]])
print(a[b])
# Output
# [[5 7 9]
#  [8 8 5]]
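We can therefore build a meshgrid from the arrays that the loops iterate over, retrieve every combination of i, j and k at once, and then sum over i and j.
Note that the array indexing has been moved to after .values, because numpy supports this kind of multidimensional indexing, whereas pandas indexers only accept 1D arrays.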
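As a small illustration of the point above, the sketch below checks that the meshgrid holds exactly the (i, j, k) triples visited by the three loops, and shows the shape produced when a 2D array is indexed with a 3D integer array. The sizes mirror the question's sample values; `xyz` is a stand-in array of random positions, not the real trajectory.

```python
import numpy as np

atom_count, norigin, nmin, nmax = 3, 5, 2, 5

i = np.arange(atom_count)
j = np.arange(norigin)
k = np.arange(nmin, nmax)

# With the default indexing='xy' the first two axes are swapped,
# so each grid has shape (len(j), len(i), len(k))
I, J, K = np.meshgrid(i, j, k)
grid_shape = I.shape

# Every (i, j, k) triple visited by the three nested loops
loop_triples = sorted((a, b, c) for a in i.tolist() for b in j.tolist() for c in k.tolist())
# Every triple stored in the meshgrid
grid_triples = sorted(zip(I.ravel().tolist(), J.ravel().tolist(), K.ravel().tolist()))

# Indexing the rows of a 2D array with a 3D integer array keeps the index
# shape and appends the column axis
xyz = np.random.rand(30, 3)   # stand-in for the x, y, z columns
rows = J * atom_count + I     # same formula as jstart
picked_shape = xyz[rows].shape
```

Because the origin and particle axes come first in the result, summing over axes 0 and 1 leaves one row per time window k.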
# define indexing arrays
k = np.arange(nmin,nmax)
j = np.arange(norigin)
i = np.arange(atom_count)
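I,J,K = np.meshgrid(i,j,k) # the meshgrid contains all the combinations of i,j,k,
                           # it is equivalent to the 3 nested loops
jstart = J*atom_count + I
kend = jstart + K*atom_count
xyz = traj[['x', 'y', 'z']].values # plain numpy array, shape (ndata, 3)
msd_xyz = np.zeros((norigin,3)) # start from a fresh accumulator
# jstart and kend have shape (norigin, atom_count, len(k)); sum over the origin and particle axes
msd_xyz[k,:] = np.sum((xyz[kend,:] - xyz[jstart,:])**2, axis=(0,1))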
msd_xyz = msd_xyz / (atom_count * norigin)
msd = np.mean(msd_xyz, axis=1) # total MSD averaged over x, y, z directions
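One way to gain confidence in the vectorized version is to run it next to the triple loop on the same data and compare. The following is a minimal sketch: the function names `msd_loops` and `msd_meshgrid` are made up for this example, and both take a plain `(ndata, 3)` numpy array rather than the DataFrame.

```python
import numpy as np

def msd_loops(xyz, atom_count, norigin, nmin, nmax):
    """Reference implementation: the three nested loops from the question."""
    msd_xyz = np.zeros((norigin, 3))
    for i in range(atom_count):
        for j in range(norigin):
            jstart = j * atom_count + i
            for k in range(nmin, nmax):
                kend = jstart + k * atom_count
                msd_xyz[k, :] += (xyz[kend] - xyz[jstart]) ** 2
    return msd_xyz / (atom_count * norigin)

def msd_meshgrid(xyz, atom_count, norigin, nmin, nmax):
    """Vectorized version: all (i, j, k) combinations at once via meshgrid."""
    k = np.arange(nmin, nmax)
    I, J, K = np.meshgrid(np.arange(atom_count), np.arange(norigin), k)
    jstart = J * atom_count + I
    kend = jstart + K * atom_count
    msd_xyz = np.zeros((norigin, 3))
    # Axes 0 and 1 are the origin and particle axes; axis 2 is the window k
    msd_xyz[k, :] = ((xyz[kend] - xyz[jstart]) ** 2).sum(axis=(0, 1))
    return msd_xyz / (atom_count * norigin)

rng = np.random.default_rng(0)
xyz = rng.random((10 * 3, 3))  # 10 frames, 3 particles, as in the question
same = np.allclose(msd_loops(xyz, 3, 5, 2, 5), msd_meshgrid(xyz, 3, 5, 2, 5))
```

With the DataFrame from the question, the input would be `traj[['x', 'y', 'z']].values`.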
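For the dimensions of the sample data in the question, this gives roughly a 60x speedup over the three nested loops. For large DataFrames, however, it may use too much memory and become slower. In that case it is better to combine loops and vectorization, vectorizing only one or two of the loops to keep memory usage down.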
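A possible shape for such a hybrid is sketched below: the loop over the time window k is kept, and the particle and origin loops are vectorized. It assumes the rows are sorted by frame and then by particle, as in the question's sample data, so that `jstart` runs over the contiguous rows `0 .. norigin*atom_count - 1` and plain slices can replace the index arrays. The function name `msd_loop_over_windows` is made up for this example.

```python
import numpy as np

def msd_loop_over_windows(xyz, atom_count, norigin, nmin, nmax):
    """Loop over time windows k only; vectorize over particles and origins.

    Assumes xyz has shape (ntime * atom_count, 3) with rows ordered by
    frame first, then by particle.
    """
    n = norigin * atom_count  # rows covering every (origin, particle) pair
    msd_xyz = np.zeros((norigin, 3))
    for k in range(nmin, nmax):
        shift = k * atom_count  # row offset corresponding to k frames later
        disp = xyz[shift:shift + n] - xyz[:n]  # one (n, 3) temporary per window
        msd_xyz[k] = (disp ** 2).sum(axis=0)
    return msd_xyz / (atom_count * norigin)

rng = np.random.default_rng(1)
xyz = rng.random((10 * 3, 3))  # 10 frames, 3 particles
msd_xyz = msd_loop_over_windows(xyz, atom_count=3, norigin=5, nmin=2, nmax=5)
msd = msd_xyz.mean(axis=1)  # average over the x, y, z directions
```

Only one `(n, 3)` temporary exists at a time here, instead of an array holding every (i, j, k) combination at once.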