I have a bunch of files containing the XY positions of various particles over time. I am trying to go through the particles and check whether any of them should be joined together based on a distance criterion. I am very new to programming, and everything I have tried so far is extremely slow. I store all of this data in a pandas DataFrame. For example, I might have:
particle  frame  x    y
1         2      300  400
1         3      301  401
1         4      300  400
1         5      301  400
2         10     302  401
2         11     301  402
2         12     300  401
and I would like it to become:
particle  frame  x    y
1         2      300  400
1         3      301  401
1         4      300  400
1         5      301  400
1         10     302  401
1         11     301  402
1         12     300  401
since the particles are essentially at the same location, even though a few frames are missing in between. For example, particle 1's last position (301, 400) at frame 5 and particle 2's first position (302, 401) at frame 10 are only one pixel apart in each direction, so particle 2's rows should be relabelled as particle 1. The actual DataFrames may contain anywhere from a few hundred to a few thousand particles.
I first tried simply looping over the last frame of each particle, and inside that loop looping over the first frames of all other particles:
import os
import pandas as pd

data = pd.read_excel(os.path.join(mydir, file), header=None, names=['particle', 'frame', 'x', 'y'], sheetname='Sheet3')
exits = data.groupby('particle', as_index=False).apply(lambda p: p.tail(1))
exits = exits.groupby('particle', as_index=False).filter(lambda p: p.frame.values[0] < 301)  # find all particle exits. 301 is the last frame, so nothing will be joined after this
particles = exits['particle'].unique()  # list of unique particles to check for links
entries = data.groupby('particle').apply(lambda p: p.iloc[0])
entries = entries.groupby('particle', as_index=False).filter(lambda p: p.frame.values[0] > 2)  # find all entry points for particles. if it enters in the first frame, it won't be linked to anything prior
entries.sort_values('frame', inplace=True)  # make sure the entries are in order, as we will stop after the closest match is found

for i in range(0, len(particles)):  # loop through each particle exit
    inddata = exits.iloc[i]  # get current exit
    subset_entries = entries[entries['frame'].values > inddata['frame']]  # get list of particles that enter after this one exits
    for j in range(0, subset_entries.shape[0]):  # go through all entries
        entry = subset_entries.iloc[j]  # get current entry
        msd = conversion**2 * ((entry['x'] - inddata['x'])**2 + (entry['y'] - inddata['y'])**2)  # calculate squared displacement between exit and entry
        if msd < cutoff:  # check if it is a match
            droppart = entry['particle']  # particle must be removed from the list of entries so we don't match it to some other particle later
            entries = entries.drop(droppart)  # drop the particle from entries
            if len(exits[exits['particle'] == droppart]) > 0:  # if the entry particle that was just linked is in the list of exits, we have to update the particle id
                id = exits[exits['particle'] == droppart].index.labels[0][0]
                exits.loc[id, 'particle'] = exits.iloc[i].particle  # now any future exit has that particle number
            ind = data[data['particle'] == droppart].index  # find location of linked particle in original dataframe
            data.loc[ind, 'particle'] = exits.iloc[i].particle  # link the particles together in the original dataframe
            break  # stop looking for matches to this particle
This takes up to 30 minutes per file. I then tried looping over the exits and checking for matching entries with some lambda functions, mostly without really knowing what I was doing:
import numpy as np

def find_entries(particle, entries, conversion, cutoff):
    # keep only entries that start after this particle's exit frame
    comp_entries = entries.groupby('particle').filter(lambda q: q.frame.values[0] > particle.frame.values[0])
    # keep only entries within the squared-distance cutoff
    comp_entries = comp_entries.groupby('particle').filter(lambda q: conversion**2 * ((q.x.values[0] - particle.x.values[0])**2 + (q.y.values[0] - particle.y.values[0])**2) < cutoff)
    # time gap between this exit and each remaining candidate entry
    dist = comp_entries.groupby('particle').apply(lambda q: q.frame.values[0] - particle.frame.values[0])
    if len(dist) > 0:
        min_particle = dist.argmin()  # particle id of the candidate with the smallest time gap
        return min_particle
    else:
        return np.nan

exits.groupby('particle').apply(lambda p: find_entries(p, entries, conversion, cutoff))
This was many times slower, because it computes the distance for every entry instead of stopping once a match is found. Is there actually a way to reduce the time it takes to find a match for each particle? I am very new to programming and I don't really know how to optimize something like this; I would also appreciate any general tips if what I am doing is bad coding practice, etc.
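To make the distance criterion concrete, here is a minimal sketch of what the search for a single exit is supposed to compute, written with NumPy arrays. It assumes the same entries, conversion, and cutoff objects as above, with entries sorted by frame; the first_match helper is only an illustration, not part of my script:

import numpy as np

# Sketch only: for one exit row, find the earliest entry that starts after it
# and lies within the squared-distance cutoff.
def first_match(exit_row, entries, conversion, cutoff):
    later = entries[entries['frame'].values > exit_row['frame']]  # entries that begin after this exit
    dx = later['x'].values - exit_row['x']
    dy = later['y'].values - exit_row['y']
    close = conversion**2 * (dx**2 + dy**2) < cutoff  # boolean mask of candidates within the cutoff
    if close.any():
        # 'entries' (and hence 'later') is sorted by frame, so the first True is the earliest match
        return later['particle'].values[np.argmax(close)]
    return None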
Answer 0 (score: 0)
DataFrames are efficient for other kinds of computation. What I think you need here is a dict (associative array, hash). When reading the data in Python, I would do something like this:
particles = {}   # maps an (x, y) location to the id of the particle last seen there
frames = {}      # maps a particle id to its list of (frame, x, y) observations
for line in open(datafile):
    particle, frame, x, y = line.split(delimiter)
    particle, frame, x, y = int(particle), int(frame), int(x), int(y)
    success = False
    # Look for an existing particle within one pixel of this observation
    for xcoord in range(x - 1, x + 2):
        for ycoord in range(y - 1, y + 2):
            # Try adding this frame to an existing location
            try:
                particle = particles[(xcoord, ycoord)]
                # Guess it exists: reuse that particle's id
                frames[particle].append((frame, x, y))
                success = True
                break
            except KeyError:
                pass
        if success:
            break
    if not success:
        # No nearby particle found: start a new track at this location
        particles[(x, y)] = particle
        frames[particle] = [(frame, x, y)]
That is one way to do it. I have tried to make it readable rather than short. The idea is to keep a hash of existing particle locations, with each particle id connected to a list of (frame, x, y) tuples. Reading the data this way should be much more efficient. If you need pandas, you can transfer it into a DataFrame afterwards.
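For example, a minimal sketch of that last step, assuming the frames dict built above, could look like:

import pandas as pd

# Sketch only: rebuild a DataFrame like the original one from the 'frames'
# dict built above, now carrying the linked particle ids.
rows = [
    (particle, frame, x, y)
    for particle, observations in frames.items()
    for frame, x, y in observations
]
data = pd.DataFrame(rows, columns=['particle', 'frame', 'x', 'y'])
data = data.sort_values(['particle', 'frame']).reset_index(drop=True)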