I have a bunch of files containing the XY positions of various particles over time. I am trying to go through the particles and check whether any of them should be joined together based on a distance criterion. I am very new to programming, and everything I have tried so far is extremely slow. I store all of this data in a pandas DataFrame. For example, I might have:
particle  frame  x    y
1         2      300  400
1         3      301  401
1         4      300  400
1         5      301  400
2         10     302  401
2         11     301  402
2         12     300  401
and I would like it to become:
particle  frame  x    y
1         2      300  400
1         3      301  401
1         4      300  400
1         5      301  400
1         10     302  401
1         11     301  402
1         12     300  401
since the particles are essentially at the same location, even though a few frames are missing in between. For example, particle 1's last position (301, 400) at frame 5 and particle 2's first position (302, 401) at frame 10 are only one pixel apart in each direction, so particle 2's rows should be relabelled as particle 1. The actual DataFrames may contain anywhere from a few hundred to a few thousand particles.
I first tried simply looping over the last frame of each particle, and inside that loop looping over the first frames of all other particles:
import os
import pandas as pd

data = pd.read_excel(os.path.join(mydir, file), header=None, names=['particle', 'frame', 'x', 'y'], sheetname='Sheet3')
exits = data.groupby('particle', as_index=False).apply(lambda p: p.tail(1))
exits = exits.groupby('particle', as_index=False).filter(lambda p: p.frame.values[0] < 301)  # find all particle exits. 301 is the last frame, so nothing will be joined after this
particles = exits['particle'].unique()  # list of unique particles to check for links
entries = data.groupby('particle').apply(lambda p: p.iloc[0])
entries = entries.groupby('particle', as_index=False).filter(lambda p: p.frame.values[0] > 2)  # find all entry points for particles. if it enters in the first frame, it won't be linked to anything prior
entries.sort_values('frame', inplace=True)  # make sure the entries are in order, as we will stop after the closest match is found

for i in range(0, len(particles)):  # loop through each particle exit
    inddata = exits.iloc[i]  # get current exit
    subset_entries = entries[entries['frame'].values > inddata['frame']]  # get list of particles that enter after this one exits
    for j in range(0, subset_entries.shape[0]):  # go through all entries
        entry = subset_entries.iloc[j]  # get current entry
        msd = conversion**2 * ((entry['x'] - inddata['x'])**2 + (entry['y'] - inddata['y'])**2)  # calculate squared displacement between exit and entry
        if msd < cutoff:  # check if it is a match
            droppart = entry['particle']  # particle must be removed from the list of entries so we don't match it to some other particle later
            entries = entries.drop(droppart)  # drop the particle from entries
            if len(exits[exits['particle'] == droppart]) > 0:  # if the entry particle that was just linked is in the list of exits, we have to update the particle id
                id = exits[exits['particle'] == droppart].index.labels[0][0]
                exits.loc[id, 'particle'] = exits.iloc[i].particle  # now any future exit has that particle number
            ind = data[data['particle'] == droppart].index  # find location of linked particle in original dataframe
            data.loc[ind, 'particle'] = exits.iloc[i].particle  # link the particles together in the original dataframe
            break  # stop looking for matches to this particle
This takes up to 30 minutes per file. I then tried looping over the exits and checking for matching entries with some lambda functions, mostly without really knowing what I was doing:
import numpy as np

def find_entries(particle, entries, conversion, cutoff):
    # keep only entries that start after this particle's exit frame
    comp_entries = entries.groupby('particle').filter(lambda q: q.frame.values[0] > particle.frame.values[0])
    # keep only entries within the squared-distance cutoff
    comp_entries = comp_entries.groupby('particle').filter(lambda q: conversion**2 * ((q.x.values[0] - particle.x.values[0])**2 + (q.y.values[0] - particle.y.values[0])**2) < cutoff)
    # time gap between this exit and each remaining candidate entry
    dist = comp_entries.groupby('particle').apply(lambda q: q.frame.values[0] - particle.frame.values[0])
    if len(dist) > 0:
        min_particle = dist.argmin()  # particle id of the candidate with the smallest time gap
        return min_particle
    else:
        return np.nan

exits.groupby('particle').apply(lambda p: find_entries(p, entries, conversion, cutoff))
This was many times slower, because it computes the distance for every entry instead of stopping once a match is found. Is there actually a way to reduce the time it takes to find a match for each particle? I am very new to programming and I don't really know how to optimize something like this; I would also appreciate any general tips if what I am doing is bad coding practice, etc.
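To make the distance criterion concrete, here is a minimal sketch of what the search for a single exit is supposed to compute, written with NumPy arrays. It assumes the same entries, conversion, and cutoff objects as above, with entries sorted by frame; the first_match helper is only an illustration, not part of my script:

import numpy as np

# Sketch only: for one exit row, find the earliest entry that starts after it
# and lies within the squared-distance cutoff.
def first_match(exit_row, entries, conversion, cutoff):
    later = entries[entries['frame'].values > exit_row['frame']]  # entries that begin after this exit
    dx = later['x'].values - exit_row['x']
    dy = later['y'].values - exit_row['y']
    close = conversion**2 * (dx**2 + dy**2) < cutoff  # boolean mask of candidates within the cutoff
    if close.any():
        # 'entries' (and hence 'later') is sorted by frame, so the first True is the earliest match
        return later['particle'].values[np.argmax(close)]
    return None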
Answer 0 (score: 0)
DataFrames are efficient for other kinds of computation. What I think you need here is a dict (associative array, hash). When reading the data in Python, I would do something like this:
particles = {}   # maps an (x, y) location to the id of the particle last seen there
frames = {}      # maps a particle id to its list of (frame, x, y) observations
for line in open(datafile):
    particle, frame, x, y = line.split(delimiter)
    particle, frame, x, y = int(particle), int(frame), int(x), int(y)
    success = False
    # Look for an existing particle within one pixel of this observation
    for xcoord in range(x - 1, x + 2):
        for ycoord in range(y - 1, y + 2):
            # Try adding this frame to an existing location
            try:
                particle = particles[(xcoord, ycoord)]
                # Guess it exists: reuse that particle's id
                frames[particle].append((frame, x, y))
                success = True
                break
            except KeyError:
                pass
        if success:
            break
    if not success:
        # No nearby particle found: start a new track at this location
        particles[(x, y)] = particle
        frames[particle] = [(frame, x, y)]
That is one way to do it. I have tried to make it readable rather than short. The idea is to keep a hash of existing particle locations, with each particle id connected to a list of (frame, x, y) tuples. Reading the data this way should be much more efficient. If you need pandas, you can transfer it into a DataFrame afterwards.
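For example, a minimal sketch of that last step, assuming the frames dict built above, could look like:

import pandas as pd

# Sketch only: rebuild a DataFrame like the original one from the 'frames'
# dict built above, now carrying the linked particle ids.
rows = [
    (particle, frame, x, y)
    for particle, observations in frames.items()
    for frame, x, y in observations
]
data = pd.DataFrame(rows, columns=['particle', 'frame', 'x', 'y'])
data = data.sort_values(['particle', 'frame']).reset_index(drop=True)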