Pandas: get pairs of rows with similar column values (differences within a given range)

Time: 2019-11-28 09:35:37

Tags: python pandas

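I have a Pandas DataFrame with 1M rows and 3 columns (TrackP, TrackPt, NumLongTracks), and I want to find pairs of "matching" rows. Two rows match when the differences between their values in column 1 (TrackP), column 2 (TrackPt), and column 3 (NumLongTracks) are all within a given range, i.e. no more than ±1. For example, given: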

     TrackP  TrackPt   NumLongTracks
1    2801    544        102
2    2805    407        65
3    2802    587        70
4    2807    251        145
5    2802    543        101
6    2800    545        111

In this case, you would keep only the pair of rows 1 and 5, because for that pair:

TrackP(row 1) - TrackP(row 5) = -1,
TrackPt(row 1) - TrackPt(row 5) = +1,
NumLongTracks(row 1) - NumLongTracks(row 5) = +1

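This is trivial when the values have to be exactly equal between rows, but for this particular case I am struggling to work out the best approach.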

1 answer:

Answer 0: (score: 0)

I think the comparison is easier to handle if the columns are combined into a single value.

import pandas as pd

# Concatenate the three columns of `track` (the question's DataFrame) into one string per row.
# Assumes TrackP always has 4 digits and TrackPt always has 3, as in the sample data.
tr = track.TrackP.astype(str) + track.TrackPt.astype(str) + track.NumLongTracks.astype(str)

# Find matching rows: every column must differ by at most 1
matching = []
for i, r in zip(tr.index, tr):
    p, pt, nl = int(r[0:4]), int(r[4:7]), int(r[7:])
    for j, r2 in zip(tr.index, tr):
        if i == j:  # do not compare a row with itself
            continue
        if (abs(int(r2[0:4]) - p) <= 1            # TrackP in range
                and abs(int(r2[4:7]) - pt) <= 1   # TrackPt in range
                and abs(int(r2[7:]) - nl) <= 1):  # NumLongTracks in range
            matching.append([r, r2])


# Convert the matched strings back into columns.
# For the sample data, matching is:
# [['2801544102', '2802543101'], ['2802543101', '2801544102']]
import collections

routes = collections.defaultdict(list)

for seq in matching:
    routes['TrackP'].append(int(seq[0][0:4]))
    routes['TrackPt'].append(int(seq[0][4:7]))
    routes['NumLongTracks'].append(int(seq[0][7:]))

Now you can easily unpack the result into a DataFrame with:

df = pd.DataFrame.from_dict(dict(routes))

print(df)
   TrackP  TrackPt  NumLongTracks
0    2801      544            102
1    2802      543            101
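
The nested loop above compares every row with every other row, which is unlikely to be practical for the 1M rows mentioned in the question. As a rough alternative sketch (not part of the original answer), one option is to self-merge the frame on TrackP shifted by -1, 0 and +1, and then filter the candidate pairs on the remaining columns. The helper name `find_close_pairs`, the `_key` column and the `i`/`j` labels are made up for illustration. The sketch assumes integer-valued columns, an integer tolerance, and a unique, sortable index. It can still produce many candidates if a large number of rows share nearly the same TrackP.

```python
import pandas as pd

def find_close_pairs(df, cols, tol=1):
    """Return index pairs (i, j) with i < j whose values differ by at most
    `tol` in every column of `cols`. Assumes integer columns (hypothetical helper)."""
    key = cols[0]
    left = df[cols].rename_axis("i").reset_index()
    pieces = []
    for shift in range(-tol, tol + 1):
        right = df[cols].rename_axis("j").reset_index()
        # Rows join when left[key] == right[key] + shift, i.e. the keys differ by `shift`
        right["_key"] = right[key] + shift
        merged = left.merge(right, left_on=key, right_on="_key", suffixes=("_i", "_j"))
        # Keep each unordered pair once and drop self-matches
        ok = merged["i"] < merged["j"]
        # Check the remaining columns against the tolerance
        for c in cols[1:]:
            ok &= (merged[f"{c}_i"] - merged[f"{c}_j"]).abs() <= tol
        pieces.append(merged.loc[ok, ["i", "j"]])
    return pd.concat(pieces, ignore_index=True)

# Sample data from the question
track = pd.DataFrame(
    {
        "TrackP": [2801, 2805, 2802, 2807, 2802, 2800],
        "TrackPt": [544, 407, 587, 251, 543, 545],
        "NumLongTracks": [102, 65, 70, 145, 101, 111],
    },
    index=[1, 2, 3, 4, 5, 6],
)

pairs = find_close_pairs(track, ["TrackP", "TrackPt", "NumLongTracks"])
print(pairs)  # the sample data yields the single pair (1, 5)
```

The matching rows themselves can then be looked up with `track.loc[pairs["i"]]` and `track.loc[pairs["j"]]`. Unlike the string-concatenation approach, this does not depend on each column having a fixed number of digits.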