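I have a Pandas DataFrame with 1M rows and 3 columns (TrackP, TrackPt, NumLongTracks), and I want to find pairs of "matching" rows. Two rows match when the differences between their values in column 1 (TrackP), column 2 (TrackPt), and column 3 (NumLongTracks) are all within a given range, i.e. no more than ±1: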
   TrackP  TrackPt  NumLongTracks
1    2801      544            102
2    2805      407             65
3    2802      587             70
4    2807      251            145
5    2802      543            101
6    2800      545            111
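In this case, you would keep only the pair formed by rows 1 and 5, because for that pair
TrackP(row 1) - TrackP(row 5) = -1,
TrackPt(row 1) - TrackPt(row 5) = +1,
NumLongTracks(row 1) - NumLongTracks(row 5) = +1
This is trivial when the values are exactly the same between rows, but for this particular case I am struggling to work out the best approach.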
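To make the criterion concrete, here is a minimal sketch that rebuilds the sample table and checks a single pair of rows. The name `track` and the helper `is_match` are assumptions introduced for illustration; they are not part of the original data or of any library.

```python
import pandas as pd

# Sample data from the table above, keeping the 1-based row labels.
track = pd.DataFrame(
    {
        "TrackP": [2801, 2805, 2802, 2807, 2802, 2800],
        "TrackPt": [544, 407, 587, 251, 543, 545],
        "NumLongTracks": [102, 65, 70, 145, 101, 111],
    },
    index=range(1, 7),
)


def is_match(a, b, tol=1):
    """Hypothetical helper: True if rows a and b differ by at most tol in every column."""
    return bool(((a - b).abs() <= tol).all())


# Column-wise differences between row 1 and row 5.
diff = track.loc[1] - track.loc[5]
print(diff.tolist())
print(is_match(track.loc[1], track.loc[5]))
print(is_match(track.loc[1], track.loc[6]))
```

Rows 1 and 5 pass the check, while rows 1 and 6 fail because NumLongTracks differs by 9. Checking every pair this way is quadratic in the number of rows, which is the part that becomes a problem at 1M rows.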
Answer 0 (score: 0)
I think this is easier to handle if you compare the columns as a single value.
import pandas as pd
# track is the DataFrame from the question
# new Series: one string key per row (assumes 4-digit TrackP and 3-digit TrackPt)
tr = track.TrackP.astype(str) + track.TrackPt.astype(str) + track.NumLongTracks.astype(str)
# finding matching routes
matching = []
for i, r in zip(tr.index, tr):
    # split the key back into TrackP, TrackPt, NumLongTracks
    p, pt, nl = int(r[0:4]), int(r[4:7]), int(r[7:])
    for i2, r2 in zip(tr.index, tr):
        if i2 == i:  # do not pair a row with itself
            continue
        if abs(int(r2[0:4]) - p) <= 1:  # TrackP in range
            if abs(int(r2[4:7]) - pt) <= 1:  # TrackPt in range
                if abs(int(r2[7:]) - nl) <= 1:  # NumLongTracks in range
                    pair = [r, r2]
                    matching.append(pair)
# back to the format
#[['2801544102', '2802543101'], ['2802543101', '2801544102']]
import collections
routes = collections.defaultdict(list)
for seq in matching:
    routes['TrackP'].append(int(seq[0][0:4]))
    routes['TrackPt'].append(int(seq[0][4:7]))
    routes['NumLongTracks'].append(int(seq[0][7:]))
Now you can easily unpack the result into a DataFrame:
df = pd.DataFrame.from_dict(dict(routes))
print(df)
   TrackP  TrackPt  NumLongTracks
0    2801      544            102
1    2802      543            101
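The nested loop above compares every row with every other row, which is unlikely to finish in reasonable time on 1M rows. As a possible alternative (a different technique from the string-key approach, not something given in the original answer), the sketch below shifts a copy of the frame by every combination of -1, 0, +1 per column and does an exact-equality `merge` for each shift. It assumes the three columns hold integers; the function name `find_close_pairs` and the column names `i` and `j` are made up for this example.

```python
from itertools import product

import pandas as pd


def find_close_pairs(df, cols, tol=1):
    """Return index pairs (i, j), i < j, whose rows differ by at most tol in every column.

    Sketch only: assumes integer-valued columns, so "within tol" can be
    expressed as a finite set of exact shifts.
    """
    left = df[cols].rename_axis("i").reset_index()
    right = df[cols].rename_axis("j").reset_index()
    parts = []
    # One equality join per combination of shifts (27 joins for 3 columns, tol=1).
    for shift in product(range(-tol, tol + 1), repeat=len(cols)):
        shifted = right.copy()
        for col, d in zip(cols, shift):
            shifted[col] = shifted[col] + d
        merged = left.merge(shifted, on=cols)[["i", "j"]]
        # i < j drops self-matches and the mirrored copy of each pair.
        parts.append(merged[merged["i"] < merged["j"]])
    pairs = pd.concat(parts, ignore_index=True)
    return pairs.sort_values(["i", "j"]).reset_index(drop=True)


# Sample data from the question, keeping the 1-based row labels.
track = pd.DataFrame(
    {
        "TrackP": [2801, 2805, 2802, 2807, 2802, 2800],
        "TrackPt": [544, 407, 587, 251, 543, 545],
        "NumLongTracks": [102, 65, 70, 145, 101, 111],
    },
    index=range(1, 7),
)

pairs = find_close_pairs(track, ["TrackP", "TrackPt", "NumLongTracks"])
print(pairs)
```

On the sample data this should report a single pair, rows 1 and 5. Each join only produces rows that really match, so the work stays roughly proportional to the input size plus the number of matches, at the cost of running 27 joins instead of one pass.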