How can I adjust this code to also return the second and third "nearest neighbors"?

Time: 2019-12-19 20:50:26

Tags: python pandas knn nearest-neighbor euclidean-distance

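Based on the code in "calculating average distance of nearest neighbours in pandas dataframe", how can I adapt it so that the second and third nearest neighbors are returned in new columns?

(Or create an adjustable parameter that defines how many neighbors to return):

Example code: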

import numpy as np 
from sklearn.neighbors import NearestNeighbors
import pandas as pd

def nn(x):
    nbrs = NearestNeighbors(
        n_neighbors=2, 
        algorithm='auto', 
        metric='euclidean'
    ).fit(x)
    distances, indices = nbrs.kneighbors(x)
    return distances, indices

time = [0, 0, 0, 1, 1, 2, 2]
x = [216, 218, 217, 280, 290, 130, 132]
y = [13, 12, 12, 110, 109, 3, 56] 
car = [1, 2, 3, 1, 3, 4, 5]
df = pd.DataFrame({'time': time, 'x': x, 'y': y, 'car': car})

#This has the index of the nearest neighbor in the group, as well as the distance
nns = df.drop(columns='car').groupby('time').apply(lambda g: nn(g.to_numpy()))

groups = df.groupby('time')
nn_rows = []
for i, nn_set in enumerate(nns):
    group = groups.get_group(i)
    for j, tup in enumerate(zip(nn_set[0], nn_set[1])):
        nn_rows.append({'time': i,
                    'car': group.iloc[j]['car'],
                    'nearest_neighbour': group.iloc[tup[1][1]]['car'],
                    'euclidean_distance': tup[0][1]})

nn_df = pd.DataFrame(nn_rows).set_index('time')

Resulting DataFrame:

>>> nn_df
time car euclidean_distance nearest_neighbour           
0    1   1.414214           3
0    2   1.000000           3
0    3   1.000000           2
1    1   10.049876          3
1    3   10.049876          1
2    4   53.037722          5
2    5   53.037722          4

How can I get the output for nearest neighbors 2, 3, and N and insert it into new columns?

1 Answer:

Answer 0: (score: 1)

Here is the documentation for the NearestNeighbors class.

I think your problem can be solved with the n_neighbors parameter. It specifies the number of nearest neighbors for which indices and distances are returned.

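When the goal is to find the single nearest neighbor other than the point itself, the value typically used is 2. If the query points are the same points the model was fitted on, the closest neighbor is always the point itself, because its distance is 0.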
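As a quick check of that behaviour, here is a minimal sketch using only the three cars at time 0 from the question. The variable names `points`, `self_distances` and `nearest_other` are made up for this illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# The three cars at time 0 from the question's data: (x, y) per car.
points = np.array([[216, 13], [218, 12], [217, 12]])

nbrs = NearestNeighbors(n_neighbors=2, metric='euclidean').fit(points)
distances, indices = nbrs.kneighbors(points)

# Column 0 pairs each point with itself; column 1 is the nearest *other* point.
self_distances = distances[:, 0]
nearest_other = indices[:, 1]
```

Here `indices[:, 0]` simply repeats each row's own position, which is why the question's code reads the neighbor from position 1.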

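To find the second and third nearest neighbors, set n_neighbors to 4. This returns the point itself, followed by the next n_neighbors - 1 nearest neighbors. Note that n_neighbors cannot exceed the number of points the model was fitted on, so a group with fewer than four rows needs a smaller value.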

# Argument
n_neighbors = 4

# Indices
[point_itself, neighbor_1, neighbor_2, neighbor_3]

# Distances
[ 0, distance_1, distance_2, distance_3]
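Putting this together for the question's DataFrame, one possible sketch of an adjustable version follows. The helper name `k_nearest_per_group`, its parameters, and the output column names `neighbour_1`, `distance_1`, ... are assumptions made for this example, not part of pandas or scikit-learn. It requests `k + 1` neighbors (the extra one is the point itself), caps that at the group size, and fills the missing neighbors with NaN.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

def k_nearest_per_group(df, k=3, group_col='time', id_col='car', coords=('x', 'y')):
    """One row per point, with its 1st..k-th nearest neighbours within its group."""
    rows = []
    for key, group in df.groupby(group_col):
        pts = group[list(coords)].to_numpy()
        ids = group[id_col].to_numpy()
        # k + 1 because the point itself comes back first; capped at the group size.
        n = min(k + 1, len(group))
        nbrs = NearestNeighbors(n_neighbors=n, metric='euclidean').fit(pts)
        dist, idx = nbrs.kneighbors(pts)
        for j in range(len(group)):
            row = {group_col: key, id_col: ids[j]}
            for r in range(1, k + 1):
                has_neighbour = r < n  # small groups have fewer than k neighbours
                row[f'neighbour_{r}'] = ids[idx[j, r]] if has_neighbour else np.nan
                row[f'distance_{r}'] = dist[j, r] if has_neighbour else np.nan
            rows.append(row)
    return pd.DataFrame(rows).set_index(group_col)

df = pd.DataFrame({
    'time': [0, 0, 0, 1, 1, 2, 2],
    'x': [216, 218, 217, 280, 290, 130, 132],
    'y': [13, 12, 12, 110, 109, 3, 56],
    'car': [1, 2, 3, 1, 3, 4, 5],
})
nn_df = k_nearest_per_group(df, k=2)
```

With this data only the group at time 0 has three cars, so the second-neighbor columns are NaN for the rows at times 1 and 2.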