Computing the distances between a set of points with numpy matrices

Asked: 2015-11-16 22:14:12

Tags: python numpy

I'm new to vectorization... this seems like a hairy problem to get working with numpy rather than loops.

I have a set of training data and a set of queries. I need to compute the distance between each query and every point in the training data, then sort to take the k nearest neighbors. I could implement this in a for loop, but speed matters. Also, the training data comes in a format where each entry is a longer list than the incoming query points... I'll show:

 xtrain = [[0.5,0.3,0.1232141],...] #for a large number of items.

 xquery = [[0.1,0.2],[0.3,0.4],...] #for a small number of items. 

The distances I need are the Euclidean distances between a query and the training data... so:

 import numpy as np

 def distance(p1, p2):
     sum_of_squares = sum([(p1[i] - p2[i])**2.0 for i in range(len(p1))])
     return np.sqrt(sum_of_squares)
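
For reference, the same per-pair distance can be written without the Python loop; a minimal sketch, assuming p1 and p2 are equal-length sequences (distance_np is an illustrative name, not from the original):

 def distance_np(p1, p2):
     # vectorized equivalent of the loop above
     return np.sqrt(np.sum((np.asarray(p1) - np.asarray(p2)) ** 2.0))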

Then I need to sort by distance, take the k nearest, and average the last value of those k training rows...

So basically, I need a function of xquery and xtrain that produces an array like this:

xdist = [[(distance, last_value), ... (k times)] for each query in xquery]
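
For concreteness, with k = 2 a single entry of that array might look like this (the numbers are made up for illustration):

# hypothetical entry for one query, k = 2:
# each pair is (distance to a training row, that row's last value)
[(0.12, 0.9), (0.31, 0.4)]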

The conventional for-loop version looks like this:

def distance(p1, p2):
    sum_of_squares = sum([(p1[i] - p2[i])**2.0 for i in range(len(p1))])
    return np.sqrt(sum_of_squares)

qX = data[train_rows:train_rows+5, 0:-1]
k = 4

# for each query, sort all training rows by distance, keep the k nearest
# (distance, last_value) pairs, then average the last values
k_nearest_neighbors = [np.array(sorted([(distance(qX[i], trainX[j]), trainX[j][-1])
                                        for j in range(len(trainX))], key=lambda t: t[0]))[:k]
                       for i in range(len(qX))]
predictions = [np.average([j[1] for j in i]) for i in k_nearest_neighbors]

I kept the k_nearest_neighbors step compact; I realize it isn't very clear... but I thought it would be easier to vectorize from there.

Anyway, I don't see how to do this with slicing... it seems like it should be possible...

1 answer:

Answer 0 (score: 1)

It's definitely possible to do this with numpy broadcasting. It looks like this:

# squared distance from each query to each training point, via broadcasting:
# (n_q, 1, 2) - (1, n_t, 2) -> (n_q, n_t, 2), summed over the last axis;
# the sqrt is skipped because it doesn't change the ordering
D = np.sum((qX[:, None, :] - trainX[None, :, :2]) ** 2, -1)
# indices of the k smallest distances per row (unsorted, which is fine for averaging)
ind = np.argpartition(D, k, axis=1)[:, :k]
# average the last column of the k nearest training rows
predictions = trainX[ind, 2].mean(1)
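
To see why the shapes line up, here is a quick sketch of the broadcasting step, assuming 50 queries and 100 training rows:

import numpy as np

qX = np.random.rand(50, 2)       # queries: (50, 2)
trainX = np.random.rand(100, 3)  # training rows: (100, 3), last column is the value

diff = qX[:, None, :] - trainX[None, :, :2]  # (50, 1, 2) - (1, 100, 2) -> (50, 100, 2)
D = np.sum(diff ** 2, -1)                    # (50, 100): squared distance, query i to row j
print(D.shape)                               # (50, 100)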

To confirm that this works, we can define functions implementing your for-loop approach and my broadcasting approach, and compare the results:

def with_for_loop(qX, trainX, k):
    def distance(p1,p2):
        sum_of_squares = sum([(p1[i] - p2[i])**2.0 for i in range(len(p1))])
        return np.sqrt(sum_of_squares)

    k_nearest_neighbors = [np.array(sorted([(distance(qX[i],trainX[j]),trainX[j][-1])
                                            for j in range(len(trainX))],key=lambda t: t[0]))[:k]
                           for i in range(len(qX))]
    return [np.average([j[1] for j in i])
            for i in k_nearest_neighbors]

def with_broadcasting(qX, trainX, k):
    D = np.sum((qX[:, None, :] - trainX[None, :, :2]) ** 2, -1)
    ind = np.argpartition(D, k, axis=1)[:, :k]
    return trainX[ind, 2].mean(1)

# Test the results:
np.random.seed(0)
trainX = np.random.rand(100, 3)
qX = np.random.rand(50, 2)

np.allclose(with_for_loop(qX, trainX, 4),
            with_broadcasting(qX, trainX, 4))
# True

Keep in mind that as your data grows, it will be more efficient to find the nearest neighbors with a tree-based method like scipy.spatial.cKDTree:

from scipy.spatial import cKDTree

def with_kd_tree(qX, trainX, k):
    dist, ind = cKDTree(trainX[:, :2]).query(qX, k)
    return trainX[ind, 2].mean(1)

np.allclose(with_broadcasting(qX, trainX, 4),
            with_kd_tree(qX, trainX, 4))
# True

Timing the execution, we can see that these approaches perform substantially better on a larger dataset:

np.random.seed(0)
trainX = np.random.rand(1000, 3)
qX = np.random.rand(1000, 2)

%timeit with_for_loop(qX, trainX, 4)
1 loops, best of 3: 7.16 s per loop

%timeit with_broadcasting(qX, trainX, 4)
10 loops, best of 3: 57.7 ms per loop

%timeit with_kd_tree(qX, trainX, 4)
1000 loops, best of 3: 1.61 ms per loop