I'm new to vectorization... this seems like a hairy problem to work out with numpy instead of loops.
I have a set of training data and a list of queries. I need to compute the distance between each query and every point in the training data, then sort out the k nearest neighbors. I can implement this in a for loop, but speed is important. Also, the training data is formatted so that each entry is one item longer than the incoming points (the last item is the value I want to average)... I'll show what I mean:
xtrain = [[0.5, 0.3, 0.1232141], ...]   # for a large number of items
xquery = [[0.1, 0.2], [0.3, 0.4], ...]  # for a small number of items
The distance I need is the Euclidean distance between a query and each training point... so:
import numpy as np

def distance(p1, p2):
    sum_of_squares = sum([(p1[i] - p2[i])**2.0 for i in range(len(p1))])
    return np.sqrt(sum_of_squares)
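I can at least write that per-pair distance with numpy directly, something like this (a sketch assuming p1 and p2 are equal-length sequences):

import numpy as np

def distance(p1, p2):
    # elementwise difference, squared, summed, then rooted
    return np.sqrt(np.sum((np.asarray(p1) - np.asarray(p2)) ** 2))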
Then I need to sort by distance, take the k nearest, and average the remaining value (the last column) of those training rows...
So basically, I need a function of xquery and xtrain that produces, for each query, an array like:

xdist = [[distance, last_value], ...]  # k entries, one such list per query

The conventional for-loop version looks like:
def distance(p1, p2):
    sum_of_squares = sum([(p1[i] - p2[i])**2.0 for i in range(len(p1))])
    return np.sqrt(sum_of_squares)

qX = data[train_rows:train_rows + 5, 0:-1]
k = 4
k_nearest_neighbors = [np.array(sorted([(distance(qX[i], trainX[j]), trainX[j][-1])
                                        for j in range(len(trainX))],
                                       key=lambda t: t[0]))[:k]
                       for i in range(len(qX))]
predictions = [np.average([j[1] for j in i]) for i in k_nearest_neighbors]
I kept the k_nearest_neighbors step compact; I realize it isn't very readable... but I figured it would be easier to vectorize from there.
Anyway, I know how to work with slices... it seems like this should be possible...
Answer (score: 1):
It's definitely possible to do this with numpy broadcasting. It would look something like this:
# Squared distances between every query and the coordinate columns of every training point
D = np.sum((qX[:, None, :] - trainX[None, :, :2]) ** 2, -1)
# Indices of the k smallest distances in each row (not sorted within the k)
ind = np.argpartition(D, k, axis=1)[:, :k]
# Average the stored value (last column) over the k nearest training points
predictions = trainX[ind, 2].mean(1)
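To see what the broadcasting is doing, here is a minimal sketch with made-up sizes (5 queries, 20 training points) showing how the intermediate shapes line up:

import numpy as np

qX = np.random.rand(5, 2)       # 5 queries with 2 coordinates each
trainX = np.random.rand(20, 3)  # 20 training points: 2 coordinates + 1 value

diff = qX[:, None, :] - trainX[None, :, :2]  # (5, 1, 2) - (1, 20, 2) -> (5, 20, 2)
D = np.sum(diff ** 2, -1)                    # (5, 20): one squared distance per pair

Two details worth noting: the square root is omitted because it is monotonic and doesn't change which neighbors are nearest, and np.argpartition only guarantees that the k smallest distances come first, which is cheaper than a full np.argsort and is all the averaging step needs.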
To confirm this works, we can define functions implementing your for-loop approach and my broadcasting approach, and compare the results:
def with_for_loop(qX, trainX, k):
    def distance(p1, p2):
        sum_of_squares = sum([(p1[i] - p2[i])**2.0 for i in range(len(p1))])
        return np.sqrt(sum_of_squares)

    k_nearest_neighbors = [np.array(sorted([(distance(qX[i], trainX[j]), trainX[j][-1])
                                            for j in range(len(trainX))],
                                           key=lambda t: t[0]))[:k]
                           for i in range(len(qX))]
    return [np.average([j[1] for j in i])
            for i in k_nearest_neighbors]
def with_broadcasting(qX, trainX, k):
    D = np.sum((qX[:, None, :] - trainX[None, :, :2]) ** 2, -1)
    ind = np.argpartition(D, k, axis=1)[:, :k]
    return trainX[ind, 2].mean(1)
# Test the results:
np.random.seed(0)
trainX = np.random.rand(100, 3)
qX = np.random.rand(50, 2)
np.allclose(with_for_loop(qX, trainX, 4),
            with_broadcasting(qX, trainX, 4))
# True
Keep in mind that as your data grows, it will be more efficient to find the nearest neighbors with a tree-based method such as scipy.spatial.cKDTree, which builds a spatial index once rather than scanning every training point for every query:
from scipy.spatial import cKDTree

def with_kd_tree(qX, trainX, k):
    dist, ind = cKDTree(trainX[:, :2]).query(qX, k)
    return trainX[ind, 2].mean(1)
np.allclose(with_broadcasting(qX, trainX, 4),
            with_kd_tree(qX, trainX, 4))
# True
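One usage note: if the same training set serves many batches of queries, the tree can be built once and reused for each batch. A minimal sketch with made-up sizes:

from scipy.spatial import cKDTree
import numpy as np

trainX = np.random.rand(1000, 3)
tree = cKDTree(trainX[:, :2])         # build the spatial index once

qX = np.random.rand(50, 2)
dist, ind = tree.query(qX, k=4)       # distances and indices, each of shape (50, 4)
predictions = trainX[ind, 2].mean(1)  # average the stored value over the 4 neighbors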
Timing the execution on a larger dataset, we can see how substantially these approaches improve performance:
np.random.seed(0)
trainX = np.random.rand(1000, 3)
qX = np.random.rand(1000, 2)

%timeit with_for_loop(qX, trainX, 4)
# 1 loops, best of 3: 7.16 s per loop

%timeit with_broadcasting(qX, trainX, 4)
# 10 loops, best of 3: 57.7 ms per loop

%timeit with_kd_tree(qX, trainX, 4)
# 1000 loops, best of 3: 1.61 ms per loop
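Roughly speaking, the brute-force versions do work proportional to the number of queries times the number of training points, while the kd-tree answers each query in roughly logarithmic time after a one-time build, which is why the gap widens as the data grows.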