I wrote a k-nearest-neighbor classifier in Python, and I've run into a problem: the array operations take too long.
import math

def classify(k, train_data, target):
    num_rows = train_data.shape[0]
    num_cols = train_data.shape[1]
    distances = []
    candidates = [0] * 10
    for i, row in enumerate(train_data):
        # The last column holds the label; the rest are features
        dist = euclidean_dist(target[:num_cols - 1], row[:num_cols - 1])  # slow
        distances.append((dist, row[num_cols - 1]))
    distances.sort(key=lambda tup: tup[0])
    distances = distances[:k]
    for i, d in enumerate(distances):
        candidates[d[1]] += 1
    return candidates.index(max(candidates))

def euclidean_dist(x1, x2):
    assert len(x1) == len(x2)
    result = 0
    for i in range(len(x1)):  # culprit: x1 and x2 are both lists of length 256
        result += math.pow(x1[i] - x2[i], 2)
    result = math.sqrt(result)
    return result
I've added comments in the code above to show where the slowdown occurs. Any suggestions for making it faster are welcome.
Answer 0 (score: 2)
It looks like you just want the Euclidean distance (the 2-norm), which you can get very efficiently with numpy (imported as np):
import numpy as np

def euclidean_dist2(x1, x2):
    assert len(x1) == len(x2)
    x1 = np.array(x1)
    x2 = np.array(x2)
    # 2-norm of the difference vector
    norm = np.linalg.norm(x1 - x2)
    return norm

print(euclidean_dist2([1, 2], [4, 7]))
This gives you approximately 5.83095189485 (the square root of 34), the same result as your function, but vectorized.
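To check the "same result" claim on inputs like the ones in the question, here is a minimal sketch that compares a plain Python loop against np.linalg.norm on two random vectors of length 256. The names loop_dist and norm_dist are made up for this illustration, and the random data is only a stand-in for the real features.

```python
import math
import numpy as np

def loop_dist(x1, x2):
    # Element-by-element loop, mirroring the approach of the question's euclidean_dist
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def norm_dist(x1, x2):
    # Vectorized version: a single NumPy call instead of a Python-level loop
    return np.linalg.norm(np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float))

# Stand-in data: two lists of 256 random floats (the real features are not shown in the question)
rng = np.random.default_rng(0)
x1 = rng.random(256).tolist()
x2 = rng.random(256).tolist()

# The two results should agree up to floating-point rounding
print(math.isclose(loop_dist(x1, x2), norm_dist(x1, x2), rel_tol=1e-9))
```

If you want to see the speed difference on your own machine, timing both functions with the standard timeit module is a simple way to do it.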
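Breaking it down, you are just taking the element-wise difference, multiplying the resulting vector by itself element-wise (squaring it), summing, and then taking the square root of the sum: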
def euclidean_dist3(x1, x2):
    assert len(x1) == len(x2)
    x1 = np.array(x1)
    x2 = np.array(x2)
    diff = x1 - x2
    # Element-wise square (transposing a 1-D array has no effect, so it is omitted)
    squared = diff * diff
    summed = np.sum(squared)
    norm = np.sqrt(summed)
    return norm
In other words, you are just taking the dot product of the difference vector with itself, then the square root of that:
def euclidean_dist4(x1, x2):
    assert len(x1) == len(x2)
    x1 = np.array(x1)
    x2 = np.array(x2)
    diff = x1 - x2
    dot = np.dot(diff, diff)
    norm = np.sqrt(dot)
    return norm
These are all different ways of achieving the same thing.
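The functions above speed up a single distance computation, but classify still calls them once per training row from a Python loop. As a possible next step, here is a rough sketch that computes every distance in one go with broadcasting. It is not from the original answer: classify_vectorized and the num_classes parameter are names invented here, and it assumes train_data is a 2-D NumPy array with the features first and an integer label from 0 to 9 in the last column (as candidates = [0] * 10 in the question suggests).

```python
import numpy as np

def classify_vectorized(k, train_data, target, num_classes=10):
    # Assumption: feature columns first, integer class label in the last column
    train_data = np.asarray(train_data)
    features = train_data[:, :-1].astype(float)
    labels = train_data[:, -1].astype(int)
    # Use only as many target entries as there are feature columns
    target_features = np.asarray(target, dtype=float)[:features.shape[1]]

    # Broadcasting subtracts the target from every training row at once
    diffs = features - target_features
    # Squared distance per row; the square root is skipped since it does not change the ordering
    sq_dists = np.sum(diffs * diffs, axis=1)

    # Indices of the k closest rows (a stable sort keeps the original order on ties)
    nearest = np.argsort(sq_dists, kind="stable")[:k]
    # Count votes per class and return the class with the most votes
    votes = np.bincount(labels[nearest], minlength=num_classes)
    return int(np.argmax(votes))

# Tiny made-up data set: two features plus a label column
train = np.array([[0.0, 0.0, 0],
                  [0.1, 0.0, 0],
                  [5.0, 5.0, 1],
                  [5.1, 5.0, 1],
                  [4.9, 5.0, 1]])
print(classify_vectorized(3, train, [0.05, 0.0]))
```

For large training sets, np.argpartition could replace the full sort, since only the k smallest distances are needed; the full sort is used here to keep the tie-breaking the same as in the question's code.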