How to vectorize array operations in Python

Time: 2016-02-02 06:54:02

Tags: python machine-learning

I wrote a k-nearest-neighbors classifier in Python. My problem is that the array operations take too long.

import math
import pdb

def classify(k, train_data, target):
    num_rows = train_data.shape[0]
    num_cols = train_data.shape[1]
    distances = []
    candidates = [0] * 10

    for i, row in enumerate(train_data):
        dist = euclidean_dist(target[:num_cols - 1], row[:num_cols - 1]) #slow
        distances.append((dist, row[num_cols - 1]))

    distances.sort(key=lambda tup: tup[0])
    distances = distances[:k]

    for i, d in enumerate(distances):
        candidates[d[1]] += 1

    return candidates.index(max(candidates))

def euclidean_dist(x1, x2):
    assert(len(x1) == len(x2))
    result = 0

    pdb.set_trace()
    for i in range(len(x1)): #culprit, x1 and x2 are both 256 length lists
        result += math.pow(x1[i] - x2[i], 2)
    result = math.sqrt(result)

    return result

I have added comments in the code above to mark where the slowdown occurs. Any suggestions for making it faster are welcome.

1 answer:

Answer 0: (score: 2)

It looks like you just want the Euclidean distance (the 2-norm), which you can get very efficiently with numpy (imported as np):

def euclidean_dist2(x1, x2):
    assert(len(x1) == len(x2))

    x1 = np.array(x1)
    x2 = np.array(x2)

    norm = np.linalg.norm(x1-x2)

    return norm

print(euclidean_dist2([1, 2], [4, 7]))

This gives you 5.83095189485, the same result as your function, but vectorized.
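As a quick sanity check, the sketch below compares a plain-Python loop (a condensed stand-in for the `euclidean_dist` loop in the question) with `np.linalg.norm` on two 256-element vectors. The name `loop_dist` and the random test data are assumptions made for illustration; they are not part of the original code.

```python
import math
import numpy as np

def loop_dist(x1, x2):
    # Hypothetical condensed version of the element-by-element loop from the question
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

# Illustrative test data: two random vectors of length 256
rng = np.random.default_rng(0)
x1 = rng.random(256)
x2 = rng.random(256)

loop_result = loop_dist(x1, x2)
norm_result = float(np.linalg.norm(x1 - x2))

# The two results should agree up to floating-point rounding
print(np.isclose(loop_result, norm_result))
```

The vectorized call does the subtraction, squaring and summation inside numpy's compiled routines, which is where the speedup over the Python-level loop is expected to come from.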

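Breaking it down, you are just taking the element-wise difference, multiplying the resulting vector by itself (squaring it), summing the result, and then taking the square root of the sum: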

def euclidean_dist3(x1, x2):
    assert(len(x1) == len(x2))

    x1 = np.array(x1)
    x2 = np.array(x2)

    diff = x1 - x2

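    # np.transpose is a no-op on a 1-D array, so this is an element-wise square
    squared = np.transpose(diff) * diff

    # np.sum keeps the summation inside numpy instead of a Python-level loop
    summed = np.sum(squared)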

    norm = np.sqrt(summed)

    return norm

In other words, you are just taking the dot product of the difference vector with itself (and then its square root):

def euclidean_dist4(x1, x2):
    assert(len(x1) == len(x2))

    x1 = np.array(x1)
    x2 = np.array(x2)

    diff = x1 - x2

    dot = np.dot(diff, diff)

    norm = np.sqrt(dot)

    return norm

These are all different ways of achieving the same result.
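The functions above still have to be called once per training row. As a minimal sketch of going one step further, the distances to all rows can be computed in a single call by using broadcasting and the `axis` argument of `np.linalg.norm`. The function name `classify_vectorized` is hypothetical, and the sketch assumes, as in the question's `classify`, that `train_data` is a 2-D numpy array whose last column holds a non-negative integer class label.

```python
import numpy as np

def classify_vectorized(k, train_data, target):
    # Assumption: last column of train_data is a non-negative integer label,
    # all other columns are features (same layout as in the question).
    features = train_data[:, :-1]
    labels = train_data[:, -1].astype(int)

    # Use only as many entries of target as there are feature columns
    point = np.asarray(target)[:features.shape[1]]

    # Broadcasting subtracts point from every row; axis=1 yields one distance per row
    distances = np.linalg.norm(features - point, axis=1)

    # Indices of the k nearest rows; a stable sort keeps ties in original row order
    nearest = np.argsort(distances, kind="stable")[:k]

    # Count the votes per label and return the label with the most votes
    return int(np.bincount(labels[nearest]).argmax())

# Small made-up data set: two rows of class 0, three rows of class 1
train = np.array([
    [0.0, 0.0, 0],
    [0.0, 1.0, 0],
    [5.0, 5.0, 1],
    [5.0, 6.0, 1],
    [6.0, 5.0, 1],
])
print(classify_vectorized(3, train, np.array([5.0, 5.0, 0.0])))
```

On ties, `np.bincount(...).argmax()` returns the smallest label with the maximum count, which matches the behaviour of `candidates.index(max(candidates))` in the question.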