Question

I'm developing a KNN classifier in Python, but I've run into a performance problem. The code below takes 7.5s-9.0s to complete, and I will need to run it 60,000 times.

        for fold in folds:
            for dot2 in fold:
                # distances[x][0] = class of dot2
                # distances[x][1] = distance between dot1 and dot2
                distances.append([dot2[0], calc_distance(dot1[1:], dot2[1:], method)])

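The "folds" variable is a list of 10 folds that together contain 60,000 image inputs in .csv format. The first value of each dot is the class it belongs to. All values are integers. Is there a way to make this line run faster?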

Here is the calc_distance function:

    import math

    def calc_distance(dot1, dot2, distance):
        if distance == "manhanttan":
            total = 0
            # for each coord, take the absolute difference
            for x in range(0, len(dot1)):
                total = total + abs(dot1[x] - dot2[x])
            return total

        elif distance == "euclidiana":
            total = 0
            # sum of squared differences, then the square root
            for x in range(0, len(dot1)):
                total = total + (dot1[x] - dot2[x])**2
            return math.sqrt(total)

        elif distance == "supremum":
            total = 0
            # largest absolute difference over all coords
            for x in range(0, len(dot1)):
                if abs(dot1[x] - dot2[x]) > total:
                    total = abs(dot1[x] - dot2[x])
            return total

        elif distance == "cosseno":
            # note: this is the cosine similarity, so larger means closer
            dist = 0
            p1_p2_mul = 0
            p1_sum = 0
            p2_sum = 0
            for x in range(0, len(dot1)):
                p1_p2_mul = p1_p2_mul + dot1[x]*dot2[x]
                p1_sum = p1_sum + dot1[x]**2
                p2_sum = p2_sum + dot2[x]**2
            p1_sum = math.sqrt(p1_sum)
            p2_sum = math.sqrt(p2_sum)
            quociente = p1_sum*p2_sum
            dist = p1_p2_mul/quociente

            return dist

EDIT: I found a way to make it faster, at least for the "manhanttan" method. Instead of:

    if distance == "manhanttan":
        total = 0
        # for each coord, take the absolute difference
        for x in range(0, len(dot1)):
            total = total + abs(dot1[x] - dot2[x])
        return total

I put:

    if distance == "manhanttan":
        totalp1 = 0
        totalp2 = 0
        # sum each point's coords, then take one absolute difference
        for x in range(0, len(dot1)):
            totalp1 += dot1[x]
            totalp2 += dot2[x]

        return abs(totalp1-totalp2)

The abs() call is very heavy.
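
One caution about the edit above: the rewritten version is not equivalent to the original. Summing each point's coordinates first and then taking one absolute difference gives |sum(dot1) - sum(dot2)|, which is only a lower bound on the Manhattan distance sum(|dot1[x] - dot2[x]|). A small check with made-up points (the function names are illustrative):

```python
def manhattan(dot1, dot2):
    # Sum of per-coordinate absolute differences (the original loop).
    return sum(abs(a - b) for a, b in zip(dot1, dot2))

def abs_of_sums(dot1, dot2):
    # What the edited version computes: one abs() on the difference of totals.
    return abs(sum(dot1) - sum(dot2))

# Hypothetical points chosen so the per-coordinate differences cancel out.
p = [0, 10]
q = [10, 0]

print(manhattan(p, q))    # 20
print(abs_of_sums(p, q))  # 0
```

So the speedup comes at the cost of computing a different quantity, which can change which neighbours are selected.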

Answer 1

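There are many guides to "profiling" Python; you should search for some, read them, and walk through the profiling process to make sure you know which parts of your work take the most time.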

But if this really is the core of your work, it's a fair bet that calc_distance is where most of the running time is being consumed.

Optimizing that deeply will probably require NumPy-accelerated math or a similar lower-level approach.
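
As a rough sketch of the NumPy route, assuming each fold can be loaded into a 2-D integer array whose first column is the class label, all distances from one point to a whole fold can be computed without a Python-level loop. The helper name `distances_to_fold` and the sample data are illustrative, not from the question:

```python
import numpy as np

def distances_to_fold(dot1, fold, method):
    """Return (labels, distances) from dot1 to every row of fold.

    Assumes column 0 holds the class label and the rest are coordinates.
    """
    fold = np.asarray(fold, dtype=np.int64)   # int64 avoids overflow when squaring
    dot1 = np.asarray(dot1, dtype=np.int64)
    labels = fold[:, 0]
    diff = fold[:, 1:] - dot1[1:]             # broadcast over all rows at once
    if method == "manhanttan":                # spelling kept from the question
        dist = np.abs(diff).sum(axis=1)
    elif method == "euclidiana":
        dist = np.sqrt((diff ** 2).sum(axis=1))
    elif method == "supremum":
        dist = np.abs(diff).max(axis=1)
    else:
        raise ValueError(f"unknown method: {method}")
    return labels, dist

fold = [[1, 0, 0], [2, 3, 4]]   # made-up data: label, x, y
dot1 = [9, 0, 0]
labels, dist = distances_to_fold(dot1, fold, "euclidiana")
print(labels, dist)
```

Converting the folds to arrays once, up front, matters here: converting inside the loop would give back much of the time saved.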

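As a quick and dirty approach that requires less invasive profiling and rewriting, try installing the PyPy implementation of Python and running under it. I have seen speedups of 2x or more compared to the standard (CPython) implementation.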

Answer 2

I'm confused. Have you tried a profiler?

    python -m cProfile myscript.py

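It will show you where most of the time is being consumed and give you hard data to work with, e.g. refactoring to reduce the number of calls, restructuring the input data, substituting one function for another, and so on.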

https://docs.python.org/3/library/profile.html
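
The same profiler can also be driven from inside a script, which makes it easier to profile only the distance loop. A minimal sketch, where `slow_loop` is a stand-in for the real work:

```python
import cProfile
import io
import pstats

def slow_loop():
    # Stand-in for the real distance computation.
    total = 0
    for i in range(100_000):
        total += abs(i - 50_000)
    return total

profiler = cProfile.Profile()
profiler.enable()
result = slow_loop()
profiler.disable()

# Collect the ten entries with the highest cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print(report)
```

The report lists call counts and time per function, so a function called millions of times with a small per-call cost shows up clearly.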

Answer 3

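First, you should avoid using a single calc_distance function that performs a linear search through a list of strings on every call. Define independent distance functions and call the right one. As Lee Daniel Crocker suggested, don't use slicing; just start your loop range at 1.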
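
A minimal sketch of that idea: one function per metric, selected once outside the loop, with each loop starting at index 1 so the class label is skipped without slicing. The function names, the `DISTANCES` dictionary, and the sample data are illustrative:

```python
import math

def manhattan(dot1, dot2):
    # Index 0 is the class label, so start at 1 instead of slicing.
    total = 0
    for x in range(1, len(dot1)):
        total += abs(dot1[x] - dot2[x])
    return total

def euclidean(dot1, dot2):
    total = 0
    for x in range(1, len(dot1)):
        total += (dot1[x] - dot2[x]) ** 2
    return math.sqrt(total)

def supremum(dot1, dot2):
    return max(abs(dot1[x] - dot2[x]) for x in range(1, len(dot1)))

# Keys keep the spelling used in the question.
DISTANCES = {"manhanttan": manhattan, "euclidiana": euclidean, "supremum": supremum}

def all_distances(dot1, folds, method):
    dist_fn = DISTANCES[method]   # looked up once, not once per pair
    return [[dot2[0], dist_fn(dot1, dot2)] for fold in folds for dot2 in fold]

folds = [[[1, 0, 0], [2, 3, 4]], [[1, 1, 1]]]   # made-up data: label, x, y
print(all_distances([0, 0, 0], folds, "euclidiana"))
```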

For the cosine distance, I'd recommend normalizing all dot vectors once. That way, the distance computation reduces to a dot product.
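
One way this could look, assuming the same layout as in the question (class label at index 0). The helper names and the sample points are illustrative:

```python
import math

def normalize(dot):
    """Return [label, unit-length coordinates...]; index 0 is the class label."""
    norm = math.sqrt(sum(v * v for v in dot[1:]))
    if norm == 0:
        return list(dot)   # leave an all-zero vector unchanged
    return [dot[0]] + [v / norm for v in dot[1:]]

def cosine_similarity_normalized(dot1, dot2):
    # For unit vectors the cosine is just the dot product.
    return sum(dot1[x] * dot2[x] for x in range(1, len(dot1)))

points = [[1, 3, 4], [2, 6, 8], [1, 4, -3]]    # made-up data
unit_points = [normalize(p) for p in points]   # done once, up front

same_direction = cosine_similarity_normalized(unit_points[0], unit_points[1])
orthogonal = cosine_similarity_normalized(unit_points[0], unit_points[2])
print(same_direction, orthogonal)
```

The two square roots and the division disappear from the inner loop; they are paid once per point instead of once per pair.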

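These micro-optimizations can give you some speedup. But a much better gain should come from switching to a better algorithm: a kNN classifier calls for a kD-tree, which lets you quickly remove a large fraction of the points from consideration.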

This is harder to implement (you have to adapt it slightly for the different distances; the cosine distance will make it tricky).
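
To make the pruning idea concrete, here is a small pure-Python kd-tree sketch. It assumes the Euclidean metric only and the same point layout as in the question (class label at index 0); all names and the sample data are illustrative. For the Manhattan and supremum metrics the distance to the splitting plane is still a valid lower bound, so the same pruning test could be adapted, while the cosine measure does not fit this scheme directly. kd-trees also tend to lose their advantage as the number of dimensions grows, so it is worth measuring on the real data.

```python
import heapq
import itertools

def build_kdtree(points, depth=0):
    """Build a kd-tree from points shaped [label, x1, x2, ...]."""
    if not points:
        return None
    dims = len(points[0]) - 1
    axis = 1 + depth % dims   # skip the label at index 0
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def squared_euclidean(a, b):
    return sum((a[i] - b[i]) ** 2 for i in range(1, len(a)))

def knn(tree, query, k):
    """Return the k nearest points as sorted (squared distance, point) pairs."""
    best = []                 # max-heap of (-squared distance, tie, point)
    tie = itertools.count()   # tie-breaker so points are never compared in the heap

    def visit(node):
        if node is None:
            return
        point, axis = node["point"], node["axis"]
        d2 = squared_euclidean(query, point)
        if len(best) < k:
            heapq.heappush(best, (-d2, next(tie), point))
        elif d2 < -best[0][0]:
            heapq.heapreplace(best, (-d2, next(tie), point))
        gap = query[axis] - point[axis]
        if gap < 0:
            near, far = node["left"], node["right"]
        else:
            near, far = node["right"], node["left"]
        visit(near)
        # Cross the splitting plane only if a closer point could lie beyond it.
        if len(best) < k or gap * gap < -best[0][0]:
            visit(far)

    visit(tree)
    return sorted((-neg_d2, point) for neg_d2, _, point in best)

points = [[0, 2, 3], [1, 5, 4], [0, 9, 6], [1, 4, 7], [0, 8, 1], [1, 7, 2]]  # made-up
tree = build_kdtree(points)
query = [-1, 9, 2]            # the label slot is unused for the query
neighbours = knn(tree, query, 3)
print(neighbours)
```

The class labels of the returned neighbours can then be counted to pick the majority class, as in the brute-force version.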

I need some help optimizing Python code
