Question

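I am writing a function that calculates the distance between training instances and a test instance. The distance is a modified Manhattan distance. My working code becomes too slow as the number of features (columns) grows. Any idea how I could speed it up?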

import operator  # needed for operator.itemgetter below
import pandas as pd
import numpy as np
import time
import datetime


def make_random_distance():
    """Generates randomly populated pandas dataframe of a training dataset and a test dataset and calculates and ret"""
    df=pd.DataFrame(np.random.randint(0,50,size=(10000,1024)))
    print(df.shape)

    #Test dataset
    test=pd.DataFrame(np.random.randint(0,50,size=(1,1024)))


    Calculated_Distances=[]
    #For each test instance
    for ind,roll in test.iterrows():
        print("Working on test instance {}".format(ind))
        #print(ind,roll.values)
        Test_inst = np.array(roll.values) #Features of test instance
        #Dist = custom_distance_b(Test_inst, df)
        Dist = custom_distance(Test_inst, df)
        print("Done calculating distances")

        print("Now sorting dictionary")
        sorted_d = sorted(Dist.items(), key=operator.itemgetter(1))

        # Now we examine the 5NN
        for j in range(5):
            index_com = sorted_d[j][0]
            calc_dist = sorted_d[j][1]
            Calculated_Distances.append([ind, index_com, calc_dist])

    #writes out results
    Calc_Dist=pd.DataFrame(Calculated_Distances,columns=['Test_indx','Training_indx','Distance'])
    #Calc_Dist.to_csv("/home/Code/testing_distances.csv",sep=',',index=False)
    print(Calc_Dist)

    return


def custom_distance(i,df):
    """
    :param i: test instance vector
    :param df: training instances pandas data frame
    :return:
    """

    # First calculate the standard deviation of each column (feature/descriptor)
    count_ind = 0
    stad_dev = {}
    for column in df:
        stad_dev[count_ind] = df.iloc[:, column].std(axis=0)
        count_ind+=1

    Dist={}
    for index,row in df.iterrows():
        temp_dist=0
        for j in range(len(row)):
            dist=float(abs(row[j]-i[j])/(5*stad_dev[j]))
            temp_dist+=min(dist,1.0)
        #print(index,i.values,row.values,temp_dist)
        Dist[index]=round(temp_dist,3)


    return Dist


if __name__=="__main__":
    T1=time.time()
    make_random_distance()
    T2=time.time()
    t=T2-T1
    print("Took {} seconds".format(t))
    print("Took {}".format(str(datetime.timedelta(seconds=t))))

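On my machine, the current code computes the distances from a single test instance to 10000 training instances with 1024 features/columns and retrieves the 5 nearest neighbors. The output is:

Took 128.5559959411621 seconds
Took 0:02:08.555996

Any ideas for speeding this up? I will need to run thousands of such calculations over the test set.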

Answer 1

You can reduce the time spent finding the top 5 by using a min-heap instead of a full sort; see "Algorithm to find k smallest numbers in array of n items".
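As a minimal sketch of the heap idea, Python's standard `heapq.nsmallest` can pick the k smallest entries without sorting the whole dictionary. The helper name `k_nearest` and the toy dictionary are made up for illustration; the dictionary stands in for the `Dist` result returned by `custom_distance`.

```python
import heapq


def k_nearest(dist, k=5):
    """Return the k (index, distance) pairs with the smallest distance."""
    # heapq.nsmallest keeps only k candidates at a time, so it costs
    # roughly O(n log k) instead of the O(n log n) of a full sort.
    return heapq.nsmallest(k, dist.items(), key=lambda item: item[1])


# Toy dictionary standing in for the output of custom_distance (hypothetical values)
example = {0: 3.2, 1: 0.5, 2: 7.1, 3: 1.9, 4: 0.1, 5: 4.4, 6: 2.8}
print(k_nearest(example, k=3))  # [(4, 0.1), (1, 0.5), (3, 1.9)]
```

With only 10000 distances, the sort is probably a small share of the total runtime, so this change on its own is unlikely to remove the bottleneck in the nested distance loop.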

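Another thing to consider is that your custom distance is essentially based on the per-column stdev, which will not vary much because you have so many samples. With this data your stdev will almost never stray outside the 14-15 range. This means that, if you want, you can keep all the values in a single array, track where the test values fall in that array, move up and down from those positions to find the closest candidates, and only then run the custom distance function on those candidates, with an extremely high chance of finding the true nearest neighbors. This would change your runtime from O(n^3) to O(n log n).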
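Separately from the approximation described above, the exact distance can also be computed without Python-level loops. The following is a sketch of that alternative, not the sorted-array approach: it assumes the training data fits in memory as one float array and that no column has zero standard deviation. The names `custom_distance_vectorized` and `k_nearest_rows` are hypothetical, and the sample sizes are reduced so the example runs quickly.

```python
import numpy as np
import pandas as pd


def custom_distance_vectorized(test_vec, df):
    """Capped, scaled Manhattan distance from test_vec to every row of df."""
    train = df.to_numpy(dtype=float)
    # ddof=1 matches the default of pandas' .std() used in the question.
    # Assumption: no column is constant, otherwise this divides by zero.
    std = train.std(axis=0, ddof=1)
    scaled = np.abs(train - test_vec) / (5.0 * std)
    # Cap each per-feature term at 1.0, then sum over the features.
    return np.minimum(scaled, 1.0).sum(axis=1)


def k_nearest_rows(distances, k=5):
    """Row positions of the k smallest distances, ordered by distance."""
    idx = np.argpartition(distances, k - 1)[:k]  # unordered k smallest
    return idx[np.argsort(distances[idx])]


rng = np.random.default_rng(0)
train_df = pd.DataFrame(rng.integers(0, 50, size=(1000, 64)))
test_vec = rng.integers(0, 50, size=64)

dist = custom_distance_vectorized(test_vec, train_df)
nearest = k_nearest_rows(dist, k=5)
print(nearest, np.round(dist[nearest], 3))
```

Unlike the original function, this returns unrounded distances as an array indexed by row position; apply `np.round(dist, 3)` if the rounded values are needed. For thousands of test instances, the column standard deviations could be computed once and reused rather than recomputed per call.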

Calculating a custom distance between n-dimensional instances