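I am writing a function that computes the distances between training instances and test instances. The distance is a modified Manhattan distance. My code works, but it becomes too slow as the number of features (columns) grows. Any idea how I can speed it up?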
import pandas as pd
import numpy as np
import operator  # needed for operator.itemgetter when sorting the distances
import time
import datetime

def make_random_distance():
    """Generates randomly populated pandas dataframe of a training dataset and a test dataset and calculates and ret"""
    # Training dataset
    df = pd.DataFrame(np.random.randint(0, 50, size=(10000, 1024)))
    print(df.shape)
    # Test dataset
    test = pd.DataFrame(np.random.randint(0, 50, size=(1, 1024)))
    Calculated_Distances = []
    # For each test instance
    for ind, roll in test.iterrows():
        print("Working on test instance {}".format(ind))
        #print(ind,roll.values)
        Test_inst = np.array(roll.values)  # Features of test instance
        #Dist = custom_distance_b(Test_inst, df)
        Dist = custom_distance(Test_inst, df)
        print("Done calculating distances")
        print("Now sorting dictionary")
        sorted_d = sorted(Dist.items(), key=operator.itemgetter(1))
        # Now we examine the 5NN
        for j in range(5):
            index_com = sorted_d[j][0]
            calc_dist = sorted_d[j][1]
            Calculated_Distances.append([ind, index_com, calc_dist])
    # Writes out results
    Calc_Dist = pd.DataFrame(Calculated_Distances, columns=['Test_indx', 'Training_indx', 'Distance'])
    #Calc_Dist.to_csv("/home/Code/testing_distances.csv",sep=',',index=False)
    print(Calc_Dist)
    return

def custom_distance(i, df):
    """
    :param i: test instance vector
    :param df: training instances pandas data frame
    :return: dictionary mapping each training index to its distance from i
    """
    # First calculate the standard deviation for each column (feature)
    count_ind = 0
    stad_dev = {}
    for column in df:
        stad_dev[count_ind] = df.iloc[:, column].std(axis=0)
        count_ind += 1
    Dist = {}
    for index, row in df.iterrows():
        temp_dist = 0
        for j in range(len(row)):
            # Per-feature difference scaled by 5 standard deviations, capped at 1.0
            dist = float(abs(row[j] - i[j]) / (5 * stad_dev[j]))
            temp_dist += min(dist, 1.0)
        #print(index,i.values,row.values,temp_dist)
        Dist[index] = round(temp_dist, 3)
    return Dist

if __name__ == "__main__":
    T1 = time.time()
    make_random_distance()
    T2 = time.time()
    t = T2 - T1
    print("Took {} seconds".format(t))
    print("Took {}".format(str(datetime.timedelta(seconds=t))))
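On my machine, the current code computes the distances from a single test instance to 10000 training instances with 1024 features/columns and retrieves the 5 nearest neighbours.
Took 128.5559959411621 seconds
Took 0:02:08.555996
Any ideas on how to speed this up? I will need to run thousands of these calculations over the test set.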
Answer 0 (score: 0)
You can reduce the time spent finding the top 5 by using a min-heap instead of a full sort; see "Algorithm to find k smallest numbers in array of n items".
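As a minimal sketch of the heap idea, assuming the distances are held in a dictionary shaped like the `Dist` returned by `custom_distance`: the standard library's `heapq.nsmallest` can replace the `sorted(...)` call. The helper name `k_nearest` and the toy dictionary below are made up for illustration.

```python
import heapq
import operator


def k_nearest(dist, k=5):
    """Return the k (index, distance) pairs with the smallest distance."""
    # heapq.nsmallest only tracks k candidates at a time, so it costs roughly
    # O(n log k) rather than the O(n log n) of sorting every entry.
    return heapq.nsmallest(k, dist.items(), key=operator.itemgetter(1))


# Toy stand-in for the Dist dictionary (hypothetical values)
example = {0: 7.2, 1: 0.4, 2: 3.9, 3: 1.1, 4: 9.8, 5: 2.5, 6: 0.9}
nearest = k_nearest(example, k=3)
print(nearest)
```

With only 10000 entries the sort is probably a small share of the total runtime, so this mainly helps once the distance calculation itself has been sped up.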
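Another thing worth considering: your custom distance is essentially scaled by each column's stdev, and that will not vary much because you have so many samples. Your stdev will almost never stray outside 14-15. That means that, if you wish, you could keep all the values in a single array, track where the test value sits in that array, move up and down from that position to find the closest distances, and then run the custom distance function on only those candidates, with an extremely high likelihood of success. This would change your running time from O(n^3) to O(n log n).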
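The stdev observation can be checked with a small sketch. It assumes the columns are filled with uniform random integers from 0 to 49, as in the question's `np.random.randint(0, 50, ...)` call; real data need not behave this way. It uses only the standard library, and the column count of 20 is an arbitrary choice to keep it quick.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable
n_rows, n_cols = 10000, 20  # 20 columns is an arbitrary, smaller stand-in for 1024

col_stdevs = []
for _ in range(n_cols):
    # One column of uniform integers in [0, 50), mirroring the question's data
    column = [random.randrange(0, 50) for _ in range(n_rows)]
    col_stdevs.append(statistics.stdev(column))

# Standard deviation of a discrete uniform distribution over 50 consecutive integers
theoretical = ((50 ** 2 - 1) / 12) ** 0.5

print(min(col_stdevs), max(col_stdevs), theoretical)
```

Every column's sample stdev should land close to the theoretical value of about 14.43, which is why the per-feature scaling factor `5 * stdev` is nearly constant for this particular random dataset.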