I want to optimize my code for execution time. It runs on a dataframe alldata with roughly 300,000 entries, but the computation takes very long (about 10 hours).

The logic of the computation is as follows: for every missing (NaN) value in the dataframe columns listed in list_of_NA_features, the function fill_missing_values searches for the most similar row (cosine similarity computed over the columns in list_of_non_nan_features, which are never null) and returns that row's value for the current column of alldata.
import pandas as pd
from scipy import spatial

def fill_missing_values(param_nan, current_row, df):
    # Rows where the target column is known; candidates to copy a value from.
    df_non_nan = df.dropna(subset=[param_nan])
    list_of_non_nan_features = ["f1", "f2", "f3", "f4", "f5"]
    max_val = 0
    searched_val = 0
    vector1 = current_row[list_of_non_nan_features].values
    for index, row in df_non_nan.iterrows():
        vector2 = row[list_of_non_nan_features].values
        sim = 1 - spatial.distance.cosine(vector1, vector2)
        if sim > max_val:
            max_val = sim
            searched_val = row[param_nan]
    return searched_val
list_of_NA_features = df_train.columns[df_train.isnull().any()]

for feature in list_of_NA_features:
    for index, row in alldata.iterrows():
        if pd.isnull(row[feature]):
            missing_value = fill_missing_values(feature, row, alldata)
            alldata.loc[index, feature] = missing_value
Can the code be optimized? For example, I was thinking of replacing the for loops with a lambda function. Is that possible?
Answer 0 (score: 1)
Don't replace your for loops with lambdas; instead, try replacing them with ufuncs.

Losing Your Loops: Fast Numerical Computation with NumPy is an excellent talk by Jake VanderPlas on this topic. Using universal functions and broadcasting instead of for loops can speed up your code significantly.

Here is a basic example:
import numpy as np
from time import time

def timed(func):
    # Report how long each call to the decorated function takes.
    def inner(*args, **kwargs):
        t0 = time()
        result = func(*args, **kwargs)
        elapsed = time() - t0
        print(f'ran {func.__name__} in {elapsed} seconds')
        return result
    return inner

# without broadcasting:
@timed
def sums():
    sums = np.zeros([500, 500])
    for a in range(500):
        for b in range(500):
            sums[a, b] = a + b
    return sums

# with broadcasting:
@timed
def sums_broadcasted():
    a = np.arange(500)
    b = np.reshape(np.arange(500), [500, 1])
    return a + b
Input:

a = sums()
b = sums_broadcasted()
assert (a == b).all()
Output:
ran sums in 0.030008554458618164 seconds
ran sums_broadcasted in 0.0005011558532714844 seconds
Note that we get a 60x speedup just by eliminating the loops!
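
The same idea applies to the original problem: rather than calling spatial.distance.cosine once per row inside iterrows, all cosine similarities for a column can be computed with a single matrix product. Below is a minimal sketch of that approach (the name fill_missing_values_vectorized is made up for illustration); it assumes the feature columns f1..f5 are numeric, never NaN, and never all zeros in a row:

import numpy as np
import pandas as pd

def fill_missing_values_vectorized(param_nan, df, feature_cols=("f1", "f2", "f3", "f4", "f5")):
    # Sketch only: assumes feature_cols are numeric, never NaN, and contain no all-zero rows.
    feature_cols = list(feature_cols)
    donors = df.dropna(subset=[param_nan])     # rows that already have a value for param_nan
    targets = df[df[param_nan].isnull()]       # rows whose param_nan needs filling
    if donors.empty or targets.empty:
        return df
    A = targets[feature_cols].to_numpy(dtype=float)  # shape (n_targets, n_features)
    B = donors[feature_cols].to_numpy(dtype=float)   # shape (n_donors, n_features)
    # Normalize the rows; one matrix product then gives every pairwise cosine similarity.
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    sims = A @ B.T                                   # shape (n_targets, n_donors)
    best = sims.argmax(axis=1)                       # index of the most similar donor per target
    df.loc[targets.index, param_nan] = donors[param_nan].to_numpy()[best]
    return df

for feature in list_of_NA_features:
    fill_missing_values_vectorized(feature, alldata)

One caveat: with 300,000 rows the sims matrix (n_targets x n_donors floats) can get large, so if memory becomes an issue the target rows can be processed in chunks; each chunked matrix product is still far faster than a Python-level iterrows loop.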