Optimizing nested for loops for execution time

Time: 2017-05-06 19:31:16

Tags: python pandas

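I want to optimize my code with respect to execution time. The code runs on a DataFrame alldata with about 300,000 entries, but the computation takes a very long time (around 10 hours).

The logic of the computation is as follows:

For every missing (NaN) value in the DataFrame columns listed in list_of_NA_features, the function fill_missing_values searches for the most similar row (cosine similarity computed over the columns in list_of_non_nan_features, which are never null) and returns that row's value for the current column. The returned value is then written into alldata at the current row and column.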

import pandas as pd
from scipy import spatial

def fill_missing_values(param_nan, current_row, df):
    # Candidate rows: those where the target column is not NaN
    df_non_nan = df.dropna(subset=[param_nan])
    # Columns that never contain NaN, used to compute the similarity
    list_of_non_nan_features = ["f1", "f2", "f3", "f4", "f5"]
    max_val = 0
    searched_val = 0
    vector1 = current_row[list_of_non_nan_features].values
    for index, row in df_non_nan.iterrows():
        vector2 = row[list_of_non_nan_features].values
        # Cosine similarity = 1 - cosine distance
        sim = 1 - spatial.distance.cosine(vector1, vector2)
        if sim > max_val:
            max_val = sim
            searched_val = row[param_nan]
    return searched_val


list_of_NA_features = df_train.columns[df_train.isnull().any()]


for feature in list_of_NA_features:
    for index, row in alldata.iterrows():
        if pd.isnull(row[feature]):
            missing_value = fill_missing_values(feature, row, alldata)
            # .loc replaces the deprecated .ix indexer
            alldata.loc[index, feature] = missing_value

Is it possible to optimize this code? For example, I am thinking about replacing the for loops with lambda functions. Is that possible?

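1 answer:

Answer 0: (score: 1)

Don't replace your for loops with lambdas; try replacing them with ufuncs instead.

Losing Your Loops: Fast Numerical Computation with Numpy is an excellent talk by Jake VanderPlas on this topic. Using universal functions and broadcasting instead of for loops can speed up your code significantly.

Here is a basic example: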

import numpy as np
from time import time

def timed(func):
    def inner(*args, **kwargs):
        t0 = time()
        result = func(*args, **kwargs)
        elapsed = time()-t0
        print(f'ran {func.__name__} in {elapsed} seconds')
        return result
    return inner
# without broadcasting:

@timed
def sums():
    sums = np.zeros([500, 500])
    for a in range(500):
        for b in range(500):
            sums[a, b] = a+b
    return sums

@timed
def sums_broadcasted(): 
    a = np.arange(500)
    b = np.reshape(np.arange(500), [500, 1])
    return a+b

Input:

a = sums()
b = sums_broadcasted()
assert (a == b).all()

Output:

ran sums in 0.030008554458618164 seconds
ran sums_broadcasted in 0.0005011558532714844 seconds

Note that by eliminating the loops we get roughly a 60x speedup!
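The same idea can be applied to the question's nearest-row search. The following is only a sketch under stated assumptions, not the asker's method: the names fill_missing_by_cosine and chunk_size are hypothetical, and the feature columns are assumed to be numeric and never NaN. After normalising each row to unit length, the cosine similarity between every query row and every donor row is a single matrix product, so the inner iterrows loop disappears. Unlike the original, this version always takes the best match (the original keeps 0 when no similarity exceeds 0), and all-zero rows are given similarity 0 instead of NaN.

```python
import numpy as np
import pandas as pd

def fill_missing_by_cosine(df, target, features, chunk_size=256):
    """Hypothetical helper: fill NaNs in df[target] with the value taken
    from the most cosine-similar row whose target is not NaN."""
    values = df[target].to_numpy(copy=True)
    missing = df[target].isna().to_numpy()
    if not missing.any() or missing.all():
        # Nothing to fill, or no donor rows to copy from
        return pd.Series(values, index=df.index, name=target)

    # Normalise rows so a dot product equals the cosine similarity
    X = df[features].to_numpy(dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # assumption: treat all-zero rows as similarity 0
    X = X / norms

    donors = X[~missing]
    donor_values = values[~missing]
    queries = X[missing]

    filled = np.empty(len(queries), dtype=donor_values.dtype)
    # Process queries in chunks so the similarity matrix
    # (chunk_size x number of donors) stays small enough for memory
    for start in range(0, len(queries), chunk_size):
        sims = queries[start:start + chunk_size] @ donors.T
        filled[start:start + chunk_size] = donor_values[sims.argmax(axis=1)]

    values[missing] = filled
    return pd.Series(values, index=df.index, name=target)
```

It would be called once per column, for example alldata[feature] = fill_missing_by_cosine(alldata, feature, ["f1", "f2", "f3", "f4", "f5"]) inside the loop over list_of_NA_features. The remaining Python loop runs once per chunk instead of once per pair of rows; each chunk needs about chunk_size x number of donors x 8 bytes, so chunk_size may need lowering for very large frames.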