Grouping algorithm speed issue: randomly swapping pandas rows is slow

Time: 2019-01-02 15:59:31

Tags: python pandas algorithm sorting

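I developed an application that helps with a common problem at my workplace, improving the quality of the output and saving a lot of time. Some staff are tasked with dividing groups of trainees (usually about 25 to 60 people in total) into several subgroups (3-5). We want these subgroups to be as well balanced as possible across several trainee attributes. For simplicity, the most important attributes are the pre-training quiz score, gender, and country of origin.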

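My application stores each subgroup as a pandas DataFrame, so with 30 trainees each subgroup DataFrame has shape (9,3), with the trainee attributes and a unique trainee identifier as columns. The grouping algorithm uses a utility function (with user-defined attribute weights) and random assignment to find balanced subgroups. In each iteration, a random member of each subgroup is swapped with a random member of the neighboring subgroup, and the utility function is evaluated to return a score. If a particular grouping still has the lowest score after a set number of consecutive swaps (200 in my example), that grouping is considered optimal. The whole process is repeated a specified number of times (e.g. 5) to find the best grouping and avoid getting stuck in local minima. Here is the code for the utility function and the swapping part of the algorithm: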

### Utility function
    def score_diff(first_df, second_df, third_df):
        # Desired scores are determined by the overall group; lower is better.
        # Sum each subgroup's weighted absolute deviation from the targets.
        return sum(
            w_1 * abs(df['Grade/10.00'].mean() - desired_testscore)
            + w_2 * abs(df['Female'].mean() - desired_genderbal)
            + w_3 * abs(df['Country'].nunique() - opt_countrybal)
            for df in (first_df, second_df, third_df)
        )

### Swapping
if num_groups == 3:
    for i in range(5): # repeat the algorithm 5 times to avoid local minima
        score_top = score_diff(pd_1,pd_2,pd_3)
        pd_1_old,pd_2_old,pd_3_old = pd_1.copy(),pd_2.copy(),pd_3.copy()
        pd_1.reset_index(drop=True,inplace=True)
        pd_2.reset_index(drop=True,inplace=True)
        pd_3.reset_index(drop=True,inplace=True)
        cont = True
        counter=0
        while cont:
            rand_num1 = np.random.randint(low = 0, high = len(pd_1))
            rand_num2 = np.random.randint(low = 0, high = len(pd_2))
            pd_1.loc[rand_num1,:], pd_2.loc[rand_num2,:] = pd_2.iloc[rand_num2,:].copy(), pd_1.iloc[rand_num1,:].copy()
            rand_num2 = np.random.randint(low = 0, high = len(pd_2))
            rand_num3 = np.random.randint(low = 0, high = len(pd_3))
            pd_2.iloc[rand_num2,:],pd_3.iloc[rand_num3,:] = pd_3.iloc[rand_num3,:].copy(), pd_2.iloc[rand_num2,:].copy()
            rand_num3 = np.random.randint(low = 0, high = len(pd_3))
            rand_num1 = np.random.randint(low = 0, high = len(pd_1))
            pd_3.iloc[rand_num3,:],pd_1.iloc[rand_num1,:] = pd_1.iloc[rand_num1,:].copy(), pd_3.iloc[rand_num3,:].copy()
            score_new = score_diff(pd_1,pd_2,pd_3)
            if score_new < score_top: 
                score_top = score_new
                pd_1_old,pd_2_old,pd_3_old = pd_1.copy(),pd_2.copy(),pd_3.copy()
                counter = 0 
            else: 
                counter+=1 
                if counter > sort_length:
                    pd_1_opt,pd_2_opt,pd_3_opt = pd_1_old.copy(),pd_2_old.copy(),pd_3_old.copy()
                    cont = False 
                else: 
                    continue
        if i != 0: 
            if score_diff(pd_1_old,pd_2_old,pd_3_old) < score_diff(pd_1_opt,pd_2_opt,pd_3_opt):
                pd_1_opt,pd_2_opt,pd_3_opt = pd_1_old.copy(),pd_2_old.copy(),pd_3_old.copy()
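
The utility function above depends on module-level names (`w_1`, `w_2`, `w_3`, `desired_testscore`, `desired_genderbal`, `opt_countrybal`) whose values are not shown in the post. As a minimal sketch for trying it in isolation, the toy data, the weights of 1.0, and the country target of 3 below are all assumptions made for illustration, not values from the application:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 9 trainees with the three attributes used above.
rng = np.random.default_rng(0)
trainees = pd.DataFrame({
    "ID": np.arange(9),
    "Grade/10.00": rng.uniform(0, 10, size=9).round(2),
    "Female": rng.integers(0, 2, size=9),
    "Country": rng.choice(["A", "B", "C", "D"], size=9),
})

# Assumed weights and targets; the post does not give their values.
w_1, w_2, w_3 = 1.0, 1.0, 1.0
desired_testscore = trainees["Grade/10.00"].mean()
desired_genderbal = trainees["Female"].mean()
opt_countrybal = 3  # assumed target number of distinct countries per subgroup


def score_diff(first_df, second_df, third_df):
    # Weighted absolute deviation of each subgroup from the targets, summed.
    return sum(
        w_1 * abs(df["Grade/10.00"].mean() - desired_testscore)
        + w_2 * abs(df["Female"].mean() - desired_genderbal)
        + w_3 * abs(df["Country"].nunique() - opt_countrybal)
        for df in (first_df, second_df, third_df)
    )


# Deal the trainees into three subgroups of three and score the split.
pd_1, pd_2, pd_3 = (trainees.iloc[i::3].reset_index(drop=True) for i in range(3))
initial_score = score_diff(pd_1, pd_2, pd_3)
print(initial_score)
```

A score of 0 would mean every subgroup matches all three targets exactly; the swapping loop tries to push the score toward that.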

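For my purposes the algorithm works, and the application returns balanced subgroups. The only problem is speed: the example shown here typically takes 20 seconds to a minute to run, depending on the number of trainees being sorted. That is not catastrophic, but given the scale of the problem I am sure it could run faster. Unsurprisingly, a quick check of the algorithm shows that most of the time is spent on the swaps between subgroups, shown again here: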

            rand_num1 = np.random.randint(low = 0, high = len(pd_1))
            rand_num2 = np.random.randint(low = 0, high = len(pd_2))
            pd_1.loc[rand_num1,:], pd_2.loc[rand_num2,:] = pd_2.iloc[rand_num2,:].copy(), pd_1.iloc[rand_num1,:].copy()
            rand_num2 = np.random.randint(low = 0, high = len(pd_2))
            rand_num3 = np.random.randint(low = 0, high = len(pd_3))
            pd_2.iloc[rand_num2,:],pd_3.iloc[rand_num3,:] = pd_3.iloc[rand_num3,:].copy(), pd_2.iloc[rand_num2,:].copy()
            rand_num3 = np.random.randint(low = 0, high = len(pd_3))
            rand_num1 = np.random.randint(low = 0, high = len(pd_1))
            pd_3.iloc[rand_num3,:],pd_1.iloc[rand_num1,:] = pd_1.iloc[rand_num1,:].copy(), pd_3.iloc[rand_num3,:].copy()
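
One way to check where the time goes is to time a single swap against a single column mean with the standard-library `timeit` module. This is only a sketch: the `make_group` helper, the 10-row toy subgroups, and the choice of 200 calls are assumptions for illustration, and the absolute numbers depend on the machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def make_group(offset, n=10):
    # Hypothetical subgroup with the same column names as in the post.
    return pd.DataFrame({
        "ID": np.arange(offset, offset + n),
        "Grade/10.00": np.linspace(0, 10, n),
        "Female": np.arange(n) % 2,
        "Country": ["A", "B"] * (n // 2),
    })


pd_1, pd_2 = make_group(0), make_group(100)


def swap_rows():
    # Same row-swap pattern as in the loop above, for one pair of subgroups.
    a = np.random.randint(0, len(pd_1))
    b = np.random.randint(0, len(pd_2))
    pd_1.iloc[a, :], pd_2.iloc[b, :] = pd_2.iloc[b, :].copy(), pd_1.iloc[a, :].copy()


def score_one():
    # One of the column statistics the utility function computes.
    return pd_1["Grade/10.00"].mean()


n_calls = 200
swap_time = timeit.timeit(swap_rows, number=n_calls)
score_time = timeit.timeit(score_one, number=n_calls)
print(f"{n_calls} swaps:  {swap_time:.4f} s")
print(f"{n_calls} means:  {score_time:.4f} s")
```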

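For my needs the random-assignment algorithm is sufficient, and I have no plans to change it. However, I would appreciate suggestions on how to overcome the speed problem caused by randomly swapping rows between pandas DataFrames. It is the swapping of data in these rows that causes the slowdown. Is there perhaps a better way to do this within the current algorithm?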
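
One possible direction, which is not part of the original post, is to leave the trainee data in place and swap only row positions. The sketch below assumes equal-sized subgroups and attributes held as NumPy arrays; `swap_members`, `group_means`, and the toy data are hypothetical names and values chosen for illustration. A swap then exchanges two integers instead of two DataFrame rows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 30 trainees, attributes held as separate NumPy arrays.
n_trainees, n_groups = 30, 3
grades = rng.uniform(0, 10, size=n_trainees)
female = rng.integers(0, 2, size=n_trainees)

# groups[g] holds the row positions of the trainees currently in subgroup g.
# Assumes n_trainees is divisible by n_groups.
groups = rng.permutation(n_trainees).reshape(n_groups, -1)


def swap_members(groups, g_a, g_b, rng):
    """Swap one random member of subgroup g_a with one of g_b, in place."""
    i = rng.integers(groups.shape[1])
    j = rng.integers(groups.shape[1])
    groups[g_a, i], groups[g_b, j] = groups[g_b, j], groups[g_a, i]


def group_means(values, groups):
    """Per-subgroup mean of one attribute, computed with integer indexing."""
    return values[groups].mean(axis=1)


# One iteration of neighbor swaps: 1<->2, 2<->3, 3<->1, as in the loop above.
for g in range(n_groups):
    swap_members(groups, g, (g + 1) % n_groups, rng)

print(group_means(grades, groups))
print(group_means(female, groups))
```

The final subgroup DataFrames could be rebuilt once at the end by selecting rows of the full table with each row of `groups`. The count of distinct countries per subgroup is not covered here and would need its own per-group computation.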

Any suggestions for overcoming this challenge would be greatly appreciated!

0 answers:

No answers