Question

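I am seeing some odd runtime behaviour in PyCharm, explained below. The code runs on a machine with 20 cores and 256 GB of RAM, and there is plenty of free memory. I am not showing any of the actual functions because it is a fairly large project, but I am happy to add details on request.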

In short, I have a project of .py files with the following structure:

import ...
import ...

cpu_cores = control_parameters.cpu_cores
prng = RandomState(123)

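def collect_results(result_list):
    # result_list is a flat list in which every 4 consecutive items are
    # start_time, arrival_time, tour_id, trip_id
    return pd.DataFrame({'start_time': result_list[0::4],
                         'arrival_time': result_list[1::4],
                         'tour_id': result_list[2::4],
                         'trip_id': result_list[3::4]})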

if __name__ == '__main__':

    # Run the serial code
    st = starttimes.StartTimesCreate(prng)
    temp_df, two_trips_df, time_dist_arr = st.run()

    # Prepare the dataframe to sample start times. Create groups from the input dataframe
    temp_df1 = st.prepare_two_trips_more_df(temp_df, two_trips_df)
    validation.logger.info("Dataframe prepared for multiprocessing")

    grp_list = []
    for name, group in temp_df1.groupby('tour_id'):  # the slow runtime occurs here
        grp_list.append(group)
    validation.logger.info("All groups have been prepared for multiprocessing, "
                           "for a total of %s groups" % len(grp_list))

################ PARALLEL CODE BELOW #################

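The for loop runs over a dataframe with 10.5 million rows and 18 columns. In its current form, creating the list of groups (2.8 million groups) takes about 25 minutes. The groups are created and then fed to a multiprocessing pool, the code for which is not shown.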

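25 minutes seems very long to me, because I also ran the following test, which takes only 7 minutes. Essentially, I saved temp_df1 to a CSV file, then read the pre-saved file back in and ran the same for loop as before.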

import ...
import ...

cpu_cores = control_parameters.cpu_cores
prng = RandomState(123)

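def collect_results(result_list):
    # result_list is a flat list in which every 4 consecutive items are
    # start_time, arrival_time, tour_id, trip_id
    return pd.DataFrame({'start_time': result_list[0::4],
                         'arrival_time': result_list[1::4],
                         'tour_id': result_list[2::4],
                         'trip_id': result_list[3::4]})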

if __name__ == '__main__':

    # Run the serial code
    st = starttimes.StartTimesCreate(prng)

    # Raw strings do not need doubled backslashes
    temp_df1 = pd.read_csv(r"c:\...\temp_df1.csv")
    time_dist = pd.read_csv(r"c:\...\start_time_distribution_treso_1.csv")
    time_dist_arr = np.array(time_dist.to_records())

    grp_list = []
    for name, group in temp_df1.groupby('tour_id'):
        grp_list.append(group)
    validation.logger.info("All groups have been prepared for multiprocessing, "
                           "for a total of %s groups" % len(grp_list))

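Question: So what causes the code to run about 3 times faster when I simply read the file in, compared with when the dataframe is created as part of an upstream function?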

Thanks in advance, and please let me know how I can clarify further.

Answer 1

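I am answering my own question because I stumbled on the answer while running a bunch of tests, and thankfully, when I searched for a solution, I found that someone else had had the same issue. An explanation of why using a categorical column in a groupby operation is a bad idea can be found at the link above, so I won't repeat it here. Thanks.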
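
A minimal sketch of the effect, under an assumption the code in the question does not confirm: that tour_id (or another grouping key) was a categorical column in the in-memory dataframe. A CSV file does not store the categorical dtype, so a frame read back with pd.read_csv gets a plain dtype, which would be consistent with the faster second run. The names df, subset, plain and reloaded are hypothetical stand-ins for temp_df1, and the data is made up for illustration.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for temp_df1: 1000 tours with 5 rows each,
# with tour_id stored as a categorical column (assumption).
df = pd.DataFrame({
    "tour_id": np.repeat(np.arange(1000), 5),
    "start_time": np.arange(5000) % 1440,
})
df["tour_id"] = df["tour_id"].astype("category")

# Filtering rows keeps every category, including the ones no longer used.
subset = df[df["tour_id"].isin([1, 2, 3])]
n_categories = len(subset["tour_id"].cat.categories)

# With observed=False, groupby reports every category, used or not.
all_groups = subset.groupby("tour_id", observed=False).size()

# With observed=True, only categories that actually occur are grouped.
observed_groups = subset.groupby("tour_id", observed=True).size()

# Alternative: cast the key to a plain dtype before grouping.
plain = subset.assign(tour_id=subset["tour_id"].astype("int64"))
grp_list = [group for _, group in plain.groupby("tour_id")]

# A CSV round trip drops the categorical dtype as a side effect.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(n_categories, len(all_groups), len(observed_groups), len(grp_list))
print(df["tour_id"].dtype, "->", reloaded["tour_id"].dtype)
```

If this is the cause, checking temp_df1.dtypes before the loop and either passing observed=True or casting the key column should be enough, with no need to go through a CSV file.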
