Parallel read-only operations on a single pandas DataFrame

Time: 2019-10-08 15:26:09

Tags: python pandas dataframe concurrency multiprocessing

I have a large df (a few million rows) from which I want to read data to populate several np.arrays. I want to parallelize this because it takes a long time, and I figured it would be straightforward since it is a read-only operation on a single df.

Previously I relied on a pair of dictionaries for the lookups, but now I actually need additional functionality such as query() and groupby().
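As a made-up toy example of what I mean (the column names below are invented, not from my real data), a plain dict only covers point lookups, while the df also gives me filtering and per-group aggregation:

import pandas as pd

# toy stand-in for the real df, which has a few million rows
df = pd.DataFrame({
    "combo_id": [1, 1, 2, 2, 3],
    "score":    [0.1, 0.4, 0.2, 0.9, 0.5],
    "hit":      [0, 1, 0, 1, 1],
})

# the old approach: plain dict lookups, fast but limited to point queries
score_by_combo = dict(zip(df["combo_id"], df["score"]))
print(score_by_combo[3])

# what I need now: filtering and per-group aggregation
print(df.query("score > 0.3"))
print(df.groupby("combo_id")["hit"].sum())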

I have a custom object, call it df_interface, which handles all communication with the df so that the df itself is not copied across the different processes (that would cause memory problems). I tried two approaches.
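df_interface itself is not shown here; as a stripped-down, self-contained sketch of the kind of pattern I mean (not my actual code, and the class and method names below are only illustrative), the df lives in a single place and the workers reach it through a proxy rather than each getting their own copy:

from multiprocessing.managers import BaseManager
import pandas as pd

class DFInterface:
    # owns the one and only copy of the df and answers queries for the workers
    def __init__(self, df):
        self._df = df

    def query(self, expr):
        return self._df.query(expr)

class DFManager(BaseManager):
    pass

DFManager.register("DFInterface", DFInterface)

if __name__ == "__main__":
    df = pd.DataFrame({"a": range(10), "b": range(10)})
    with DFManager() as mgr:
        df_interface = mgr.DFInterface(df)   # workers get a proxy, not a copy of df
        print(df_interface.query("a > 7"))   # every call is an IPC round-trip

The first attempt feeds _populate through a ProcessPoolExecutor: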

with ProcessPoolExecutor(args.nthreads) as executor:
    logging.info(f"Initiating the threadpool with {executor.__getattribute__('_max_workers')} threads")
    i = 0
    for short_row, long_row, is_hit, old_score, new_score in executor.map(_populate, train_combos):
        try:
            sums_arr[i] = short_row
            full_arr[i] = long_row
            target[i] = is_hit
            o_score[i] = old_score
            n_score[i] = new_score
            i += 1
            logging.info(f"Iter {i} done")

        except ValueError as e:
            logging.error(traceback.format_exc())
            logging.error(short_row, long_row, is_hit, old_score, new_score)
            input("Hit any key to continue")

In the above case, execution is extremely slow, with most of the time (87%) spent in the acquire method of _thread.lock. I'm guessing that access to df_interface is causing all the waiting.

To make sure, I tried eliminating the multiprocessing calls and doing everything in a single process:

# trying the same without a threadpool
logging.info(f"Initiating without an executor")
i = 0
for short_row, long_row, is_hit, old_score, new_score in map(_populate, train_combos):
    try:
        sums_arr[i] = short_row
        full_arr[i] = long_row
        target[i] = is_hit
        o_score[i] = old_score
        n_score[i] = new_score
        i += 1
        logging.info(f"Iter {i} done")

    except ValueError as e:
        logging.error(traceback.format_exc())
        logging.error(short_row, long_row, is_hit, old_score, new_score)
        input("Hit any key to continue")

sys.exit(0)

This is faster than before, but still nowhere near practical. The profiler shows that the lion's share of the time (70%) goes to one of the DataFrame's methods, which is called 21435 times. As far as I can tell this is especially strange: it should not be called more than 5-6 times per processed element, and I have only been testing with 10 elements (the full data is over 100k elements).
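For reference, a minimal profiling setup that produces this kind of per-method time and call-count breakdown looks roughly like this (a sketch only; main() is a placeholder for the driver loop above):

import cProfile
import pstats

# profile one run of the populate loop and print the heaviest calls
cProfile.run("main()", "populate.prof")           # main() = placeholder for the loop above
stats = pstats.Stats("populate.prof")
stats.sort_stats("cumulative").print_stats(20)    # top 20 entries by cumulative time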

Is there a good way around this, or is using a pandas DataFrame the wrong approach here altogether?

0 answers:

No answers