I have a large df (several million rows) from which I want to read data to populate several `np.array`s. I want to parallelize this because it takes a long time, and I assumed it would be trivial since it is a read-only operation on a single df.
Previously I relied on a pair of dictionaries for the lookups, but now I actually need additional functionality such as `query()` and `groupby()`.
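For a rough illustration of the shift (all column and variable names below are made up, not from the real code):

```python
# Hypothetical illustration: plain dict lookups vs. the DataFrame
# features now needed. None of these names come from the real code.
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "flag": [True, False, True, True],
    "score": [0.1, 0.4, 0.2, 0.9],
})

# Before: a pair of dicts was enough for simple lookups.
score_by_id = dict(zip(df["id"], df["score"]))
old_score = score_by_id[2]

# Now: the per-element work needs conditional selection and aggregation.
combo_id = 2
hits = df.query("id == @combo_id and flag")   # query()
means = df.groupby("id")["score"].mean()      # groupby()
```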
I have a custom object, call it `df_interface`, that handles the communication with the df so that the df itself does not get copied across the different processes (which causes memory problems). I tried two approaches:
```python
# imports assumed from the surrounding script
import logging
import sys
import traceback
from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor(args.nthreads) as executor:
    logging.info(f"Initiating the threadpool with {executor.__getattribute__('_max_workers')} threads")
    i = 0
    for short_row, long_row, is_hit, old_score, new_score in executor.map(_populate, train_combos):
        try:
            sums_arr[i] = short_row
            full_arr[i] = long_row
            target[i] = is_hit
            o_score[i] = old_score
            n_score[i] = new_score
            i += 1
            logging.info(f"Iter {i} done")
        except ValueError as e:
            logging.error(traceback.format_exc())
            logging.error(f"{short_row} {long_row} {is_hit} {old_score} {new_score}")
            input("Hit any key to continue")
```
In the case above, execution is very slow, with most of the time (87%) spent in the `acquire` method of `_thread.lock`. I am guessing that the access to `df_interface` is causing all the waiting.
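If `df_interface` is anything like a `multiprocessing.Manager` proxy (sketch below; an assumption for illustration, not the real code), that would explain the lock: every call from every worker is marshalled through a single server process:

```python
# Minimal sketch of a Manager-backed interface (an assumption about how
# df_interface works, not its real code). Every proxy call is serialized
# through one server process, which is consistent with the profile showing
# most of the time in _thread.lock.acquire.
from multiprocessing.managers import BaseManager
import pandas as pd

class DFInterface:
    def __init__(self, df: pd.DataFrame):
        self._df = df

    def query_rows(self, expr: str) -> pd.DataFrame:
        # Runs inside the manager's server process, one request at a time.
        return self._df.query(expr)

class DFManager(BaseManager):
    pass

DFManager.register("DFInterface", DFInterface)

if __name__ == "__main__":
    manager = DFManager()
    manager.start()
    iface = manager.DFInterface(pd.DataFrame({"a": range(10)}))
    # Each call pays IPC + pickling + lock overhead, even though it is read-only.
    print(iface.query_rows("a > 7"))
```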
Next, I tried eliminating the call to multiprocessing and doing it all in a single process. It is faster than before, but still not fast enough to be practical:
```python
# trying the same without a threadpool
logging.info(f"Initiating without an executor")
i = 0
for short_row, long_row, is_hit, old_score, new_score in map(_populate, train_combos):
    try:
        sums_arr[i] = short_row
        full_arr[i] = long_row
        target[i] = is_hit
        o_score[i] = old_score
        n_score[i] = new_score
        i += 1
        logging.info(f"Iter {i} done")
    except ValueError as e:
        logging.error(traceback.format_exc())
        logging.error(f"{short_row} {long_row} {is_hit} {old_score} {new_score}")
        input("Hit any key to continue")
sys.exit(0)
```
From the profiler I can see that the lion's share of the time (70%) goes to a single DataFrame method, which gets called 21435 times. As far as I can tell this is especially strange: there should be no more than 5-6 calls per processed element, and I was testing with just 10 elements (the full data has over 100k elements).
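For reference, the numbers come from `cProfile`; a self-contained toy version of that measurement (data and column names made up) shows how even one logical `query()` per element fans out into many internal pandas calls, which may be part of why the counts look inflated:

```python
# Toy reproduction of the profiling setup (hypothetical data and column
# names). Counts calls into pandas code for 10 "elements", mirroring the
# test above; each high-level query() triggers many internal calls.
import cProfile
import pstats
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "key": np.random.randint(0, 100, size=100_000),
    "score": np.random.rand(100_000),
})

def lookup(k):
    # One "logical" lookup per element.
    return df.query("key == @k")["score"].sum()

profiler = cProfile.Profile()
profiler.enable()
for k in range(10):  # 10 elements, as in the test described above
    lookup(k)
profiler.disable()
pstats.Stats(profiler).sort_stats("cumulative").print_stats("pandas", 15)
```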
Is there a good way around this? Or is using a pandas DataFrame the wrong approach here altogether?