I have a large df (several million rows) from which I want to read data to populate several `np.array`s. I want to parallelize this because it takes a long time, and I assumed it would be trivial since it is a read-only operation on a single df.
Previously I relied on a pair of dictionaries for the lookups, but now I actually need additional functionality such as `query()` and `groupby()`.
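For a rough illustration of the shift (all column and variable names below are made up, not from the real code):

```python
# Hypothetical illustration: plain dict lookups vs. the DataFrame
# features now needed. None of these names come from the real code.
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "flag": [True, False, True, True],
    "score": [0.1, 0.4, 0.2, 0.9],
})

# Before: a pair of dicts was enough for simple lookups.
score_by_id = dict(zip(df["id"], df["score"]))
old_score = score_by_id[2]

# Now: the per-element work needs conditional selection and aggregation.
combo_id = 2
hits = df.query("id == @combo_id and flag")   # query()
means = df.groupby("id")["score"].mean()      # groupby()
```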
I have a custom object, call it `df_interface`, that handles the communication with the df so that the df itself does not get copied across the different processes (which causes memory problems). I tried two approaches:
```python
# imports assumed from the surrounding script
import logging
import sys
import traceback
from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor(args.nthreads) as executor:
    logging.info(f"Initiating the threadpool with {executor.__getattribute__('_max_workers')} threads")
    i = 0
    for short_row, long_row, is_hit, old_score, new_score in executor.map(_populate, train_combos):
        try:
            sums_arr[i] = short_row
            full_arr[i] = long_row
            target[i] = is_hit
            o_score[i] = old_score
            n_score[i] = new_score
            i += 1
            logging.info(f"Iter {i} done")
        except ValueError as e:
            logging.error(traceback.format_exc())
            logging.error(f"{short_row} {long_row} {is_hit} {old_score} {new_score}")
            input("Hit any key to continue")
```
In the case above, execution is very slow, with most of the time (87%) spent in the `acquire` method of `_thread.lock`. I am guessing that the access to `df_interface` is causing all the waiting.
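If `df_interface` is anything like a `multiprocessing.Manager` proxy (sketch below; an assumption for illustration, not the real code), that would explain the lock: every call from every worker is marshalled through a single server process:

```python
# Minimal sketch of a Manager-backed interface (an assumption about how
# df_interface works, not its real code). Every proxy call is serialized
# through one server process, which is consistent with the profile showing
# most of the time in _thread.lock.acquire.
from multiprocessing.managers import BaseManager
import pandas as pd

class DFInterface:
    def __init__(self, df: pd.DataFrame):
        self._df = df

    def query_rows(self, expr: str) -> pd.DataFrame:
        # Runs inside the manager's server process, one request at a time.
        return self._df.query(expr)

class DFManager(BaseManager):
    pass

DFManager.register("DFInterface", DFInterface)

if __name__ == "__main__":
    manager = DFManager()
    manager.start()
    iface = manager.DFInterface(pd.DataFrame({"a": range(10)}))
    # Each call pays IPC + pickling + lock overhead, even though it is read-only.
    print(iface.query_rows("a > 7"))
```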
Next, I tried eliminating the call to multiprocessing and doing it all in a single process. It is faster than before, but still not fast enough to be practical:
```python
# trying the same without a threadpool
logging.info(f"Initiating without an executor")
i = 0
for short_row, long_row, is_hit, old_score, new_score in map(_populate, train_combos):
    try:
        sums_arr[i] = short_row
        full_arr[i] = long_row
        target[i] = is_hit
        o_score[i] = old_score
        n_score[i] = new_score
        i += 1
        logging.info(f"Iter {i} done")
    except ValueError as e:
        logging.error(traceback.format_exc())
        logging.error(f"{short_row} {long_row} {is_hit} {old_score} {new_score}")
        input("Hit any key to continue")
sys.exit(0)
```
From the profiler I can see that the lion's share of the time (70%) goes to a single DataFrame method, which gets called 21435 times. As far as I can tell this is especially strange: there should be no more than 5-6 calls per processed element, and I was testing with just 10 elements (the full data has over 100k elements).
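For reference, the numbers come from `cProfile`; a self-contained toy version of that measurement (data and column names made up) shows how even one logical `query()` per element fans out into many internal pandas calls, which may be part of why the counts look inflated:

```python
# Toy reproduction of the profiling setup (hypothetical data and column
# names). Counts calls into pandas code for 10 "elements", mirroring the
# test above; each high-level query() triggers many internal calls.
import cProfile
import pstats
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "key": np.random.randint(0, 100, size=100_000),
    "score": np.random.rand(100_000),
})

def lookup(k):
    # One "logical" lookup per element.
    return df.query("key == @k")["score"].sum()

profiler = cProfile.Profile()
profiler.enable()
for k in range(10):  # 10 elements, as in the test described above
    lookup(k)
profiler.disable()
pstats.Stats(profiler).sort_stats("cumulative").print_stats("pandas", 15)
```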
Is there a good way around this? Or is using a pandas DataFrame the wrong approach here altogether?