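The tup list is a subset of a huge dataset. I have been trying to use multithreading to reduce the computation time, but the dfsi list comes back empty. Why?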
import numpy as np
from multiprocessing.pool import ThreadPool

dfsi = []
tup = [(28075,69),(28075,72),(28075,73),(28075,76),(28075,96),(28075,99),
       (28075,102),(28075,103),(28075,162),(28075,165)]

def multi_processing_tuples(sku,ids):
    Q0 = np.percentile(df[((df['sku'] == sku) & (df['ids'] == ids))], 0)
    Q4 = np.percentile(df[((df['sku'] == sku) & (df['ids'] == ids))], 100)
    dfsi.append((sku,ids,Q0,Q4))

pool_size = 5
pool = ThreadPool(pool_size)
for (sku,ids) in tup:
    pool.apply_async(multi_processing_tuples, ((sku,ids),))
pool.close()
pool.join()
Edit:
import numpy as np
from multiprocessing.pool import ThreadPool

dfsi = []
tup = [(28075,69),(28075,72),(28075,73),(28075,76),(28075,96),(28075,99),
       (28075,102),(28075,103),(28075,162),(28075,165)]

def multi_processing_tuples(sku,ids):
    Q0 = np.percentile(df[((df['sku'] == sku) & (df['ids'] == ids))], 0)
    Q4 = np.percentile(df[((df['sku'] == sku) & (df['ids'] == ids))], 100)
    return (sku,ids,Q0,Q4)

pool_size = 5
pool = ThreadPool(pool_size)
for (sku,ids) in tup:
    dfsi.append(pool.apply_async(multi_processing_tuples, ((sku,ids),)))
pool.close()
pool.join()
This is what dfsi now contains:
[<multiprocessing.pool.ApplyResult at 0x1f707d7d9b0>,
<multiprocessing.pool.ApplyResult at 0x1f707d7d748>,
<multiprocessing.pool.ApplyResult at 0x1f707d7d710>,
<multiprocessing.pool.ApplyResult at 0x1f707d7dda0>,
<multiprocessing.pool.ApplyResult at 0x1f707d8e0f0>,
<multiprocessing.pool.ApplyResult at 0x1f707d8e358>,
<multiprocessing.pool.ApplyResult at 0x1f707d8e320>,
<multiprocessing.pool.ApplyResult at 0x1f707d8e6a0>,
<multiprocessing.pool.ApplyResult at 0x1f707d936d8>,
<multiprocessing.pool.ApplyResult at 0x1f707d93eb8>]
How do I see the actual output?
Answer 0 (score: 0)
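A forked worker process shares the original process's data, but that data is copied as soon as the worker tries to change it, and nothing is copied back implicitly when the worker exits. Threads from a ThreadPool do share memory with the parent, but in either case the reliable approach is to return the results explicitly and then handle them in the parent.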
def multi_processing_tuples(skid):
    # Each item of tup arrives as a single (sku, ids) tuple
    sku,ids = skid
    Q0 = np.percentile(df[((df['sku'] == sku) & (df['ids'] == ids))], 0)
    Q4 = np.percentile(df[((df['sku'] == sku) & (df['ids'] == ids))], 100)
    return (sku,ids,Q0,Q4)

# imap yields the returned values themselves, in the order of tup
for data in pool.imap(multi_processing_tuples,tup):
    dfsi.append(data)
Doing it this way returns the data from multi_processing_tuples, but you should probably also pass df in as a parameter.
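As a rough sketch of both points (passing the data in explicitly, and reading real values out of apply_async), something like the following may help. It uses only the standard library: ROWS is a made-up stand-in for df, and min_max_for_pair and summarize are hypothetical names, not part of the question's code. It relies on the fact that the 0th and 100th percentiles are the minimum and maximum. Note that apply_async only returns an ApplyResult handle; calling .get() on it waits for the task and returns the value, or re-raises whatever exception the worker hit (for example a TypeError when the args tuple does not match the function's parameters).

```python
from functools import partial
from multiprocessing.pool import ThreadPool

# Hypothetical stand-in for df: (sku, ids, value) rows.
ROWS = [
    (28075, 69, 4.0), (28075, 69, 9.5), (28075, 69, 1.5),
    (28075, 72, 7.0), (28075, 72, 2.0),
]

def min_max_for_pair(rows, pair):
    """Return (sku, ids, smallest value, largest value) for one pair."""
    sku, ids = pair
    values = [v for s, i, v in rows if s == sku and i == ids]
    # The 0th and 100th percentiles are the minimum and maximum.
    return (sku, ids, min(values), max(values))

def summarize(rows, pairs, pool_size=5):
    # Bind the data to the worker instead of relying on a global.
    worker = partial(min_max_for_pair, rows)
    with ThreadPool(pool_size) as pool:
        # apply_async returns an ApplyResult handle, not the value.
        handles = [pool.apply_async(worker, (pair,)) for pair in pairs]
        # .get() blocks until the task is done and re-raises any
        # exception that occurred inside the worker.
        return [h.get() for h in handles]

if __name__ == "__main__":
    for row in summarize(ROWS, [(28075, 69), (28075, 72)]):
        print(row)
```

With a real DataFrame the same pattern should apply: bind df with partial (or pass it in the args tuple) and collect the results with .get().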
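Edit: Also, you generally shouldn't use threads for this. If you want to improve the runtime of a CPU-bound job, use a process pool. Threads help with IO-bound work.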
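A minimal sketch of the process-pool variant, assuming the work really is CPU-bound (in standard CPython builds the global interpreter lock lets only one thread run Python bytecode at a time). GROUPS and min_max are hypothetical names with made-up data standing in for the per-pair slices of df. The worker has to be a module-level function so it can be pickled, and each worker receives its input as an argument and hands its result back by returning it.

```python
from multiprocessing import Pool

# Hypothetical stand-in data: one list of values per (sku, ids) pair.
GROUPS = {
    (28075, 69): [4.0, 9.5, 1.5],
    (28075, 72): [7.0, 2.0],
}

def min_max(item):
    """Worker: defined at module level so it can be pickled."""
    (sku, ids), values = item
    return (sku, ids, min(values), max(values))

if __name__ == "__main__":
    # The guard keeps child processes from re-running this block on
    # platforms that start workers by importing the main module.
    with Pool(processes=2) as pool:
        for row in pool.imap(min_max, GROUPS.items()):
            print(row)
```

Whether this beats the thread version depends on how much data has to be pickled and sent to each worker, so it is worth timing both on the real dataset.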