Suppose I have a function:
def support1(items, rows, output):
    # Fraction of rows that contain every item in `items`.
    n_rows = len(rows)
    if type(items) is list or type(items) is set:
        count = float(sum([1 for row in rows if all(item in row.split() for item in items)]))
    elif type(items) is str:
        # A space-separated string is treated as a list of items.
        count = float(sum([1 for row in rows if all(item in row.split() for item in items.split())]))
    res = count/n_rows
    # The result goes onto the queue; nothing is returned.
    output.put(res)
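To make the intent concrete, here is a minimal sketch of the quantity support1 computes for a single pair, without the queue. The sample rows and the variable names are only illustrative.

```python
# Illustrative sample rows; each row is one space-separated transaction.
rows = ['apple banana pear peach cream', 'apple banana pear',
        'pear apple apple banana', 'pear banana', 'banana', 'apple']
pair = ['apple', 'banana']

# Keep the rows that contain every item of the pair.
matches = [row for row in rows if all(item in row.split() for item in pair)]

# Support is the share of rows that matched.
support = float(len(matches)) / len(rows)
print(support)  # 0.5 (3 of the 6 rows contain both items)
```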
I want to run this function on a list of 50,000 item pairs, like this:
all_items = [['apple', 'banana'], ['apple', 'fruit'],
['apple', 'pear'], ['banana', 'pear'], ...]
I have 10,000 transactions to go through:
transactions = ['apple banana pear peach cream', 'apple banana pear', 'pear apple apple banana', 'pear banana', 'banana', 'apple', ...]
So, to compute the frequency of these pairs, I wrote something like this:
supports = [support1(pair, transactions, output) for pair in all_items]
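This obviously takes forever on my machine. I cannot convert transactions to a set. I am trying to launch some parallel processes, but they take just as long as the list comprehension above. Here is my parallel code: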
import multiprocessing as mp
output = mp.Queue()
# One Process object per pair, i.e. 50,000 processes in total.
processes = [mp.Process(target=support1, args=(pair, transactions, output)) for pair in all_items]
for p in processes:
    p.start()
The final for loop takes forever... Am I missing something about the multiprocessing module? I have done parallel processing before and it was not this bad.
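One direction that might be worth comparing against, as a sketch under assumptions rather than a tested fix: instead of starting one Process per pair, a fixed-size multiprocessing.Pool could map over the pairs in chunks, with every transaction split into a set of items once per worker. The helper names init_worker, pair_support and compute_supports are hypothetical, the chunksize default is a guess, and the code assumes Python 3.

```python
import multiprocessing as mp

_ROW_SETS = None  # per-worker cache of the pre-split transactions


def init_worker(rows):
    """Run once in each worker: split every transaction into a set of items."""
    global _ROW_SETS
    _ROW_SETS = [set(row.split()) for row in rows]


def pair_support(items):
    """Return the fraction of transactions that contain every item in items."""
    wanted = set(items.split()) if isinstance(items, str) else set(items)
    hits = sum(1 for row in _ROW_SETS if wanted <= row)  # subset test per row
    return hits / len(_ROW_SETS)


def compute_supports(all_items, transactions, workers=None, chunksize=500):
    """Map pair_support over all pairs with a fixed number of workers."""
    with mp.Pool(processes=workers, initializer=init_worker,
                 initargs=(transactions,)) as pool:
        # chunksize batches many pairs into one task to limit IPC overhead.
        return pool.map(pair_support, all_items, chunksize=chunksize)


if __name__ == "__main__":
    # Small illustrative inputs, not the real 50,000 pairs.
    transactions = ['apple banana pear peach cream', 'apple banana pear',
                    'pear apple apple banana', 'pear banana', 'banana', 'apple']
    all_items = [['apple', 'banana'], ['banana', 'pear']]
    print(compute_supports(all_items, transactions, workers=2, chunksize=1))
```

Here the results come back as the return value of pool.map, in the same order as all_items, so no Queue is involved. Only the individual rows are turned into sets; the list of transactions itself stays a list, so duplicate transactions still count.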