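I want to search a given article for a predefined list of keywords and increase a score by 1 for each keyword found in the article. I want to use multiprocessing because the predefined keyword list is very large (10k keywords) and there are 100k articles.
I came across this question, but it did not solve my problem.
I tried the implementation below, but the result is None.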
import multiprocessing as mp

keywords = ["threading", "package", "parallelize"]

def search_worker(keyword):
    score = 0
    article = """
The multiprocessing package also includes some APIs that are not in the threading module at all. For example, there is a neat Pool class that you can use to parallelize executing a function across multiple inputs."""
    if keyword in article:
        score += 1
    return score
I tried the two methods below, but both gave me three None values.
Method 1:
pool = mp.Pool(processes=4)
result = [pool.apply(search_worker, args=(keyword,)) for keyword in keywords]
Method 2:
result = pool.map(search_worker, keywords)
print(result)
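Actual output: [None, None, None]
Expected output: 3
I want to send the predefined keyword list and the article to the workers, but I am not sure whether I am heading in the right direction, since I have no prior experience with multiprocessing.
Thanks in advance.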
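Answer 0: (score: 1)
Here is a function that uses Pool. You can pass it text and keyword_list and it will work. You could also use Pool.starmap and pass tuples of (text, keyword), but then you would have to deal with an iterable that holds 10k references to text.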
from functools import partial
from multiprocessing import Pool

def search_worker(text, keyword):
    return int(keyword in text)

def parallel_search_text(text, keyword_list):
    processes = 4
    chunk_size = 10
    total = 0
    func = partial(search_worker, text)
    with Pool(processes=processes) as pool:
        for result in pool.imap_unordered(func, keyword_list, chunksize=chunk_size):
            total += result
    return total

if __name__ == '__main__':
    texts = []  # a list of texts
    keywords = []  # a list of keywords
    for text in texts:
        print(parallel_search_text(text, keywords))
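Creating a worker pool has overhead, so it may be worth timing this against a simple single-process text search function. Repeated calls can be sped up by creating one Pool instance and passing it into the function.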
def parallel_search_text2(text, keyword_list, pool):
    chunk_size = 10
    results = 0
    func = partial(search_worker, text)
    for result in pool.imap_unordered(func, keyword_list, chunksize=chunk_size):
        results += result
    return results

if __name__ == '__main__':
    texts = []  # a list of texts
    keywords = []  # a list of keywords
    # create the pool once and reuse it for every text
    with Pool(processes=4) as pool:
        for text in texts:
            print(parallel_search_text2(text, keywords, pool))
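The timing comparison suggested above could be sketched roughly as follows. This is only a minimal single-process baseline; the names serial_search_text and time_call are hypothetical helpers introduced for illustration, and the sample text and keywords are placeholders. Timing the pool-based functions with the same helper would show whether the pool overhead pays off for your data.

```python
import time

def search_worker(text, keyword):
    # 1 if the keyword occurs in the text, otherwise 0
    return int(keyword in text)

def serial_search_text(text, keyword_list):
    # hypothetical single-process baseline: same scoring, no Pool
    return sum(search_worker(text, keyword) for keyword in keyword_list)

def time_call(func, *args):
    # hypothetical helper: returns (result, elapsed seconds) for one call
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start

if __name__ == '__main__':
    text = "The multiprocessing package offers a Pool class."  # placeholder text
    keywords = ["package", "Pool", "threading"]  # placeholder keywords
    total, elapsed = time_call(serial_search_text, text, keywords)
    print(total, elapsed)
```

A single run on a tiny input says little; repeating the measurement over a realistic batch of texts gives a fairer picture.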
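Answer 1: (score: 0)
User e.s has already addressed the main problem in their comment, but I am posting a solution for Om Prakash's comment, which asked for a way to pass in:
"the article and the predefined list of keywords to the worker method"
Here is a simple way to do it. All you need to do is construct a tuple containing the arguments you want the worker to process: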
from multiprocessing import Pool

def search_worker(article_and_keyword):
    # unpack the tuple
    article, keyword = article_and_keyword
    # score 1 if the keyword occurs in the article, otherwise 0
    score = 0
    if keyword in article:
        score += 1
    return score

if __name__ == "__main__":
    # the article and the keywords
    article = """The multiprocessing package also includes some APIs that are not in the threading module at all. For example, there is a neat Pool class that you can use to parallelize executing a function across multiple inputs."""
    keywords = ["threading", "package", "parallelize"]
    # construct the arguments for search_worker; one keyword per task but the same article
    args = [(article, keyword) for keyword in keywords]
    # construct the pool and map to the workers
    with Pool(3) as pool:
        result = pool.map(search_worker, args)
    print(result)
If you are using a more recent version of Python (3.3 or later), I would suggest trying starmap, as it makes this a bit cleaner.
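For illustration, a starmap version might look like the sketch below. The function name count_keywords and the use of itertools.repeat to pair the article with each keyword are choices made for this example, not part of the answer above. With starmap the worker takes two ordinary parameters, so no manual tuple unpacking is needed, and summing the per-keyword results gives the single score the question asks for.

```python
from itertools import repeat
from multiprocessing import Pool

def search_worker(article, keyword):
    # 1 if the keyword occurs in the article, otherwise 0
    return int(keyword in article)

def count_keywords(article, keywords, processes=3):
    # hypothetical helper: starmap unpacks each (article, keyword) pair
    # into the two positional arguments of search_worker
    with Pool(processes) as pool:
        hits = pool.starmap(search_worker, zip(repeat(article), keywords))
    return sum(hits)

if __name__ == "__main__":
    article = """The multiprocessing package also includes some APIs that are not in the threading module at all. For example, there is a neat Pool class that you can use to parallelize executing a function across multiple inputs."""
    keywords = ["threading", "package", "parallelize"]
    print(count_keywords(article, keywords))  # prints 3
```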