I'm having trouble parallelizing a news-scraping script.
I have the following script, which reads a Google News RSS page and processes each of the returned links. news_list is a BeautifulSoup element containing the ten most recent news items on a given topic.
    from datetime import datetime

    def main(soup_page):
        news_list = soup_page.findAll("item")
        feed = []
        for article in news_list[:10]:
            new = {}
            new['title'] = article.title.text
            new['source'] = article.source.text
            new['link'] = article.link.text
            new['date'] = datetime.strptime(article.pubDate.text, '%a, %d %b %Y %H:%M:%S %Z')
            new['keywords'] = keywords(article.link.text)
            feed.append(new)
        return feed
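As a side note, the pubDate parsing in the loop can be checked in isolation. A minimal sketch with the same format string, using a made-up RSS-style sample date:

```python
from datetime import datetime

# Hypothetical sample value in the usual RSS pubDate format.
sample = 'Tue, 08 Jan 2019 13:45:00 GMT'

# Same format string as in the loop above; result is a naive datetime.
parsed = datetime.strptime(sample, '%a, %d %b %Y %H:%M:%S %Z')
print(parsed.year, parsed.month, parsed.day, parsed.hour)
```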
The keywords function processes the article content and returns the important keywords. It takes roughly 1.5 seconds per article, so the full script takes at least 15 seconds to run.

I'd like to reduce the script's running time, so I have been trying multiprocessing instead of the for loop, for example:
    def process_article(article):
        new = {}
        new['title'] = article.title.text
        new['source'] = article.source.text
        new['link'] = article.link.text
        new['date'] = datetime.strptime(article.pubDate.text, '%a, %d %b %Y %H:%M:%S %Z')
        new['keywords'] = keywords(article.link.text)
        return new
    from joblib import Parallel, delayed
    import multiprocessing

    num_cores = multiprocessing.cpu_count()
    feed = Parallel(n_jobs=num_cores)(delayed(process_article)(article) for article in news_list[:10])
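For comparison, here is a minimal, runnable model of the same fan-out pattern using the stdlib concurrent.futures instead of joblib, operating on plain link strings rather than BeautifulSoup objects. The keywords stub below is hypothetical (the real one isn't shown in this post):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the real keywords() function,
# which takes an article URL and returns a list of keywords.
def keywords(link):
    return [link.rsplit('/', 1)[-1]]

# Plain strings, not BeautifulSoup objects, so the work items
# can be handed to workers without any pickling trouble.
links = ['https://example.com/news/alpha',
         'https://example.com/news/beta']

# Threads suffice if keywords() is dominated by network I/O;
# a ProcessPoolExecutor would fit CPU-bound work instead.
with ThreadPoolExecutor(max_workers=4) as pool:
    all_keywords = list(pool.map(keywords, links))

print(all_keywords)  # [['alpha'], ['beta']]
```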
However, I get an error as if process_article were recursive:

    RecursionError: maximum recursion depth exceeded while calling a Python object

What am I doing wrong? It still happens even if I strip the function down as follows, so the keywords function is not the problem:
    def process_article(article):
        new = {}
        return new
Any help would be appreciated. Thanks!

Here is the full traceback:
RecursionError Traceback (most recent call last)
<ipython-input-90-498afb9f1a25> in <module>
1 num_cores = multiprocessing.cpu_count()
2
----> 3 results = Parallel(n_jobs=num_cores)(delayed(process_news)(article) for article in list(news_list[:10]))
/usr/local/lib/python3.6/site-packages/joblib/parallel.py in __call__(self, iterable)
787 # consumption.
788 self._iterating = False
--> 789 self.retrieve()
790 # Make sure that we get a last message telling us we are done
791 elapsed_time = time.time() - self._start_time
/usr/local/lib/python3.6/site-packages/joblib/parallel.py in retrieve(self)
697 try:
698 if getattr(self._backend, 'supports_timeout', False):
--> 699 self._output.extend(job.get(timeout=self.timeout))
700 else:
701 self._output.extend(job.get())
/usr/local/Cellar/python/3.6.5_1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py in get(self, timeout)
642 return self._value
643 else:
--> 644 raise self._value
645
646 def _set(self, i, obj):
/usr/local/Cellar/python/3.6.5_1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py in _handle_tasks(taskqueue, put, outqueue, pool, cache)
422 break
423 try:
--> 424 put(task)
425 except Exception as e:
426 job, idx = task[:2]
/usr/local/lib/python3.6/site-packages/joblib/pool.py in send(obj)
369 def send(obj):
370 buffer = BytesIO()
--> 371 CustomizablePickler(buffer, self._reducers).dump(obj)
372 self._writer.send_bytes(buffer.getvalue())
373 self._send = send
RecursionError: maximum recursion depth exceeded while calling a Python object