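I seem to be having a problem with a Python script that uses multiprocessing. Essentially, it takes a list of ID codes and starts processes that use Selenium with PhantomJS as the driver to navigate to a URL containing each ID code, extract the data into an individual CSV file, and then, once all processes have finished, compile everything into another CSV file. Everything runs fine, except that occasionally one of the processes raises an exception: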
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "modtest.py", line 11, in worker
    do_work(item)
  File "/home/mdrouin/Dropbox/Work/Dev/Python/WynInvScrape/items.py", line 14, in do_work
    driver = webdriver.PhantomJS()
  File "/usr/lib/python2.7/site-packages/selenium/webdriver/phantomjs/webdriver.py", line 50, in __init__
    self.service.start()
  File "/usr/lib/python2.7/site-packages/selenium/webdriver/phantomjs/service.py", line 72, in start
    raise WebDriverException("Can not connect to GhostDriver")
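I have tried various ways of restarting the process when the exception is raised, but whatever I do, once the processes finish the program hangs and does not continue or do anything else. If a process crashes, I essentially want to restart the search for the ID number it was working on, and carry on once all processes have finished. Here is a stripped-down version of the code: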
from selenium import webdriver
from time import sleep
from bs4 import BeautifulSoup as bs
import multiprocessing
import datetime, time, csv, glob
num_procs = 8
def do_work(rsrt):
    driver = webdriver.PhantomJS()
    try:
        driver.get('http://www.example.com/get.php?resort=' + rsrt)
        rows = []
        # NOTE: the parsing step is cut from this minimal version,
        # so soup and wynresort are not defined here
        for row in soup.find_all('tr'):
            if row.find('input', {'name': 'booksubmit'}):
                wyncheckin = row.find('td', {'class': 'searchAvailDate'}).string
                wynnights = row.find('td', {'class': 'searchAvailNights'}).string
                wynroom = row.find('td', {'class': 'searchAvailUnitType'}).string
                rows.append([wynresort, wyncheckin, wynroom])
        driver.quit()
        with open('/home/mdrouin/Dropbox/Work/Dev/Python/WynInvScrape/availability/'+rsrt+'.csv', 'wb') as f:
            writer = csv.writer(f)
            writer.writerows(row for row in rows if row)
        print 'Process ' + rsrt + ' End: ' + str(time.strftime('%c'))
    except:
        driver.quit()
def worker():
    for item in iter( q.get, None ):
        do_work(item)
        q.task_done()
    q.task_done()
q = multiprocessing.JoinableQueue()
procs = []
for i in range(num_procs):
    procs.append( multiprocessing.Process(target=worker) )
    procs[-1].daemon = True
    procs[-1].start()
source = ['0017', '0113', '0020', '0013', '0038', '1028', '0115', '0105', '0041', '0037', '0043', '2026', '0165', '0164',
          '0033', '0126', '0116', '0103', '9135', '0185', '0206', '0053', '0062', '1020', '0019', '0042', '2028', '0213',
          '0211', '0163', '0073', '2020', '0214', '2140', '0084', '0193', '0095', '0064', '0196', '0028', '0068', '0074']
for item in source:
    q.put(item)
q.join()
for p in procs:
    q.put( None )
q.join()
for p in procs:
    p.join()
print "Finished"
print 'Writing core output: ' + str(time.strftime('%c'))
with open('availability.csv', 'wb') as outfile:
    for csvfile in glob.glob('/home/mdrouin/Dropbox/Work/Dev/Python/WynInvScrape/availability/*.csv'):
        for line in open(csvfile, 'r'):
            outfile.write(line)
print 'Process End: ' + str(time.strftime('%c'))
Answer 0 (score: 1)
One way to handle this kind of problem is to have the function call itself again on failure, along these lines:
def do_work(rsrt):
    # ... attempt the scrape here and set `failed` accordingly ...
    if failed:
        return do_work(rsrt)
Of course, this will keep retrying until it succeeds, so you probably want to pass in a counter and return False once it exceeds some limit.
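A minimal sketch of that counter idea, written so it runs without Selenium. Here `scrape` is a hypothetical stand-in for the body of `do_work` (anything that raises on failure, such as the `WebDriverException` in the traceback), and `max_attempts`, `attempt`, and passing `q` as an argument are assumptions for illustration, not part of the original script. The worker also calls `q.task_done()` in a `finally` block: a likely reason for the hang is that a worker which dies on an uncaught exception never marks its item as done, so `q.join()` waits forever.

```python
import logging


def do_work(rsrt, scrape, max_attempts=3, attempt=1):
    """Run scrape(rsrt); on failure call ourselves again, up to max_attempts.

    `scrape` is a hypothetical callable standing in for the real scraping
    code. Returns True on success, False once the attempt limit is reached.
    """
    try:
        scrape(rsrt)
        return True
    except Exception:
        if attempt >= max_attempts:
            return False  # give up instead of recursing forever
        return do_work(rsrt, scrape, max_attempts, attempt + 1)


def worker(q, scrape):
    """Consume items from q until the None sentinel arrives."""
    for item in iter(q.get, None):
        try:
            if not do_work(item, scrape):
                logging.warning("giving up on %s", item)
        finally:
            # Always mark the item as done so q.join() can return,
            # even if something unexpected goes wrong above.
            q.task_done()
    q.task_done()  # accounts for the None sentinel itself
```

With several worker processes you would hand the same `JoinableQueue` to each one, for example `multiprocessing.Process(target=worker, args=(q, scrape))`.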