python2.7 + multiprocessing + selenium: restarting a process on exception

Asked: 2014-02-26 00:27:25

Tags: python selenium selenium-webdriver multiprocessing phantomjs

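I seem to be running into a problem with a Python script that uses multiprocessing. Essentially, it takes a list of ID codes and starts processes that use Selenium with PhantomJS as the driver to navigate to a URL containing each ID code, extracts the data into an individual csv file, and then compiles another csv file once all the processes have finished. Everything runs great, except that sometimes one of the processes raises an exception that says: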

Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "modtest.py", line 11, in worker
    do_work(item)
  File "/home/mdrouin/Dropbox/Work/Dev/Python/WynInvScrape/items.py", line 14, in do_work
    driver = webdriver.PhantomJS()
  File "/usr/lib/python2.7/site-packages/selenium/webdriver/phantomjs/webdriver.py", line 50, in __init__
    self.service.start()
  File "/usr/lib/python2.7/site-packages/selenium/webdriver/phantomjs/service.py", line 72, in start
    raise WebDriverException("Can not connect to GhostDriver")

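I've tried various ways of restarting the process when the exception is raised, but whatever I do, it seems that once the processes finish, the program hangs and doesn't continue or do anything about it. If a process crashes, I essentially want to restart the ID number it was working on, and continue once all the processes have finished. Here is a stripped-down version of the code: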

from selenium import webdriver
from time import sleep
from bs4 import BeautifulSoup as bs
import multiprocessing
import datetime, time, csv, glob


num_procs = 8

def do_work(rsrt):

        driver = webdriver.PhantomJS()

        try:
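            driver.get('http://www.example.com/get.php?resort=' + rsrt)
            # Parse the rendered page; `soup` is used below but was never defined
            soup = bs(driver.page_source, 'html.parser')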

            rows = []

            for row in soup.find_all('tr'):
                if row.find('input', {'name': 'booksubmit'}):
                    wyncheckin = row.find('td', {'class': 'searchAvailDate'}).string
                    wynnights = row.find('td', {'class': 'searchAvailNights'}).string
                    wynroom = row.find('td', {'class': 'searchAvailUnitType'}).string
                    # wynresort was undefined; assuming the resort ID (rsrt) was meant
                    rows.append([rsrt, wyncheckin, wynroom])


            driver.quit()

            with open('/home/mdrouin/Dropbox/Work/Dev/Python/WynInvScrape/availability/'+rsrt+'.csv', 'wb') as f:
                writer = csv.writer(f)
                writer.writerows(row for row in rows if row)

            print 'Process ' + rsrt + ' End: ' + str(time.strftime('%c'))


        except:
            driver.quit()



def worker():
    for item in iter( q.get, None ):
        do_work(item)
        q.task_done()
    q.task_done()


q = multiprocessing.JoinableQueue()

procs = []

for i in range(num_procs):
    procs.append( multiprocessing.Process(target=worker) )
    procs[-1].daemon = True
    procs[-1].start()

source = ['0017', '0113', '0020', '0013', '0038', '1028', '0115', '0105', '0041', '0037', '0043', '2026', '0165', '0164',
        '0033', '0126', '0116', '0103', '9135', '0185', '0206', '0053', '0062', '1020', '0019', '0042', '2028', '0213',
        '0211', '0163', '0073', '2020', '0214', '2140', '0084', '0193', '0095', '0064', '0196', '0028', '0068', '0074']

for item in source:
    q.put(item)

q.join()

for p in procs:
    q.put( None )

q.join()

for p in procs:
    p.join()

print "Finished"
print 'Writing core output: ' + str(time.strftime('%c'))
with open('availability.csv', 'wb') as outfile:
    for csvfile in glob.glob('/home/mdrouin/Dropbox/Work/Dev/Python/WynInvScrape/availability/*.csv'):
        for line in open(csvfile, 'r'):
            outfile.write(line)

print 'Process End: ' + str(time.strftime('%c'))

1 Answer:

Answer 0 (score: 1)

One way to solve this kind of problem is to have the function call itself again on failure, along these lines:

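def do_work(rsrt):
    # attempt() is a placeholder for the scraping code; it returns False on failure
    failed = not attempt(rsrt)
    if failed:
        return do_work(rsrt)  # try the same ID again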

Of course, this will keep running until it succeeds, so you may want to pass in a counter and return False once it exceeds some value.
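
A minimal sketch of the counter idea, under some assumptions: the names do_work_with_retry, scrape, retries_left and delay are made up for illustration, and flaky_scrape / always_fails are stand-ins that simulate the "Can not connect to GhostDriver" failure rather than calling Selenium.

```python
import time


def do_work_with_retry(rsrt, scrape, retries_left=3, delay=0.0):
    """Run scrape(rsrt); on any exception, retry up to retries_left more times.

    Returns True on success and False once the retries are used up.
    `scrape` is a hypothetical parameter standing in for the body of do_work.
    """
    try:
        scrape(rsrt)
        return True
    except Exception:
        if retries_left <= 0:
            return False  # give up so the caller can move on to the next ID
        time.sleep(delay)  # optional pause before trying again
        return do_work_with_retry(rsrt, scrape, retries_left - 1, delay)


# Stand-ins for the real scraping code (assumptions, not Selenium calls).
attempts = {}


def flaky_scrape(rsrt):
    """Fail on the first two calls for a given ID, then succeed."""
    attempts[rsrt] = attempts.get(rsrt, 0) + 1
    if attempts[rsrt] < 3:
        raise RuntimeError("Can not connect to GhostDriver")


def always_fails(rsrt):
    """Simulate an ID that never succeeds."""
    raise RuntimeError("Can not connect to GhostDriver")


ok = do_work_with_retry('0017', flaky_scrape)          # succeeds on the third attempt
gave_up = do_work_with_retry('0113', always_fails, 2)  # False after three attempts
```

Note that in the traceback from the question the exception comes from webdriver.PhantomJS() itself, which sits outside the try block in do_work, so for a retry like this to help, the driver creation would have to be inside the code being retried. Returning False instead of raising also lets the worker loop carry on and call q.task_done() for that item.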