Timeout issue when running a Python script with PhantomJS and Selenium

Date: 2015-11-20 09:04:54

Tags: python selenium phantomjs ghostdriver

I am running a Python script that uses PhantomJS and Selenium, and I am facing a timeout issue: the script stops after 20-50 minutes. I need a solution so that I can run my script without this timeout problem. Where is the problem, and how can I fix it?

 The input file cannot be read or no in proper format.
    Traceback (most recent call last):
      File "links_crawler.py", line 147, in <module>
        crawler.Run()
      File "links_crawler.py", line 71, in Run
        self.checkForNextPages()
      File "links_crawler.py", line 104, in checkForNextPages
        self.next.click()
      File "/home/dev/.local/lib/python2.7/site-packages/selenium/webdriver/remote/webelement.py", line 75, in click
        self._execute(Command.CLICK_ELEMENT)
      File "/home/dev/.local/lib/python2.7/site-packages/selenium/webdriver/remote/webelement.py", line 454, in _execute
        return self._parent.execute(command, params)
      File "/home/dev/.local/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 199, in execute
        response = self.command_executor.execute(driver_command, params)
      File "/home/dev/.local/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 395, in execute
        return self._request(command_info[0], url, body=data)
      File "/home/dev/.local/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 463, in _request
        resp = opener.open(request, timeout=self._timeout)
      File "/usr/lib/python2.7/urllib2.py", line 431, in open
        response = self._open(req, data)
      File "/usr/lib/python2.7/urllib2.py", line 449, in _open
        '_open', req)
      File "/usr/lib/python2.7/urllib2.py", line 409, in _call_chain
        result = func(*args)
      File "/usr/lib/python2.7/urllib2.py", line 1227, in http_open
        return self.do_open(httplib.HTTPConnection, req)
      File "/usr/lib/python2.7/urllib2.py", line 1200, in do_open
        r = h.getresponse(buffering=True)
      File "/usr/lib/python2.7/httplib.py", line 1127, in getresponse
        response.begin()
      File "/usr/lib/python2.7/httplib.py", line 453, in begin
        version, status, reason = self._read_status()
      File "/usr/lib/python2.7/httplib.py", line 417, in _read_status
        raise BadStatusLine(line)
    httplib.BadStatusLine: ''

Code:

import os
import re

from selenium import webdriver


class Crawler():
    def __init__(self, where_to_save, verbose=0):
        self.link_to_explore = ''
        self.TAG_RE = re.compile(r'<[^>]+>')
        # The (?s) inline flag must appear at the start of the pattern, not the end
        self.TAG_SCRIPT = re.compile(r'(?s)<(script).*?</\1>')
        if verbose == 1:
            self.driver = webdriver.Firefox()
        else:
            self.driver = webdriver.PhantomJS()
        self.links = []
        self.next = True
        self.where_to_save = where_to_save
        self.logs = self.where_to_save + "/logs"
        self.outputs = self.where_to_save + "/outputs"
        self.logfile = ''
        self.rnd = 0
        # Create the log and output directories if they do not exist yet
        try:
            os.stat(self.logs)
        except OSError:
            os.makedirs(self.logs)
        try:
            os.stat(self.outputs)
        except OSError:
            os.makedirs(self.outputs)

try:
    fin = open(file_to_read, "r")
    FileContent = fin.read()
    fin.close()
    crawler = Crawler(where_to_save)
    data = FileContent.split("\n")
    for info in data:
        if info != "":
            to_process = info.split("|")
            link = to_process[0].strip()
            category = to_process[1].strip().replace(' ', '_')
            print "Processing the link: " + link
            crawler.Init(link, category)
            crawler.Run()
            crawler.End()
    crawler.closeSpider()
except:
    print "The input file cannot be read or no in proper format."
    raise

1 Answer:

Answer 0 (score: 0)

If you don't want the timeout to stop your script, you can catch the exception `selenium.common.exceptions.TimeoutException` and pass it.

You can set the default page load timeout with the driver's `set_page_load_timeout()` method, like this:

    driver.set_page_load_timeout(10)

This will raise a `TimeoutException` if your page does not load within 10 seconds.

EDIT: Forgot to mention that you have to put your code in a loop.

Add the import:

    from selenium.common.exceptions import TimeoutException
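Putting the answer's pieces together, here is a minimal sketch of the "catch the exception and retry in a loop" idea. The `run_with_retries` helper and the stand-in `flaky` task are hypothetical names introduced for illustration; in the real crawler, `task` would be a driver call (e.g. loading a page) and `retry_on` would be selenium's `TimeoutException`:

```python
# Sketch of the answer's "put your code in a loop" advice: retry a callable
# whenever one of the given exception types is raised, up to max_tries times.
def run_with_retries(task, retry_on, max_tries=3):
    last_exc = None
    for attempt in range(max_tries):
        try:
            return task()
        except retry_on as exc:
            last_exc = exc  # e.g. TimeoutException: pass and try again
    raise last_exc  # give up after max_tries failures

# Usage with a stand-in task that fails twice, then succeeds:
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated timeout")
    return "page loaded"

print(run_with_retries(flaky, RuntimeError))  # -> page loaded
```

With a real driver you would pass something like `lambda: driver.get(url)` as the task; note that a retry loop only masks the symptom, so if PhantomJS itself keeps dying (the `BadStatusLine` in the traceback suggests the browser process exited), you may also need to recreate the driver between retries.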