Question

I'm writing my first web crawler with Python 3 and Selenium:


import os
import socket
import urllib.error

import selenium
import selenium.common.exceptions
import urllib3
from selenium import webdriver

GECKO_PATH = os.path.expanduser("~/code/geckodriver")
DRIVER = None

def get_page_javascript(url):
    "Downloads a simple webpage with javascript."
    driver = get_driver()
    driver.get(url)
    return driver.page_source

def crawl_page(url):
    """Gets the page source and returns all links on the page.
    """

    ...

def main():
    to_visit = ["<some website>"]

    while to_visit:
        url = to_visit.pop(0)

        try:
            print("About to visit '%s'" % url)
            html, new_links = crawl_page(url=url)
        except (
            urllib.error.HTTPError,
            urllib.error.URLError,
            selenium.common.exceptions.UnexpectedAlertPresentException,
            socket.timeout,
            urllib3.exceptions.ReadTimeoutError,
        ) as e:
            print("Skipping '%s' due to error: %s" % (url, str(e)))
            #get_driver().quit()
            #init_driver()
            continue

        # Handle html
        ...

        # Add new links
        to_visit.extend(new_links)
        # Note: I omitted code to avoid double visits

If a page takes too long to load, the Gecko driver stays stuck on that page and all subsequent calls fail. I get:

About to visit http://www.moretonhall.org/News
Skipping 'http://www.some-website.org/News' due to error: HTTPConnectionPool(host='127.0.0.1', port=62772): Read timed out. (read timeout=)
About to visit http://www.some-website.org/Other-News
Skipping 'http://www.some-website.org/Other-News' due to error: HTTPConnectionPool(host='127.0.0.1', port=62772): Read timed out. (read timeout=)
About to visit http://www.some-website.org/More-News
Skipping 'http://www.some-website.org/More-News' due to error: HTTPConnectionPool(host='127.0.0.1', port=62772): Read timed out. (read timeout=)
About to visit http://www.some-website.org/Another-Page

At that point, the Gecko driver is still showing the first page, some-website.org/News.

If I try to restart the driver on every error by uncommenting these lines:

#get_driver().quit()
#init_driver()

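then I get a new Gecko instance for every error, plus uncaught errors that don't mention my code, possibly because the call to quit the driver runs on its own thread: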

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 384, in _make_request
    six.raise_from(e, None)
  File "", line 2, in raise_from
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 380, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1336, in getresponse
    response.begin()
  File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 306, in begin
    version, status, reason = self._read_status()
  File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 267, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out
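
The restart attempted with the two commented-out lines could be structured roughly as in the sketch below. `get_driver` and `init_driver` are not shown in this question, so this is only an assumption about what they do; `DriverHolder`, `make_driver` and `make_firefox` are hypothetical names. The idea is to tolerate a failing `quit()` (the driver may already be unresponsive) and to create the replacement afterwards.

```python
class DriverHolder:
    """Keeps a single WebDriver instance and replaces it on demand."""

    def __init__(self, make_driver):
        # make_driver: zero-argument callable returning a new driver (hypothetical).
        self._make_driver = make_driver
        self._driver = None

    def get_driver(self):
        # Create the driver lazily on first use.
        if self._driver is None:
            self._driver = self._make_driver()
        return self._driver

    def restart(self):
        # Forget the old driver first so a failing quit() cannot leave it in place.
        old, self._driver = self._driver, None
        if old is not None:
            try:
                old.quit()
            except Exception as exc:
                print("Ignoring error while quitting driver: %s" % exc)
        return self.get_driver()


def make_firefox():
    # Assumes selenium is installed and geckodriver can be found on PATH.
    from selenium import webdriver
    return webdriver.Firefox()
```

With this, the `except` branch in `main()` would call `holder.restart()` on a `DriverHolder(make_firefox)` instead of the two commented-out lines. Whether a hung geckodriver process actually exits after a failed `quit()` is not something this sketch guarantees.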

I also tried increasing the timeout to 30 seconds using the solutions from "How to set Selenium Python WebDriver default timeout?", but it had no effect; the timeout is still only a few seconds.
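
For reference, the browser-side limits are normally set on the driver object itself, as in this minimal sketch. `set_page_load_timeout` and `set_script_timeout` are existing WebDriver methods; `apply_timeouts` and its default values are hypothetical. These limits are separate from the HTTP read timeout between the Python client and geckodriver that shows up in the log above, so it is only an assumption that changing them affects that error.

```python
def apply_timeouts(driver, page_load_seconds=30, script_seconds=30):
    """Set browser-side time limits on an existing WebDriver instance."""
    # driver.get() raises TimeoutException if the page load exceeds this limit.
    driver.set_page_load_timeout(page_load_seconds)
    # Limit for asynchronous scripts run through execute_async_script().
    driver.set_script_timeout(script_seconds)
    return driver
```

It would be called once right after the driver is created, for example `apply_timeouts(get_driver())`.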

How do I properly restart the Gecko engine in Selenium, or tell it to give up and start over with a new URL?

How to restart the Selenium Gecko driver (HTTPConnectionPool(host='127.0.0.1', port=...): Read timed out.)?

0 Answers: