I am writing my first web crawler with Python 3 and Selenium:
import os
import socket
import urllib.error

import selenium
import selenium.common.exceptions
import urllib3
from selenium import webdriver

GECKO_PATH = os.path.expanduser("~/code/geckodriver")
DRIVER = None

def get_page_javascript(url):
    """Downloads a simple webpage with javascript."""
    driver = get_driver()
    driver.get(url)
    return driver.page_source

def crawl_page(url):
    """Gets the page source and returns all links on the page.
    """
    ...

def main():
    to_visit = ["<some website>"]
    while to_visit:
        url = to_visit.pop(0)
        try:
            print("About to visit '%s'" % url)
            html, new_links = crawl_page(url=url)
        except (urllib.error.HTTPError, urllib.error.URLError, selenium.common.exceptions.UnexpectedAlertPresentException, socket.timeout, urllib3.exceptions.ReadTimeoutError) as e:
            print("Skipping '%s' due to error: %s" % (url, str(e)))
            #get_driver().quit()
            #init_driver()
            continue
        # Handle html
        ...
        # Add new links
        to_visit.extend(new_links)
        # Note: I omitted code to avoid double visits
If a page takes too long to load, the Gecko driver stays stuck on that page and every subsequent call fails. I get:
About to visit http://www.moretonhall.org/News
Skipping 'http://www.some-website.org/News' due to error: HTTPConnectionPool(host='127.0.0.1', port=62772): Read timed out. (read timeout=)
About to visit http://www.some-website.org/Other-News
Skipping 'http://www.some-website.org/Other-News' due to error: HTTPConnectionPool(host='127.0.0.1', port=62772): Read timed out. (read timeout=)
About to visit http://www.some-website.org/More-News
Skipping 'http://www.some-website.org/More-News' due to error: HTTPConnectionPool(host='127.0.0.1', port=62772): Read timed out. (read timeout=)
About to visit http://www.some-website.org/Another-Page
At that point, the Gecko driver is still displaying the first page, some-website.org/News.
If I try to restart the driver after each error by uncommenting these two lines:
#get_driver().quit()
#init_driver()
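then I get a new Gecko instance for every error, plus uncaught errors whose tracebacks never mention my code, possibly because the call that quits the driver runs on its own thread: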
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 384, in _make_request
    six.raise_from(e, None)
  File "", line 2, in raise_from
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 380, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 1336, in getresponse
    response.begin()
  File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 306, in begin
    version, status, reason = self._read_status()
  File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/http/client.py", line 267, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out
I also tried raising the timeout to 30 seconds using the solution from "How to set Selenium Python WebDriver default timeout?", but it had no effect; the timeout is still only a few seconds.
How do I properly restart the Gecko driver in Selenium, or tell it to give up on the current page and start over with a new URL?
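To make the question concrete, here is a minimal sketch of the restart-on-error pattern being asked about. It is not a confirmed fix for the hang. The names `restart_driver`, `fetch_with_restart` and the `make_driver` factory are hypothetical and not part of the code above; `make_driver` is assumed to be a zero-argument callable that builds a new Firefox WebDriver. The only Selenium calls used are `set_page_load_timeout`, `get`, `page_source` and `quit`.

```python
def restart_driver(old_driver, make_driver):
    """Quit old_driver (ignoring failures) and return a fresh driver.

    make_driver is a hypothetical zero-argument factory, for example a
    function that builds a new Firefox WebDriver.
    """
    if old_driver is not None:
        try:
            old_driver.quit()
        except Exception as exc:
            # A hung session may itself fail to quit cleanly.
            print("Ignoring error while quitting driver: %s" % exc)
    return make_driver()


def fetch_with_restart(url, driver, make_driver, timeout=30):
    """Return (page_source_or_None, driver_to_use_for_the_next_url)."""
    # Ask the driver to abort page loads that exceed `timeout` seconds.
    driver.set_page_load_timeout(timeout)
    try:
        driver.get(url)
        return driver.page_source, driver
    except Exception as exc:
        # Broad catch for the sketch only; in real code this would be
        # narrowed to the exception types listed in main().
        print("Skipping '%s' due to error: %s" % (url, exc))
        return None, restart_driver(driver, make_driver)
```

The caller keeps whichever driver comes back and passes it to the next call, so a failed page costs one restart instead of leaving every later request pointed at the stuck session.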