I'm using Selenium with the Python bindings to scrape AJAX content from a web page with headless Firefox. It works perfectly on my local machine. When I run the exact same script on my VPS, errors are thrown on seemingly random (but consistent) lines. My local and remote systems have the same OS/architecture, so I'm guessing the difference is VPS-related.
For each traceback, the offending line runs four times before the error is raised.
I frequently get this URLError while executing JavaScript to scroll an element into view:
File "google_scrape.py", line 18, in _get_data
driver.execute_script("arguments[0].scrollIntoView(true);", e)
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 396, in execute_script
{'script': script, 'args':converted_args})['value']
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 162, in execute
response = self.command_executor.execute(driver_command, params)
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 355, in execute
return self._request(url, method=command_info[0], data=data)
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 402, in _request
response = opener.open(request)
File "/usr/lib64/python2.7/urllib2.py", line 404, in open
response = self._open(req, data)
File "/usr/lib64/python2.7/urllib2.py", line 422, in _open
'_open', req)
File "/usr/lib64/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib64/python2.7/urllib2.py", line 1214, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib64/python2.7/urllib2.py", line 1184, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error [Errno 111] Connection refused>
I occasionally get a BadStatusLine while reading text from an element:
File "google_scrape.py", line 19, in _get_data
if e.text.strip():
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webelement.py", line 55, in text
return self._execute(Command.GET_ELEMENT_TEXT)['value']
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webelement.py", line 233, in _execute
return self._parent.execute(command, params)
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 162, in execute
response = self.command_executor.execute(driver_command, params)
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 355, in execute
return self._request(url, method=command_info[0], data=data)
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 402, in _request
response = opener.open(request)
File "/usr/lib64/python2.7/urllib2.py", line 404, in open
response = self._open(req, data)
File "/usr/lib64/python2.7/urllib2.py", line 422, in _open
'_open', req)
File "/usr/lib64/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib64/python2.7/urllib2.py", line 1214, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib64/python2.7/urllib2.py", line 1187, in do_open
r = h.getresponse(buffering=True)
File "/usr/lib64/python2.7/httplib.py", line 1045, in getresponse
response.begin()
File "/usr/lib64/python2.7/httplib.py", line 409, in begin
version, status, reason = self._read_status()
File "/usr/lib64/python2.7/httplib.py", line 373, in _read_status
raise BadStatusLine(line)
httplib.BadStatusLine: ''
A couple of times I've received a socket error:
File "google_scrape.py", line 19, in _get_data
if e.text.strip():
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webelement.py", line 55, in text
return self._execute(Command.GET_ELEMENT_TEXT)['value']
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webelement.py", line 233, in _execute
return self._parent.execute(command, params)
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 162, in execute
response = self.command_executor.execute(driver_command, params)
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 355, in execute
return self._request(url, method=command_info[0], data=data)
File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 402, in _request
response = opener.open(request)
File "/usr/lib64/python2.7/urllib2.py", line 404, in open
response = self._open(req, data)
File "/usr/lib64/python2.7/urllib2.py", line 422, in _open
'_open', req)
File "/usr/lib64/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib64/python2.7/urllib2.py", line 1214, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib64/python2.7/urllib2.py", line 1187, in do_open
r = h.getresponse(buffering=True)
File "/usr/lib64/python2.7/httplib.py", line 1045, in getresponse
response.begin()
File "/usr/lib64/python2.7/httplib.py", line 409, in begin
version, status, reason = self._read_status()
File "/usr/lib64/python2.7/httplib.py", line 365, in _read_status
line = self.fp.readline(_MAXLINE + 1)
File "/usr/lib64/python2.7/socket.py", line 476, in readline
data = self._sock.recv(self._rbufsize)
socket.error: [Errno 104] Connection reset by peer
I'm scraping from Google without a proxy, so my first thought was that my IP address had been recognized as a VPS and placed under a five-page-action limit. But my initial research suggests that these errors would not result from being blocked.
Any insight into the collective meaning of these errors, or into the considerations necessary when making HTTP requests from a VPS, would be greatly appreciated.
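Since all three tracebacks end in transient transport failures (connection refused, empty status line, connection reset), one pragmatic workaround is to retry the failing webdriver call. This is a minimal sketch of such a helper, not part of the original script; in the real script `RETRYABLE` would also include `urllib2.URLError` and `httplib.BadStatusLine`, but only `socket.error` is used here so the sketch stays self-contained:

```python
import socket
import time

# Transient errors worth retrying. socket.error covers "connection
# refused" and "connection reset by peer"; extend with urllib2.URLError
# and httplib.BadStatusLine in the actual scraping script.
RETRYABLE = (socket.error,)

def with_retries(fn, attempts=3, delay=1.0):
    """Call fn(), retrying on transient connection errors.

    Re-raises the last error if every attempt fails.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except RETRYABLE:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)
```

A call like `driver.execute_script(...)` would then be wrapped as `with_retries(lambda: driver.execute_script(...))`. This treats the symptom rather than the cause, but it distinguishes a flaky connection from a hard block.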
After some thought, and after looking into what a webdriver really is - automated browser input - I became confused as to why remote_connection.py is making urllib2 requests at all. It appears that the text method of the WebElement class is an "extra" feature of the Python bindings that is not part of the Selenium core. That doesn't explain the errors above, but it may indicate that the text method shouldn't be used for scraping.
I realized that, for my purposes, the only function Selenium serves is loading the AJAX content. So after the page loads, I parse the source with lxml rather than getting elements through Selenium, i.e.:
html = lxml.html.fromstring(driver.page_source)
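The point of this approach is that once the source is parsed, element text can be read in-process with no further HTTP round-trips to the webdriver. A minimal sketch, with a literal HTML snippet standing in for driver.page_source (the `h3` selector and snippet are illustrative, not from the original script):

```python
import lxml.html

# Stand-in for driver.page_source after the AJAX content has loaded.
page_source = '<div id="res"><h3>First result</h3><h3> </h3></div>'
html = lxml.html.fromstring(page_source)

# In-process equivalent of the script's "if e.text.strip():" check -
# no webdriver call, hence no urllib2 request per element.
texts = [e.text_content().strip() for e in html.xpath('//h3')]
texts = [t for t in texts if t]
```

Every per-element `.text` access in the original script was one urllib2 request; this replaces all of them with a single page fetch.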
However, page_source is another method that results in a urllib2 call, and on its second use I consistently got a BadStatusLine error. Minimizing urllib2 requests is definitely a step in the right direction.
Better still is eliminating the urllib2 request by grabbing the source with JavaScript:
html = lxml.html.fromstring(driver.execute_script("return window.document.documentElement.outerHTML"))
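That one-liner can be wrapped so the JavaScript route is preferred and page_source is only a fallback. This is a hypothetical helper, not from the original script; `driver` is any object exposing the two webdriver attributes used below:

```python
# Script returning the full serialized DOM, as in the line above.
_JS_SOURCE = "return window.document.documentElement.outerHTML"

def get_source(driver):
    """Fetch page source via one execute_script call when possible."""
    try:
        return driver.execute_script(_JS_SOURCE)
    except Exception:
        # Fallback costs an extra urllib2 round-trip, as noted above.
        return driver.page_source
```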
Executing time.sleep(10) every few requests avoids these errors altogether. The best explanation I've come up with is that Google's firewall recognizes my IP as a VPS and therefore places it under a stricter set of blocking rules.
That was my initial thought, but I still find it hard to believe, because my web searches turn up no indication that the errors above can be caused by a firewall.
If that is the case, I would think the stricter rules could be circumvented with a proxy, though that proxy would probably have to be a local or home-hosted system to avoid the same treatment.
Answer 0 (score: 3)
Per our conversation, you've found that Google employs anti-scraping measures even against small volumes of daily searches. The solution is to delay a few seconds between fetches.
In the general case, since you are technically shifting unrecoverable costs onto a third party, it is always good practice to try to reduce the extra resource load you place on remote servers. Without pauses between HTTP fetches, a fast server and connection can amount to a remote denial of service, especially against scraping targets that don't have Google's server resources.
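The pausing advice can be sketched as a small throttled loop; `fetch_fn` is a stand-in for whatever performs one search or page fetch via the driver (the 10-second default mirrors the time.sleep(10) mentioned above):

```python
import time

def fetch_all(queries, fetch_fn, pause=10.0, sleep=time.sleep):
    """Run fetch_fn over queries, pausing between fetches.

    sleep is injectable so the throttling can be tested without waiting.
    """
    results = []
    for i, q in enumerate(queries):
        if i:              # no pause before the very first fetch
            sleep(pause)   # give the remote server breathing room
        results.append(fetch_fn(q))
    return results
```

For n queries this sleeps n-1 times, keeping the request rate low enough that neither Google's blocking rules nor a smaller target's resources are strained.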