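I want to crawl a site where some of the content is generated by js. The site runs js every 5 seconds to update the content (it requests a new encrypted js file, which I can't parse).
My code: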
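    from selenium import webdriver
    from selenium.common.exceptions import StaleElementReferenceException

    driver = webdriver.PhantomJS()
    driver.set_window_size(1120, 550)
    driver.get(url)  # url: the page to crawl (not shown in the question)
    trs = driver.find_elements_by_css_selector('.table tbody tr')
    print(len(trs))
    items = []
    for tr in trs:
        try:
            items.append(tr.text)
        except StaleElementReferenceException:
            # the js updated the content, so this tr is no longer in the DOM
            pass
    print(len(items))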
len(items) does not match len(trs).
How can I tell selenium to stop executing js, or to stop working, after I run trs = driver.find_elements_by_css_selector('.table tbody tr')?
I need to use trs later, so I can't call driver.quit().
Exception details:
    ---------------------------------------------------------------------------
    StaleElementReferenceException Traceback (most recent call last)
    <ipython-input-84-b80e3579efca> in <module>()
    11 items = []
    12 for tr in trs:
    ---> 13 items.append(tr.text)
    14 #items.append(map_label(hidemyass_label, tr.find_elements_by_tag_name('td')))
    15

    C:\Python27\lib\site-packages\selenium\webdriver\remote\webelement.pyc in text(self)
    69 def text(self):
    70 """The text of the element."""
    ---> 71 return self._execute(Command.GET_ELEMENT_TEXT)['value']
    72
    73 def click(self):

    C:\Python27\lib\site-packages\selenium\webdriver\remote\webelement.pyc in _execute(self, command, params)
    452 params = {}
    453 params['id'] = self._id
    --> 454 return self._parent.execute(command, params)
    455
    456 def find_element(self, by=By.ID, value=None):

    C:\Python27\lib\site-packages\selenium\webdriver\remote\webdriver.pyc in execute(self, driver_command, params)
    199 response = self.command_executor.execute(driver_command, params)
    200 if response:
    --> 201 self.error_handler.check_response(response)
    202 response['value'] = self._unwrap_value(
    203 response.get('value', None))

    C:\Python27\lib\site-packages\selenium\webdriver\remote\errorhandler.pyc in check_response(self, response)
    179 elif exception_class == UnexpectedAlertPresentException and 'alert' in value:
    180 raise exception_class(message, screen, stacktrace, value['alert'].get('text'))
    --> 181 raise exception_class(message, screen, stacktrace)
    182
    183 def _value_or_default(self, obj, key, default):

    StaleElementReferenceException: Message: {"errorMessage":"Element is no longer attached to the DOM","request":{"headers":{"Accept":"application/json","Accept-Encoding":"identity","Connection":"close","Content-Type":"application/json;charset=UTF-8","Host":"127.0.0.1:63305","User-Agent":"Python-urllib/2.7"},"httpVersion":"1.1","method":"GET","url":"/text","urlParsed":{"anchor":"","query":"","file":"text","directory":"/","path":"/text","relative":"/text","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/text","queryKey":{},"chunks":["text"]},"urlOriginal":"/session/4bb16340-a3b6-11e5-8ce5-9d0be40203a6/element/%3Awdc%3A1450243990539/text"}}
    Screenshot: available via screen
Obviously a tr is missing.
PS: I need to use selenium to select the elements. Other libraries such as lxml and pyquery don't know which elements are display:none, often return comments from .text(), and have other bugs. It's a pity that Python doesn't have a perfect clone of jQuery.
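Answer 0 (score: 1)
Use scrapy. Once you are sure the page has loaded, grab the body with:
    response = TextResponse(url=self.driver.current_url, body=self.driver.page_source, encoding='utf-8')
You now have a static copy of the page, so you can use scrapy's response.xpath to extract whatever data you need. This answer goes into more detail.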
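If pulling in scrapy is not an option, the same "take a static copy first" idea can probably be approximated with selenium alone. This is a different route from the scrapy one above, not a replacement for it: a minimal sketch that reads the text of every row in one execute_script call, so the page's 5-second update should not be able to run between reading two rows. The names ROW_TEXT_SCRIPT and snapshot_row_texts are made up for this illustration, and it assumes the browser's innerText is close enough to what tr.text would report.

```python
# JavaScript run inside the page in a single synchronous call.
# arguments[0] is the CSS selector passed in from Python.
ROW_TEXT_SCRIPT = """
var rows = document.querySelectorAll(arguments[0]);
var texts = [];
for (var i = 0; i < rows.length; i++) {
    texts.push(rows[i].innerText);
}
return texts;
"""


def snapshot_row_texts(driver, selector='.table tbody tr'):
    """Return the text of every matching row, read in one browser round trip.

    `driver` is assumed to be a selenium WebDriver (or anything exposing a
    compatible execute_script method).
    """
    texts = driver.execute_script(ROW_TEXT_SCRIPT, selector)
    # Normalise a missing result to an empty list and trim whitespace.
    return [text.strip() for text in (texts or [])]
```

With the driver from the question this would be called as items = snapshot_row_texts(driver). The result is a list of plain strings rather than WebElement objects, so nothing in it can go stale afterwards, but it also cannot be clicked or queried further.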