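How can I tell selenium to stop executing JS at some point?

Time: 2015-12-16 02:38:43

Tags: python selenium web-crawler

I want to crawl a site where some of the content is generated by JS. The site runs JS every 5 seconds to update the content (it requests a new encrypted JS file, which I can't parse).

My code: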

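    from selenium import webdriver
    from selenium.common.exceptions import StaleElementReferenceException

    driver = webdriver.PhantomJS()
    driver.set_window_size(1120, 550)
    driver.get(url)  # url: the target page (not shown in the question)

    trs = driver.find_elements_by_css_selector('.table tbody tr')
    print(len(trs))
    items = []
    for tr in trs:
        try:
            items.append(tr.text)
        except StaleElementReferenceException:
            # the JS refreshed the content, so this tr is gone
            pass
    print(len(items))

len(items) does not match len(trs). How can I tell selenium to stop executing JS, or to stop working altogether, once I have collected trs?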

I need to use trs later, so I can't call driver.quit().

Exception details:

    ---------------------------------------------------------------------------
    StaleElementReferenceException            Traceback (most recent call last)
    <ipython-input-84-b80e3579efca> in <module>()
         11 items = []
         12 for tr in trs:
    ---> 13     items.append(tr.text)
         14     #items.append(map_label(hidemyass_label, tr.find_elements_by_tag_name('td')))
         15

    C:\Python27\lib\site-packages\selenium\webdriver\remote\webelement.pyc in text(self)
         69     def text(self):
         70         """The text of the element."""
    ---> 71         return self._execute(Command.GET_ELEMENT_TEXT)['value']
         72
         73     def click(self):

    C:\Python27\lib\site-packages\selenium\webdriver\remote\webelement.pyc in _execute(self, command, params)
        452             params = {}
        453         params['id'] = self._id
    --> 454         return self._parent.execute(command, params)
        455
        456     def find_element(self, by=By.ID, value=None):

    C:\Python27\lib\site-packages\selenium\webdriver\remote\webdriver.pyc in execute(self, driver_command, params)
        199         response = self.command_executor.execute(driver_command, params)
        200         if response:
    --> 201             self.error_handler.check_response(response)
        202             response['value'] = self._unwrap_value(
        203                 response.get('value', None))

    C:\Python27\lib\site-packages\selenium\webdriver\remote\errorhandler.pyc in check_response(self, response)
        179         elif exception_class == UnexpectedAlertPresentException and 'alert' in value:
        180             raise exception_class(message, screen, stacktrace, value['alert'].get('text'))
    --> 181         raise exception_class(message, screen, stacktrace)
        182
        183     def _value_or_default(self, obj, key, default):

    StaleElementReferenceException: Message: {"errorMessage":"Element is no longer attached to the DOM","request":{"headers":{"Accept":"application/json","Accept-Encoding":"identity","Connection":"close","Content-Type":"application/json;charset=UTF-8","Host":"127.0.0.1:63305","User-Agent":"Python-urllib/2.7"},"httpVersion":"1.1","method":"GET","url":"/text","urlParsed":{"anchor":"","query":"","file":"text","directory":"/","path":"/text","relative":"/text","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/text","queryKey":{},"chunks":["text"]},"urlOriginal":"/session/4bb16340-a3b6-11e5-8ce5-9d0be40203a6/element/%3Awdc%3A1450243990539/text"}}
    Screenshot: available via screen

Obviously the tr is missing.

PS: I need to use selenium to select the elements. Other libraries have bugs of their own: lxml doesn't know which elements are display:none, pyquery often returns comments in .text(), and so on. Sadly, Python has no perfect clone of jQuery.

1 answer:

Answer 0: (score: 1)

Use scrapy. Once you are sure the page has loaded, grab the body with:

    from scrapy.http import TextResponse

    response = TextResponse(url=self.driver.current_url, body=self.driver.page_source, encoding='utf-8')

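You now have a static copy of the page, so you can use scrapy's response.xpath to extract whatever data you need. Because the copy is just a string, later JS updates in the browser cannot invalidate it. This answer goes into more detail.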
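
To illustrate why a static copy avoids the stale-element problem, here is a minimal sketch that parses a frozen HTML string. It is not the scrapy call from this answer: it swaps in the standard library's html.parser so it runs without extra packages, and it is written for Python 3. The names RowTextParser and snapshot_rows and the sample HTML are hypothetical; in practice the string would come from driver.page_source. Note that, like any static parser, it cannot tell which elements are display:none.

```python
from html.parser import HTMLParser


class RowTextParser(HTMLParser):
    """Collect the text of every <tr> found inside a <tbody>."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished row texts
        self._in_tbody = False
        self._cells = None      # text chunks of the current row, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tbody":
            self._in_tbody = True
        elif tag == "tr" and self._in_tbody:
            self._cells = []

    def handle_endtag(self, tag):
        if tag == "tbody":
            self._in_tbody = False
        elif tag == "tr" and self._cells is not None:
            self.rows.append(" ".join(self._cells))
            self._cells = None

    def handle_data(self, data):
        text = data.strip()
        if text and self._cells is not None:
            self._cells.append(text)


def snapshot_rows(page_source):
    """Parse a frozen HTML string; later DOM updates cannot affect it."""
    parser = RowTextParser()
    parser.feed(page_source)
    parser.close()
    return parser.rows


# Hypothetical snapshot; in practice this would be driver.page_source.
html = (
    "<table class='table'><tbody>"
    "<tr><td>1.2.3.4</td><td>8080</td></tr>"
    "<tr><td>5.6.7.8</td><td>3128</td></tr>"
    "</tbody></table>"
)
print(snapshot_rows(html))  # prints ['1.2.3.4 8080', '5.6.7.8 3128']
```

The point of the design is that page_source is read from the browser once; everything after that works on an ordinary string, so the number of rows found and the number of texts extracted always agree.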