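time.sleep gives the desired scrape, but an explicit wait does not

Time: 2017-12-23 17:40:19

Tags: python selenium selenium-webdriver

Why do I get the output I want when I add time.sleep(2), but fewer results when I instead wait until a specific XPath is present?

Output with time.sleep(2) (this is the desired output):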

Adelaide Utd
Tottenham
Dundee Fc
 ...

Count: 145 names

Output with time.sleep removed:

Adelaide Utd
Tottenham
Dundee Fc
 ...

Count: 119 names

I have added:

clickMe = wait(driver, 13).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#page-container > div:nth-child(4) > div > div.ubet-sports-section-page > div > div:nth-child(2) > div > div > div:nth-child(1) > div > div > div.page-title-new > h1")))

because this element appears on every page.

It returns far fewer results, though. How can I fix this?

Script:

import csv

from selenium import webdriver
from selenium.common.exceptions import StaleElementReferenceException, TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait as wait


driver = webdriver.Chrome()
driver.set_window_size(1024, 600)
driver.maximize_window()

driver.get('https://ubet.com/sports/soccer')

# Wait for the sport drop-down to render, then count its options.
clickMe = wait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, '//select[./option="Soccer"]/option')))
options = driver.find_elements(By.XPATH, '//select[./option="Soccer"]/option')

for index in range(len(options)):
    try:
        # Select the next drop-down entry (XPath positions are 1-based).
        try:
            zz = wait(driver, 10).until(
                EC.element_to_be_clickable((By.XPATH, '(//select/optgroup/option)[%s]' % (index + 1))))
            zz.click()
        except StaleElementReferenceException:
            pass

        # Wait for the page title, which is present on every page.
        clickMe = wait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#page-container > div:nth-child(4) > div > div.ubet-sports-section-page > div > div:nth-child(2) > div > div > div:nth-child(1) > div > div > div.page-title-new > h1")))

        # Collect the names shown on the current page.
        langs0 = driver.find_elements(
            By.CSS_SELECTOR,
            "div > div > div > div > div > div > div > div > div.row.collapse > div > div > div:nth-child(2) > div > div > div > div > div > div.row.small-collapse.medium-collapse > div:nth-child(1) > div > div > div > div.lbl-offer > span")
        langs0_text = []

        for lang in langs0:
            try:
                langs0_text.append(lang.text)
            except StaleElementReferenceException:
                pass

        # Append the scraped names to the CSV file, one name per row.
        path = 'C:\\A.csv'
        with open(path, 'a', newline='', encoding="utf-8") as outfile:
            writer = csv.writer(outfile)
            for name in langs0_text:
                writer.writerow([name])
    except StaleElementReferenceException:
        pass

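If you cannot access the page, you will need a VPN.

Update...

Perhaps that element loads before the others. So what if we change the wait to target the data being scraped instead (not every page has data to scrape)?

Added:

try: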

    clickMe = wait(driver, 13).until(EC.element_to_be_clickable((By.CSS_SELECTOR, ("div > div > div > div > div > div > div > div > div.row.collapse > div > div > div:nth-child(2) > div > div > div > div > div > div.row.small-collapse.medium-collapse > div:nth-child(3) > div > div > div > div.lbl-offer > span"))))
except TimeoutException as ex:
    pass

The same problem persists.

Manual steps:

  1. Load the page: driver.get('https://ubet.com/sports/soccer')
  2. Click an entry in the drop-down: (//select/optgroup/option)
  3. Wait for the page elements to load so they can be scraped.
  4. Scrape:

    div > div > div > div > div > div > div > div > div.row.collapse > div > div > div:nth-child(2) > div > div > div > div > div > div.row.small-collapse.medium-collapse > div:nth-child(1) > div > div > div > div.lbl-offer > span

  5. Repeat from step 2 for the next drop-down entry.

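2 answers:

Answer 0: (score: 6)

The site is built on AngularJS, so your best bet is to wait until Angular has finished processing all AJAX requests (I won't go into the underlying mechanics, but there is plenty of material on the topic around the web). For this, I usually define a custom expected condition to check while waiting: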

class NgReady:

    js = ('return (window.angular !== undefined) && '
          '(angular.element(document).injector() !== undefined) && '
          '(angular.element(document).injector().get("$http").pendingRequests.length === 0)')

    def __call__(self, driver):
        return driver.execute_script(self.js)

# NgReady does not have any internal state, so one instance 
# can be reused for waiting multiple times
ng_ready = NgReady()

Now use it to wait after zz.click():

zz.click()
wait(driver, 10).until(ng_ready)
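
To see why a plain callable object is enough to serve as a wait condition, here is a minimal pure-Python sketch of a polling wait. It is a simplified stand-in for what WebDriverWait(...).until(...) does (call the condition with the driver until it returns a truthy value or the timeout elapses), not Selenium's actual implementation. The names poll_until and FakeDriver, the sleep parameter, and the use of the built-in TimeoutError are assumptions made only for this illustration.

```python
import time


def poll_until(driver, condition, timeout=10.0, poll_frequency=0.5, sleep=time.sleep):
    """Call condition(driver) repeatedly until it returns a truthy value.

    Simplified stand-in for WebDriverWait(driver, timeout).until(condition).
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition(driver)
        if result:
            return result
        if time.monotonic() >= deadline:
            # Selenium raises its own TimeoutException here; this sketch
            # uses the built-in TimeoutError instead.
            raise TimeoutError("condition was not met within %s seconds" % timeout)
        sleep(poll_frequency)


class FakeDriver:
    """Hypothetical driver whose script result becomes True on the third call."""

    def __init__(self):
        self.calls = 0

    def execute_script(self, js):
        self.calls += 1
        return self.calls >= 3


class NgReady:
    js = 'return true'  # placeholder; the real script is the one shown above

    def __call__(self, driver):
        return driver.execute_script(self.js)


fake = FakeDriver()
# poll_frequency=0 and a no-op sleep keep the demonstration instant.
ready = poll_until(fake, NgReady(), timeout=5, poll_frequency=0, sleep=lambda s: None)
print(ready, fake.calls)
```

The condition is asked three times before it reports readiness, which is the same thing that happens when ng_ready is polled while Angular still has pending requests.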

Tests

  1. Original code unmodified (no sleep and no waiting for ng_ready):

    $ python so-47954604.py && wc -l out.csv && rm out.csv
    86 out.csv
    
  2. Using time.sleep(10) after zz.click():

    $ python so-47954604.py && wc -l out.csv && rm out.csv
    101 out.csv
    
  3. Same result when using wait(driver, 10).until(ng_ready) after zz.click():

    $ python so-47954604.py && wc -l out.csv && rm out.csv
    101 out.csv
    
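Credits

NgReady is not my invention. I only ported it to Python from a Java implementation of this expected condition that I found in another answer, so all credit goes to the author of that answer.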

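Answer 1: (score: 4)

@hoefling's idea is absolutely right, but here is an addition to the "wait for Angular" part.

The logic used in NgReady only checks that angular is defined and that there are no pending requests. Even though it works for this website, it is not a definitive answer to the question of whether Angular is ready to work with.

If we look at what Protractor, the Angular end-to-end testing framework, does to "sync" with Angular, it uses the "Testability" API built into Angular.

There is also the pytractor package, which extends selenium webdriver instances with a WebDriverMixin that automatically keeps the driver and Angular in sync on every interaction.

You can either start using pytractor directly (although it has been abandoned as a package), or we can try to apply the ideas implemented there in order to always keep our webdriver synced with Angular. For that, let's create this waitForAngular.js script (we'll only use the Angular 1 and 2 support logic; we can always extend it with the relevant Protractor client-side script):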

try { return (function (rootSelector, callback) {
  var el = document.querySelector(rootSelector);
  try {
    if (!window.angular) {
      throw new Error('angular could not be found on the window');
    }

    if (angular.getTestability) {
      angular.getTestability(el).whenStable(callback);
    } else {
      if (!angular.element(el).injector()) {
        throw new Error('root element (' + rootSelector + ') has no injector.' +
           ' this may mean it is not inside ng-app.');
      }
      angular.element(el).injector().get('$browser').
          notifyWhenNoOutstandingRequests(callback);
    }
  } catch (err) {
    callback(err.message);
  }
}).apply(this, arguments); }
catch(e) { throw (e instanceof Error) ? e : new Error(e); }

Then let's subclass webdriver.Chrome and patch the execute() method, so that every time an interaction is made, we check whether Angular is ready before the interaction:

import csv

from selenium import webdriver
from selenium.webdriver.remote.command import Command
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.webdriver.common.by import By
from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.support import expected_conditions as EC


COMMANDS_NEEDING_WAIT = [
    Command.CLICK_ELEMENT,
    Command.SEND_KEYS_TO_ELEMENT,
    Command.GET_ELEMENT_TAG_NAME,
    Command.GET_ELEMENT_VALUE_OF_CSS_PROPERTY,
    Command.GET_ELEMENT_ATTRIBUTE,
    Command.GET_ELEMENT_TEXT,
    Command.GET_ELEMENT_SIZE,
    Command.GET_ELEMENT_LOCATION,
    Command.IS_ELEMENT_ENABLED,
    Command.IS_ELEMENT_SELECTED,
    Command.IS_ELEMENT_DISPLAYED,
    Command.SUBMIT_ELEMENT,
    Command.CLEAR_ELEMENT
]


class ChromeWithAngular(webdriver.Chrome):
    def __init__(self, root_element, *args, **kwargs):
        self.root_element = root_element

        with open("waitForAngular.js") as f:
            self.script = f.read()

        super(ChromeWithAngular, self).__init__(*args, **kwargs)

    def wait_for_angular(self):
        self.execute_async_script(self.script, self.root_element)

    def execute(self, driver_command, params=None):
        if driver_command in COMMANDS_NEEDING_WAIT:
            self.wait_for_angular()
        return super(ChromeWithAngular, self).execute(driver_command, params=params)


driver = ChromeWithAngular(root_element='body')

# the rest of the code as is with what you had 

Again, this is heavily inspired by the pytractor and protractor projects.
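
The interception pattern used by ChromeWithAngular can be seen in isolation with a stand-in base class, without a browser. This is only a sketch: FakeRemoteDriver, SyncedDriver, wait_for_app and the three command names are hypothetical and exist purely to show that the hook runs before the listed commands and nowhere else.

```python
# Hypothetical command names for the sketch; the real code uses the
# constants from selenium.webdriver.remote.command.Command.
CLICK = "clickElement"
GET_TEXT = "getElementText"
GET_URL = "getCurrentUrl"

COMMANDS_NEEDING_WAIT = {CLICK, GET_TEXT}


class FakeRemoteDriver:
    """Stand-in for webdriver.Chrome: it only records executed commands."""

    def __init__(self):
        self.log = []

    def execute(self, driver_command, params=None):
        self.log.append(driver_command)
        return {"value": None}


class SyncedDriver(FakeRemoteDriver):
    """Runs a synchronisation hook before every command listed above."""

    def wait_for_app(self):
        # The real class would run the waitForAngular.js script here.
        self.log.append("wait")

    def execute(self, driver_command, params=None):
        if driver_command in COMMANDS_NEEDING_WAIT:
            self.wait_for_app()
        return super().execute(driver_command, params=params)


synced = SyncedDriver()
synced.execute(GET_URL)
synced.execute(CLICK, {"id": "some-element"})
synced.execute(GET_TEXT, {"id": "some-element"})
print(synced.log)
# ['getCurrentUrl', 'wait', 'clickElement', 'wait', 'getElementText']
```

One detail worth noting about the real class: execute_async_script is itself routed through execute(), so the script-execution command must stay out of COMMANDS_NEEDING_WAIT, otherwise wait_for_angular() would call itself recursively.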