Question

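I am trying to scrape this web page with the script below.

I can't get the wait to work for this element, and it is not being scraped correctly.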

clickMe = wait(driver, 3).until(EC.element_to_be_clickable((By.CSS_SELECTOR, ('//a[@class='style-scope match-pop-market']'))))

The element shown in Chrome's inspector is correct.

//a[@class='style-scope match-pop-market']
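One thing worth checking in the call above: the selector string is XPath, but it is passed with `By.CSS_SELECTOR`, and the inner single quotes end the Python string early. Below is a minimal sketch of pairing each selector with its strategy. It assumes only Selenium's locator strings `"xpath"` and `"css selector"` (the values of `By.XPATH` and `By.CSS_SELECTOR`); `guess_strategy` is a hypothetical helper for illustration, not part of Selenium.

```python
XPATH = "xpath"                # same string as selenium's By.XPATH
CSS_SELECTOR = "css selector"  # same string as selenium's By.CSS_SELECTOR


def guess_strategy(selector: str) -> str:
    """Hypothetical helper: treat selectors that start like a path as XPath."""
    looks_like_xpath = selector.lstrip().startswith(("/", "./", "("))
    return XPATH if looks_like_xpath else CSS_SELECTOR


# Double quotes outside, single quotes inside, so the string is not cut short.
xpath_locator = (XPATH, "//a[@class='style-scope match-pop-market']")
# Rough CSS counterpart: matches any <a> that carries both classes.
css_locator = (CSS_SELECTOR, "a.style-scope.match-pop-market")

for strategy, selector in (xpath_locator, css_locator):
    # Each selector should be paired with the strategy it is written in.
    assert guess_strategy(selector) == strategy
    print(strategy, "->", selector)
```

Either tuple could then be passed to `EC.element_to_be_clickable(...)`. Note that the CSS form is slightly looser than the XPath form, which requires the `class` attribute to equal that exact string.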

How do I get elem_href for the current page only, without picking up elements from other pages that are not visible?

//div[@class='mpm_match_title' and .//div[@class='mpm_match_title style-scope match-pop-market']]//a[@class='style-scope match-pop-market']

This does not work, although in theory it should solve the problem. Any ideas? Current output:

None
None
None
None
None
None
None
None
None
None
https://www.palmerbet.com/sports/soccer/italy-serie-b/match/6381070
https://www.palmerbet.com/sports/soccer/italy-serie-b/match/6386987
https://www.palmerbet.com/sports/soccer/italy-serie-b/match/6386988
https://www.palmerbet.com/sports/soccer/italy-serie-b/match/6386989
https://www.palmerbet.com/sports/soccer/italy-serie-b/match/6386990
https://www.palmerbet.com/sports/soccer/italy-serie-b/match/6386991
https://www.palmerbet.com/sports/soccer/italy-serie-b/match/6386992
https://www.palmerbet.com/sports/soccer/italy-serie-b/match/6387025
https://www.palmerbet.com/sports/soccer/italy-serie-b/match/6387026
https://www.palmerbet.com/sports/soccer/italy-serie-b/match/6387027
https://www.palmerbet.com/sports/soccer/italy-serie-b/match/6387028

I can't wait for the element, because the wait ends up targeting elements that are not visible on the current page.
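One way to wait only for elements that are actually shown is a custom wait condition. `WebDriverWait.until` accepts any callable that takes the driver and returns a truthy value, so the sketch below returns the displayed matches, or `False` to keep polling. `visible_elements_located`, the `Fake*` stand-ins, and the `example.test` URLs are hypothetical, used only so the example runs without a browser. Selenium also ships `EC.visibility_of_any_elements_located`, which may already cover this case.

```python
def visible_elements_located(by, value):
    """Build a wait condition that yields only the displayed matches."""
    def _condition(driver):
        shown = [el for el in driver.find_elements(by, value) if el.is_displayed()]
        return shown or False  # a falsy result makes WebDriverWait keep polling
    return _condition


class FakeElement:
    """Stand-in for a WebElement (hypothetical, for demonstration only)."""
    def __init__(self, href, displayed):
        self.href, self.displayed = href, displayed

    def is_displayed(self):
        return self.displayed

    def get_attribute(self, name):
        return self.href if name == "href" else None


class FakeDriver:
    """Stand-in for a WebDriver that always returns the same elements."""
    def __init__(self, elements):
        self.elements = elements

    def find_elements(self, by, value):
        return list(self.elements)


fake_driver = FakeDriver([
    FakeElement("https://example.test/match/1", displayed=False),  # hidden
    FakeElement("https://example.test/match/2", displayed=True),   # shown
])
condition = visible_elements_located("xpath", "//a[@id='match_link']")
visible = condition(fake_driver)
print([el.get_attribute("href") for el in visible])
```

With a real driver, the same condition would be passed as `wait(driver, 3).until(visible_elements_located(By.XPATH, ...))`, with the locator filled in for the page being scraped.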

So:

//div[contains(@class, 'mpm_match_title')] #TEXT
//div[contains(@class, 'mpm_match_title style-scope match-pop-market')]  #BAR
//a[contains(@class, 'style-scope match-pop-market')] #HREF
style-scope match-pop-market

Combined:

//div[contains(@class, 'mpm_match_title') and .//div[contains(@class, 'mpm_match_title style-scope match-pop-market')]//a[@class='style-scope match-pop-market']

This is not found.
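A likely reason the combined expression is not found: it opens three `[` predicates but closes only two, so it is not valid XPath, whereas the earlier `@class='...'` version is balanced. A quick sanity check can be sketched with a hypothetical helper, `bracket_balance`, which counts brackets outside quoted strings:

```python
def bracket_balance(xpath: str) -> int:
    """Return the number of '[' minus ']' found outside quoted literals."""
    depth = 0
    quote = None  # the quote character we are currently inside, if any
    for ch in xpath:
        if quote:
            if ch == quote:
                quote = None
        elif ch in "'\"":
            quote = ch
        elif ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
    return depth


combined = (
    "//div[contains(@class, 'mpm_match_title') and "
    ".//div[contains(@class, 'mpm_match_title style-scope match-pop-market')]"
    "//a[@class='style-scope match-pop-market']"
)
earlier = (
    "//div[@class='mpm_match_title' and "
    ".//div[@class='mpm_match_title style-scope match-pop-market']]"
    "//a[@class='style-scope match-pop-market']"
)

print(bracket_balance(combined))  # 1: one predicate is never closed
print(bracket_balance(earlier))   # 0: balanced
```

A non-zero result only shows that the expression is malformed. A balanced expression can still match nothing if the class names or nesting differ from the live page.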

Desired output:

https://www.palmerbet.com/sports/soccer/italy-serie-b/match/6381070
https://www.palmerbet.com/sports/soccer/italy-serie-b/match/6386987
https://www.palmerbet.com/sports/soccer/italy-serie-b/match/6386988
https://www.palmerbet.com/sports/soccer/italy-serie-b/match/6386989
https://www.palmerbet.com/sports/soccer/italy-serie-b/match/6386990
https://www.palmerbet.com/sports/soccer/italy-serie-b/match/6386991
https://www.palmerbet.com/sports/soccer/italy-serie-b/match/6386992
https://www.palmerbet.com/sports/soccer/italy-serie-b/match/6387025
https://www.palmerbet.com/sports/soccer/italy-serie-b/match/6387026
https://www.palmerbet.com/sports/soccer/italy-serie-b/match/6387027
https://www.palmerbet.com/sports/soccer/italy-serie-b/match/6387028

Answer 1

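Using the code from the pastebin link in the comments, I essentially only changed the XPath so that it searches for the specific elements that identify links on the current page.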

from random import shuffle

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait as wait

driver = webdriver.Chrome()
driver.set_window_size(1024, 600)
driver.maximize_window()
driver.get('https://www.palmerbet.com/sports/soccer')

# Wait for the tournament filter labels to load, then collect them.
wait(driver, 3).until(EC.element_to_be_clickable((By.XPATH,
    '//*[contains(@class,"filter_labe")]')))
options = driver.find_elements(By.XPATH, '//*[contains(@class,"filter_labe")]')

# Visit the tournaments in random order.
indexes = list(range(len(options)))
shuffle(indexes)

# Match only the match-pop-market blocks that are not hidden with
# "display: none;" and that contain a match link with an href.
xp = '//sport-match-grp[not(contains(@style, "display: none;"))]' \
    '//match-pop-market[@class="sport-match-grp" and ' \
    'not(contains(@style, "display: none;")) and ' \
    './/a[@id="match_link" and boolean(@href)]]'

for index in indexes:
    print(f'Loading index {index}')
    driver.get('https://www.palmerbet.com/sports/soccer')
    clickMe1 = wait(driver, 10).until(EC.element_to_be_clickable((By.XPATH,
        '(//ul[@id="tournaments"]//li//input)[%s]' % str(index + 1))))
    driver.execute_script("arguments[0].scrollIntoView();", clickMe1)
    clickMe1.click()

    try:
        # Wait until at least one visible match block is on the page.
        wait(driver, 3).until(EC.element_to_be_clickable((By.XPATH, xp)))
        elems = driver.find_elements(By.XPATH, xp)

        elem_href = []
        for elem in elems:
            # The href is on the nested <a id="match_link">, not on the
            # match-pop-market block, so read it from the link.
            link = elem.find_element(By.XPATH, './/a[@id="match_link"]')
            href = link.get_attribute('href')
            print(href)
            elem_href.append(href)
    except TimeoutException:
        print(f'There are no matches in index {index}.')
