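I am trying to scrape this web page with the script below.
I cannot get the wait for this element to succeed, and the page is not scraped correctly.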
clickMe = wait(driver, 3).until(EC.element_to_be_clickable((By.CSS_SELECTOR, ('//a[@class='style-scope match-pop-market']'))))
The element shown in the Chrome inspector is correct:
//a[@class='style-scope match-pop-market']
How do I get elem_href for the current page only, without picking up elements that belong to other pages and are not visible? I tried:
//div[@class='mpm_match_title' and .//div[@class='mpm_match_title style-scope match-pop-market']]//a[@class='style-scope match-pop-market']
This does not work, although in theory it should solve the problem. Any ideas? Current output:
None
None
None
None
None
None
None
None
None
None
https://www.palmerbet.com/sports/soccer/italy-serie-b/match/6381070
https://www.palmerbet.com/sports/soccer/italy-serie-b/match/6386987
https://www.palmerbet.com/sports/soccer/italy-serie-b/match/6386988
https://www.palmerbet.com/sports/soccer/italy-serie-b/match/6386989
https://www.palmerbet.com/sports/soccer/italy-serie-b/match/6386990
https://www.palmerbet.com/sports/soccer/italy-serie-b/match/6386991
https://www.palmerbet.com/sports/soccer/italy-serie-b/match/6386992
https://www.palmerbet.com/sports/soccer/italy-serie-b/match/6387025
https://www.palmerbet.com/sports/soccer/italy-serie-b/match/6387026
https://www.palmerbet.com/sports/soccer/italy-serie-b/match/6387027
https://www.palmerbet.com/sports/soccer/italy-serie-b/match/6387028
The wait cannot succeed because it targets elements that are not visible on the current page.
So:
//div[contains(@class, 'mpm_match_title')] #TEXT
//div[contains(@class, 'mpm_match_title style-scope match-pop-market')] #BAR
//a[contains(@class, 'style-scope match-pop-market')] #HREF
style-scope match-pop-market
Combined:
//div[contains(@class, 'mpm_match_title') and .//div[contains(@class, 'mpm_match_title style-scope match-pop-market')]//a[@class='style-scope match-pop-market']
This cannot be found.
Desired output:
https://www.palmerbet.com/sports/soccer/italy-serie-b/match/6381070
https://www.palmerbet.com/sports/soccer/italy-serie-b/match/6386987
https://www.palmerbet.com/sports/soccer/italy-serie-b/match/6386988
https://www.palmerbet.com/sports/soccer/italy-serie-b/match/6386989
https://www.palmerbet.com/sports/soccer/italy-serie-b/match/6386990
https://www.palmerbet.com/sports/soccer/italy-serie-b/match/6386991
https://www.palmerbet.com/sports/soccer/italy-serie-b/match/6386992
https://www.palmerbet.com/sports/soccer/italy-serie-b/match/6387025
https://www.palmerbet.com/sports/soccer/italy-serie-b/match/6387026
https://www.palmerbet.com/sports/soccer/italy-serie-b/match/6387027
https://www.palmerbet.com/sports/soccer/italy-serie-b/match/6387028
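Answer 0: (score: 0)
Using the code from the pastebin link in the comments, I essentially changed only the XPath, so that it searches for the specific elements that identify the links on the current page.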
from random import shuffle
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait as wait
driver = webdriver.Chrome()
driver.set_window_size(1024, 600)
driver.maximize_window()
driver.get('https://www.palmerbet.com/sports/soccer')
clickMe = wait(driver, 3).until(EC.element_to_be_clickable((By.XPATH,
('//*[contains(@class,"filter_labe")]'))))
options = driver.find_elements(By.XPATH, '//*[contains(@class,"filter_labe")]')
indexes = [index for index in range(len(options))]
shuffle(indexes)
xp = '//sport-match-grp[not(contains(@style, "display: none;"))]' \
'//match-pop-market[@class="sport-match-grp" and ' \
'not(contains(@style, "display: none;")) and ' \
'.//a[@id="match_link" and boolean(@href)]]'
for index in indexes:
    print(f'Loading index {index}')
    driver.get('https://www.palmerbet.com/sports/soccer')
    # wait for the tournament checkbox at this position, then select it
    clickMe1 = wait(driver, 10).until(EC.element_to_be_clickable((By.XPATH,
        '(//ul[@id="tournaments"]//li//input)[%s]' % str(index + 1))))
    driver.execute_script("arguments[0].scrollIntoView();", clickMe1)
    clickMe1.click()
    try:
        # this attempts to find any links on the page
        clickMe = wait(driver, 3).until(EC.element_to_be_clickable((
            By.XPATH, xp)))
        elems = driver.find_elements(By.XPATH, xp)
        elem_href = []
        for elem in elems:
            # read the href from the nested link, not from its container
            href = elem.find_element(By.XPATH, './/a[@id="match_link"]') \
                .get_attribute('href')
            print(href)
            elem_href.append(href)
    except Exception:
        # the wait timed out or the lookup failed, so treat it as "no matches"
        print(f'There are no matches in index {index}.')
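
To see what the XPath filter above is doing without opening a browser, the same idea can be sketched in plain Python with the standard library. This is only an illustration under stated assumptions: the markup in SAMPLE is a hypothetical, simplified stand-in for the real page, the example.com URLs are placeholders, and the helper names is_hidden and visible_match_links are made up for this sketch. It skips every group or market whose style contains "display: none;" and keeps only the links that actually carry an href, which is why the None entries disappear.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup (an assumption, not the real page source).
SAMPLE = """
<root>
  <sport-match-grp style="display: none;">
    <match-pop-market class="sport-match-grp">
      <a id="match_link" href="https://example.com/match/1"/>
    </match-pop-market>
  </sport-match-grp>
  <sport-match-grp style="">
    <match-pop-market class="sport-match-grp">
      <a id="match_link" href="https://example.com/match/2"/>
    </match-pop-market>
    <match-pop-market class="sport-match-grp" style="display: none;">
      <a id="match_link" href="https://example.com/match/3"/>
    </match-pop-market>
    <match-pop-market class="sport-match-grp">
      <a id="match_link"/>
    </match-pop-market>
  </sport-match-grp>
</root>
""".strip()


def is_hidden(element):
    """Return True if the element's inline style hides it."""
    return "display: none;" in element.get("style", "")


def visible_match_links(root):
    """Collect hrefs of match links inside visible groups and markets."""
    hrefs = []
    for group in root.iter("sport-match-grp"):
        if is_hidden(group):
            continue  # the whole group belongs to another, hidden page
        for market in group.iter("match-pop-market"):
            if market.get("class") != "sport-match-grp" or is_hidden(market):
                continue
            for link in market.iter("a"):
                href = link.get("href")
                # skip links whose href is missing or empty
                if link.get("id") == "match_link" and href:
                    hrefs.append(href)
    return hrefs


links = visible_match_links(ET.fromstring(SAMPLE))
print(links)
```

With this sample only the second match survives: the first sits in a hidden group, the third in a hidden market, and the last link has no href at all.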