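I am trying to scrape this website with Python 2.7:
http://www.motogp.com/en/Results+Statistics/
I want to scrape the main one, which covers many categories (events): the link that appears next to "MotoGP Race Classification 2017" in blue letters,
and then the ones after it as well. So far I have: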
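import re
from bs4 import BeautifulSoup
# urllib.request is Python 3; on Python 2.7 use `from urllib2 import urlopen`
from urllib.request import urlopen

url = "http://www.motogp.com/en/Results+Statistics/"
r = urlopen(url).read()  # raw page bytes
soup = BeautifulSoup(r, "html.parser")  # name the parser explicitly
type(soup)
# Raw bytes pattern: first double-quoted string ending in ".pdf"
match = re.search(rb'"(.*?\.pdf)"', r)
pdf_url = "http://resources.motogp.com/files/results/2017/ARG/MotoGP/RAC/Classification" + match.group(1).decode('utf8')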
The links are of this type:
http://resources.motogp.com/files/results/2017/AME/MotoGP/RAC/Classification.pdf?v1_ef0b514c
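So I should append "?" and the characters that follow it. The main problem is how to switch from event to event to get all the links in this format.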
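One way to pick up the "?" suffix as well might be to let the pattern optionally continue past ".pdf" up to the closing quote. This is only a sketch: it assumes the link sits inside a double-quoted attribute, and the `find_pdf_links` helper and the sample markup are hypothetical, not taken from the real page.

```python
import re

# A double-quoted string ending in ".pdf", optionally followed by a
# "?..." query string such as "?v1_ef0b514c".
PDF_LINK = re.compile(r'"([^"]*?\.pdf(?:\?[^"]*)?)"')


def find_pdf_links(html):
    """Return every quoted .pdf link in html, query string included."""
    return PDF_LINK.findall(html)


# Hypothetical markup, used only to exercise the pattern.
sample = (
    '<a class="padleft5" href="http://resources.motogp.com/files/results/'
    '2017/AME/MotoGP/RAC/Classification.pdf?v1_ef0b514c">PDF</a>'
)
print(find_pdf_links(sample))
```

This only covers links already present in the downloaded HTML; it does not switch between events.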
Answer 0 (score: 1)
Based on the description you gave above, this is how you can get those PDF links:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)  # wait up to 10 seconds for each condition
driver.get("http://www.motogp.com/en/Results+Statistics/")

# Click each option of the event dropdown in turn
for item in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#event option"))):
    item.click()
    # Locate the PDF link and print its URL
    elem = wait.until(EC.presence_of_element_located((By.CLASS_NAME, "padleft5")))
    print(elem.get_attribute("href"))
    # Block until that link element goes stale, i.e. the page has refreshed
    wait.until(EC.staleness_of(elem))
driver.quit()
Partial output:
http://resources.motogp.com/files/results/2017/VAL/MotoGP/RAC/worldstanding.pdf?v1_8dbea75c
http://resources.motogp.com/files/results/2017/QAT/MotoGP/RAC/Classification.pdf?v1_f6564614
http://resources.motogp.com/files/results/2017/ARG/MotoGP/RAC/Classification.pdf?v1_9107e18d
http://resources.motogp.com/files/results/2017/AME/MotoGP/RAC/Classification.pdf?v1_ef0b514c
http://resources.motogp.com/files/results/2017/SPA/MotoGP/RAC/Classification.pdf?v1_ba33b120
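Note that the first line of the output points at worldstanding.pdf rather than a classification file. A small helper along these lines could split each URL into its parts and keep only the classification PDFs. It is a sketch under assumptions: the field names (`event`, `category`, `session`) are my own reading of the path segments, and `describe_result_url` and `is_classification` are hypothetical helpers, not part of any library.

```python
from urllib.parse import urlsplit


def describe_result_url(url):
    """Split a result URL into its trailing path fields and version token."""
    parts = urlsplit(url)
    # Assumed layout: /files/results/<year>/<event>/<category>/<session>/<file>
    segments = parts.path.strip("/").split("/")
    year, event, category, session, filename = segments[-5:]
    return {
        "year": year,
        "event": event,
        "category": category,
        "session": session,
        "filename": filename,
        "version": parts.query,  # the part after "?", e.g. "v1_ef0b514c"
    }


def is_classification(url):
    """True if the URL points at a Classification.pdf file."""
    return describe_result_url(url)["filename"] == "Classification.pdf"


urls = [
    "http://resources.motogp.com/files/results/2017/VAL/MotoGP/RAC/worldstanding.pdf?v1_8dbea75c",
    "http://resources.motogp.com/files/results/2017/AME/MotoGP/RAC/Classification.pdf?v1_ef0b514c",
]
for u in urls:
    if is_classification(u):
        info = describe_result_url(u)
        print(info["event"], info["filename"], info["version"])
```

Parsing with `urlsplit` keeps the "?v1_..." token separate from the file name, so the filter does not depend on what the token contains.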