Scraping PDFs from this web page

Date: 2018-01-15 20:07:16

Tags: python pdf web-scraping

I'm trying to scrape this website with Python 2.7:

http://www.motogp.com/en/Results+Statistics/

I want to scrape the main one, which has many categories (events); one of them appears next to the blue "MotoGP Race Classification 2017" text.

The same goes for scraping the ones after it. So far I have:

import re
from bs4 import BeautifulSoup
from urllib.request import urlopen

url = "http://www.motogp.com/en/Results+Statistics/"
r = urlopen(url).read()
soup = BeautifulSoup(r, "html.parser")  # pass a parser explicitly

# Find the first ".pdf" link in the raw page source
match = re.search(b'\"(.*?\.pdf)\"', r)
pdf_url = "http://resources.motogp.com/files/results/2017/ARG/MotoGP/RAC/Classification" + match.group(1).decode('utf8')

The links are of this form:

http://resources.motogp.com/files/results/2017/AME/MotoGP/RAC/Classification.pdf?v1_ef0b514c

So I should also append the part after the "?" character. The main problem is how to go from event to event so I can get all the links in this format.
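As a side note (my own sketch, not part of the original post): instead of `re.search`, which stops at the first match, `re.findall` can collect every ".pdf" link, including the "?v1_..." part after the extension, in one pass. The HTML snippet below is a hypothetical stand-in for the real page source:

```python
import re

# Hypothetical snippet standing in for the real page source
html = (b'<a href="http://resources.motogp.com/files/results/2017/ARG/'
        b'MotoGP/RAC/Classification.pdf?v1_9107e18d">ARG</a>'
        b'<a href="http://resources.motogp.com/files/results/2017/AME/'
        b'MotoGP/RAC/Classification.pdf?v1_ef0b514c">AME</a>')

# Capture the full URL, including the "?v1_..." tag after ".pdf"
pdf_links = [m.decode("utf8")
             for m in re.findall(rb'"(http[^"]*?\.pdf[^"]*)"', html)]
print(pdf_links)
```

This only sees links present in the static HTML, though; links rendered by JavaScript after an event is selected (the actual problem here) still require a browser-driven approach like the answer below uses.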

1 answer:

Answer 0 (score: 1)

Based on the description you provided above, here is how to get those pdf links:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
driver.get("http://www.motogp.com/en/Results+Statistics/")

# Iterate over every option in the "event" dropdown
for item in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#event option"))):
    item.click()
    # Wait for the results link to render, then read its pdf URL
    elem = wait.until(EC.presence_of_element_located((By.CLASS_NAME, "padleft5")))
    print(elem.get_attribute("href"))
    # Wait for the old link to go stale before moving to the next event
    wait.until(EC.staleness_of(elem))

driver.quit()

Partial output:

http://resources.motogp.com/files/results/2017/VAL/MotoGP/RAC/worldstanding.pdf?v1_8dbea75c
http://resources.motogp.com/files/results/2017/QAT/MotoGP/RAC/Classification.pdf?v1_f6564614
http://resources.motogp.com/files/results/2017/ARG/MotoGP/RAC/Classification.pdf?v1_9107e18d
http://resources.motogp.com/files/results/2017/AME/MotoGP/RAC/Classification.pdf?v1_ef0b514c
http://resources.motogp.com/files/results/2017/SPA/MotoGP/RAC/Classification.pdf?v1_ba33b120
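Once the links are collected, each PDF can be saved to disk. This is my own sketch (not from the answer), assuming Python 3's `urllib`: the local filename is built from the event code in the URL path, so the "?v1_..." query-string tag doesn't end up in the name. The `local_name` helper is hypothetical, not part of any library:

```python
from urllib.parse import urlparse
from urllib.request import urlretrieve  # performs the actual network download

def local_name(pdf_url):
    """Build a filename like 'ARG_Classification.pdf' from the URL path."""
    path = urlparse(pdf_url).path       # drops the "?v1_..." query string
    parts = path.strip("/").split("/")  # e.g. files/results/2017/ARG/MotoGP/RAC/Classification.pdf
    event, fname = parts[3], parts[-1]
    return "{}_{}".format(event, fname)

url = "http://resources.motogp.com/files/results/2017/ARG/MotoGP/RAC/Classification.pdf?v1_9107e18d"
print(local_name(url))  # ARG_Classification.pdf
# urlretrieve(url, local_name(url))  # uncomment to actually download
```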