BeautifulSoup is not showing the correct URLs

Date: 2019-02-06 18:26:33

Tags: python-3.x web-scraping beautifulsoup

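I am working with the following URL: https://www.sesgovernance.com/archived-reports?tpages=261&load_ajax=1&page=0&company_name=&meeting_type=&from_date=&to_date=

I want to extract all the .pdf links from it. When I open it with BeautifulSoup, all the links are cut off, whereas when I only call urllib.request.urlopen(url) and look at the raw response, I can see the links perfectly.

Can someone help me get those .pdf links back?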

import urllib.request
from bs4 import BeautifulSoup

fhand = urllib.request.urlopen('https://www.sesgovernance.com/archived-reports?tpages=261&load_ajax=1&page=0&company_name=&meeting_type=&from_date=&to_date=').read()
soup = BeautifulSoup(fhand,'lxml')

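2 Answers:

Answer 0 (score: 2)

This happens because your page actually serves JSON, not HTML. fhand therefore contains JSON escape sequences (for example, backslash-escaped quotes inside the attribute values), and parsing it directly as HTML gives meaningless results. What you really want is the 'message' field of the JSON in fhand. This should work: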

import urllib.request
from bs4 import BeautifulSoup
import json


fhand = urllib.request.urlopen('https://www.sesgovernance.com/archived-reports?tpages=261&load_ajax=1&page=0&company_name=&meeting_type=&from_date=&to_date=').read()

# The response body is JSON; the HTML fragment lives in its 'message' field.
HTML = json.loads(fhand)['message']
soup = BeautifulSoup(HTML, 'lxml')
# href=True skips anchors that have no href attribute (avoids a KeyError).
a_tags = soup.find_all('a', href=True)
for a_tag in a_tags:
    url = a_tag['href']
    if '.pdf' in url:
        print(url)
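
To see why the links look cut off, here is a minimal offline sketch of the same situation. The payload and the example.com URL are made up for illustration; only the overall shape (a JSON object with an HTML fragment in 'message') follows the response described above. It parses the same data twice: once as the raw JSON text, and once after decoding.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical payload: a JSON object whose "message" field holds an HTML
# fragment. json.dumps escapes the double quotes inside the fragment.
raw = json.dumps({
    "message": '<a class="view-btn" '
               'href="https://example.com/reports/a b.pdf">View</a>'
})


def pdf_hrefs(markup):
    """Return the href of every <a> tag whose href contains '.pdf'."""
    soup = BeautifulSoup(markup, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)
            if ".pdf" in a["href"]]


# Parsing the JSON text itself: the parser sees \" instead of real quotes,
# so the attribute values come out mangled.
wrong = pdf_hrefs(raw)

# Decoding the JSON first and parsing only the HTML fragment.
right = pdf_hrefs(json.loads(raw)["message"])

print("raw JSON parsed as HTML:", wrong)
print("decoded 'message' field:", right)
```

Only the second call sees well-formed attributes, which is why decoding the JSON before handing it to BeautifulSoup fixes the problem.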

Note: I recommend using the requests package instead of urllib.

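Answer 1 (score: 2)

Two options:

a) Parse the JSON response.

b) Use selenium to scrape sesgovernance.com/archived-reports?tpages=261 directly.

Also, consider using Requests rather than urllib.request.urlopen. urllib.request is not deprecated in Python 3, but its own documentation recommends the Requests package as a higher-level HTTP client interface.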

import requests
from bs4 import BeautifulSoup
import re
import json

req = requests.get('https://www.sesgovernance.com/archived-reports?tpages=261&load_ajax=1&page=0&company_name=&meeting_type=&from_date=&to_date=')
req.raise_for_status()
resp = json.loads(req.text)['message']
soup = BeautifulSoup(resp, 'html.parser')
pdf_list = soup.find_all('a', href=re.compile(r'pdf'))
print(pdf_list)

Output:

[<a class="view-btn" href="https://portal.sesgovernance.com/proxy_reports/0925183203Asahi India Glass Ltd_SES Proxy Advisory Report_AGM_7 August 2018.pdf" target="_blank">View</a>, <a class="view-btn" href="https://portal.sesgovernance.com/proxy_reports/3020518587Sobha Ltd_SES Proxy Advisory Report_AGM_7 August 2018.pdf" target="_blank">View</a>, <a class="view-btn" href="https://portal.sesgovernance.com/proxy_reports/2151608573Avanti Feeds Ltd_SES Proxy Advisory Report_AGM_7 August 2018.pdf" target="_blank">View</a>, <a class="view-btn" href="https://portal.sesgovernance.com/proxy_reports/3017832951AU Small Finance Bank Ltd_SES Proxy Advisory Report_AGM_7 August 2018.pdf" target="_blank">View</a>, <a class="view-btn" href="https://portal.sesgovernance.com/proxy_reports/8183831859Mahindra &amp; Mahindra Ltd_SES Proxy Advisory Report_AGM_7 August 2018.pdf" target="_blank">View</a>, <a class="view-btn" href="https://portal.sesgovernance.com/proxy_reports/3259351215Wonderla Holidays Ltd._SES Proxy Advisory Report_AGM_7 August 2018.pdf" target="_blank">View</a>, <a class="view-btn" href="https://portal.sesgovernance.com/proxy_reports/0451115532Hawkins Cooker Ltd_SES Proxy Advisory Report_AGM_07 August 2018.pdf" target="_blank">View</a>, <a class="view-btn" href="https://portal.sesgovernance.com/proxy_reports/3995521831ISGEC Heavy Engineering  Ltd._SES Proxy Advisory Report_AGM_7 August 2018.pdf" target="_blank">View</a>, <a class="view-btn" href="https://portal.sesgovernance.com/proxy_reports/3156275331Kalpataru Power Transmission Ltd_SES Proxy Advisory Report_AGM_7 August 2018.pdf" target="_blank">View</a>, <a class="view-btn" href="https://portal.sesgovernance.com/proxy_reports/1356756312Adani Enterprises Ltd_SES Proxy Advisory Report_AGM_7 August 2018.pdf" target="_blank">View</a>, <a class="view-btn" href="https://portal.sesgovernance.com/proxy_reports/5612331522Adani Transmission Ltd_SES Proxy Advisory Report_AGM_7 August 2018.pdf" target="_blank">View</a>, <a 
class="view-btn" href="https://portal.sesgovernance.com/proxy_reports/3515944823Mphasis Ltd_SES Proxy Advisory Report_AGM_7 August 2018.pdf" target="_blank">View</a>, <a class="view-btn" href="https://portal.sesgovernance.com/proxy_reports/5953271399Bombay Dyeing &amp; Manufacturing Company Ltd_SES Proxy Advisory Report_AGM_7 August 2018.pdf" target="_blank">View</a>, <a class="view-btn" href="https://portal.sesgovernance.com/proxy_reports/3455819426TVS Motor Company Ltd_SES Proxy Advisory Report_AGM_7 August 2018.pdf" target="_blank">View</a>, <a class="view-btn" href="https://portal.sesgovernance.com/proxy_reports/0355221651SRF Ltd._SES Proxy Advisory Report_AGM_7 August 2018.pdf" target="_blank">View</a>]
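
The output above is a list of tag objects, and the hrefs contain spaces and '&' characters (printed as '&amp;' because the tag is serialized back to HTML; the value of a_tag['href'] is the decoded string). If plain URL strings are needed, for example to pass to a downloader that expects percent-encoded URLs, something like the following sketch may help. The helper name encode_href is hypothetical, and the sketch assumes the hrefs are not already percent-encoded.

```python
from urllib.parse import quote


def encode_href(href):
    """Percent-encode spaces, '&' and similar characters in an href.

    ':' and '/' are kept as-is so the scheme, host and path separators
    survive. Assumes the input is not already percent-encoded.
    """
    return quote(href, safe=":/")


# One href taken from the output above, as BeautifulSoup would return it
# from a_tag['href'] (entity already decoded to a plain '&').
hrefs = [
    "https://portal.sesgovernance.com/proxy_reports/"
    "8183831859Mahindra & Mahindra Ltd_SES Proxy Advisory Report_AGM_7 August 2018.pdf",
]

for href in hrefs:
    print(encode_href(href))
```

With the list from the answer, the plain strings can be collected first with [a['href'] for a in pdf_list] and then passed through the helper.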