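I am trying to download all the reports from the following page: https://www.opec.org/opec_web/en/publications/4814.htm but I cannot work out how to find the links automatically with Beautiful Soup and requests. Can anyone help?
So far I have tried the following code: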
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re
req = Request("https://www.opec.org/opec_web/static_files_project/media")
html_page = urlopen(req)
soup = BeautifulSoup(html_page, "lxml")
links = []
for link in soup.findAll('a'):
    print(link.get('href'))
Answer 0 (score: 2)
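Your code should look something like the following.
Since the page is an HTML document, the built-in "html.parser" is sufficient (lxml also works if it is installed), and the request must point to the correct URL: the publications page itself rather than the static_files_project/media directory.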
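from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

# Request the publications page itself, not the media directory.
req = Request("https://www.opec.org/opec_web/en/publications/4814.htm")
html_page = urlopen(req)
soup = BeautifulSoup(html_page, "html.parser")

links = []
for link in soup.find_all('a'):
    href = link.get('href')
    # Anchors without an href attribute return None, so check before the substring test.
    if href and 'pdf' in href:
        links.append(href)
        print(href)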
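
Printing the links is only half of what the question asks for; the reports still have to be downloaded. The following is a minimal sketch of that step, not a tested recipe for this site. It assumes the href values on the page may be relative, so each one is resolved against the page URL with urljoin. The helper names absolute_pdf_urls and download_all, the "reports" output directory, and the browser-like User-Agent header are all hypothetical choices made for illustration.

```python
import os
from urllib.parse import urljoin, urlparse
from urllib.request import Request, urlopen

PAGE_URL = "https://www.opec.org/opec_web/en/publications/4814.htm"


def absolute_pdf_urls(hrefs, base_url=PAGE_URL):
    """Resolve each href against base_url and keep unique URLs whose path ends in .pdf."""
    urls = []
    for href in hrefs:
        if not href:
            # Skip None or empty values from anchors without an href.
            continue
        url = urljoin(base_url, href)
        if urlparse(url).path.lower().endswith(".pdf") and url not in urls:
            urls.append(url)
    return urls


def download_all(urls, dest_dir="reports"):
    """Save each URL into dest_dir, named after the last segment of its path."""
    os.makedirs(dest_dir, exist_ok=True)
    for url in urls:
        filename = os.path.basename(urlparse(url).path)
        # Assumption: a browser-like User-Agent; some servers reject urllib's default.
        req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
        with urlopen(req) as response, open(os.path.join(dest_dir, filename), "wb") as out:
            out.write(response.read())
```

To use it, gather the href values found by the loop into a list and call download_all(absolute_pdf_urls(that_list)). Matching on a path that ends in ".pdf" is stricter than the substring test 'pdf' in href, so it will not pick up pages that merely contain "pdf" somewhere in their address.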