I've set up some code to scrape PDFs from my local council's website. I request the page I want, then get the links to the different dates, then within each of those pages get the links to the PDFs. However, it doesn't return any results.
I've gone over the code and can't figure it out. It runs fine in a Jupyter notebook and doesn't return any errors.
Here is my code:
import requests
from bs4 import BeautifulSoup as bs

dates = ['April 2019', 'July 2019', 'December 2018']

r = requests.get('https://www.gmcameetings.co.uk/meetings/committee/36/economy_business_growth_and_skills_overview_and_scrutiny')
soup = bs(r.content, 'lxml')

f = open(r"E:\Internship\WORK\GMCA\Getting PDFS\gmcabusinessdatelinks.txt", "w+")

for date in dates:
    if ['a'] in soup.select('a:contains("' + date + '")'):
        r2 = requests.get(date['href'])
        print("link1")
        page2 = r2.text
        soup2 = bs(page2, 'lxml')
        pdf_links = soup2.find_all('a', href=True)
        for plink in pdf_links:
            if plink['href'].find('minutes') > 1:
                print("Minutes!")
                f.write(str(plink['href']) + ' ')
f.close()
It creates the text file, but it comes out blank. I want a text file containing all of the PDF links. Thanks.
Answer 0 (score: 1)
If you want to get the PDF links whose href contains the minutes keyword, the following should do the trick:
import requests
from bs4 import BeautifulSoup

link = 'https://www.gmcameetings.co.uk/meetings/committee/36/economy_business_growth_and_skills_overview_and_scrutiny'
dates = ['April 2019', 'July 2019', 'December 2018']

r = requests.get(link)
soup = BeautifulSoup(r.text, 'lxml')
# Collect the href of every anchor whose text contains one of the dates.
target_links = [[i['href'] for i in soup.select(f'a:contains("{date}")')] for date in dates]

with open("output_file.txt", "w", encoding="utf-8") as f:
    for target_link in target_links:
        res = requests.get(target_link[0])
        soup_obj = BeautifulSoup(res.text, "lxml")
        # Keep only the anchors whose href contains the substring 'minutes'.
        pdf_links = [item.get("href") for item in soup_obj.select("#content .item-list a[href*='minutes']")]
        for pdf_file in pdf_links:
            print(pdf_file)
            f.write(pdf_file + "\n")
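The reason your original version writes nothing is the membership test: ['a'] in soup.select(...) asks whether the literal list ['a'] is among the tags the selector returned, which is never true, so the loop body never runs. The [href*='minutes'] attribute selector above then does the filtering your .find('minutes') > 1 check was attempting. Here is a minimal sketch of how that selector behaves, on made-up markup (the HTML below is invented for illustration, not taken from the council site):

from bs4 import BeautifulSoup

# Hypothetical markup for demonstration only - the real page's structure will differ.
html = """
<div id="content">
  <div class="item-list">
    <a href="/documents/minutes_april_2019.pdf">Minutes</a>
    <a href="/documents/agenda_april_2019.pdf">Agenda</a>
  </div>
</div>
"""

soup = BeautifulSoup(html, "lxml")
# [href*='minutes'] matches any <a> whose href contains the substring 'minutes'.
print([a["href"] for a in soup.select("#content .item-list a[href*='minutes']")])
# -> ['/documents/minutes_april_2019.pdf']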
Answer 1 (score: 1)
You could use a regular expression instead: soup.find('a', text=re.compile(date)):
import requests
from bs4 import BeautifulSoup as bs
import re

dates = ['April 2019', 'July 2019', 'December 2018']

r = requests.get('https://www.gmcameetings.co.uk/meetings/committee/36/economy_business_growth_and_skills_overview_and_scrutiny')
soup = bs(r.content, 'lxml')

f = open(r"E:\gmcabusinessdatelinks.txt", "w+")

for date in dates:
    # Find the anchor whose text contains the date string.
    link = soup.find('a', text=re.compile(date))
    r2 = requests.get(link['href'])
    print("link1")
    page2 = r2.text
    soup2 = bs(page2, 'lxml')
    pdf_links = soup2.find_all('a', href=True)
    for plink in pdf_links:
        if plink['href'].find('minutes') > 1:
            print("Minutes!")
            f.write(str(plink['href']) + ' ')
f.close()
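Two caveats on the regex approach. First, if a date does not appear on the page, find returns None and link['href'] raises a TypeError, so you may want to guard against that. Second, recent versions of BeautifulSoup have renamed the text= argument to string= (text= still works as an alias). A minimal sketch on an invented anchor, assuming a newer bs4:

import re
from bs4 import BeautifulSoup

# Invented snippet for illustration; the real page's markup will differ.
html = '<a href="/meetings/april-2019">Meeting papers, April 2019</a>'
soup = BeautifulSoup(html, 'lxml')

# string= is the modern spelling of text=; re.compile gives a substring
# match against the anchor's text, with a None guard for missing dates.
link = soup.find('a', string=re.compile('April 2019'))
if link is not None:
    print(link['href'])  # -> /meetings/april-2019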