I have scraped the list of PDF links I want from the website https://www.gmcameetings.co.uk. These are all the minutes of local council meetings. Now I need to save all the results to a file so that I can then download and read all the PDFs. How do I save them?
Here is my code:
import requests
import urllib.request
import time
from bs4 import BeautifulSoup as bs

url = "https://www.gmcameetings.co.uk/"
r = requests.get(url)
page = r.text
soup = bs(page, 'lxml')

folder_location = r'E:\Internship\WORK'

meeting_links = soup.find_all('a', href=True)
for link in meeting_links:
    if link['href'].find('/meetings/') > 1:
        r2 = requests.get(link['href'])
        print(link['href'])
        page2 = r2.text
        soup2 = bs(page2, 'lxml')
        date_links = soup2.find_all('a', href=True)
        for dlink in date_links:
            if dlink['href'].find('/meetings/') > 1:
                r3 = requests.get(dlink['href'])
                print(dlink['href'])
                page3 = r3.text
                soup3 = bs(page3, 'lxml')
                pdf_links = soup3.find_all('a', href=True)
                for plink in pdf_links:
                    if plink['href'].find('minutes') > 1:
                        print("Minutes!")
I need a file containing all the links, which I can then read the PDFs from. Sorry, I'm completely new to coding, so I'm a bit lost.
Answer 0 (score: 2)
import requests
from bs4 import BeautifulSoup as bs

url = "https://www.gmcameetings.co.uk/"
r = requests.get(url)
page = r.text
soup = bs(page, 'lxml')

f = open(r"E:\Internship\WORK\links.txt", "w+")
n = 0

meeting_links = soup.find_all('a', href=True)
for link in meeting_links:
    if link['href'].find('/meetings/') > 1:
        r2 = requests.get(link['href'])
        print(link['href'])
        page2 = r2.text
        soup2 = bs(page2, 'lxml')
        date_links = soup2.find_all('a', href=True)
        for dlink in date_links:
            if dlink['href'].find('/meetings/') > 1:
                r3 = requests.get(dlink['href'])
                print(dlink['href'])
                page3 = r3.text
                soup3 = bs(page3, 'lxml')
                pdf_links = soup3.find_all('a', href=True)
                for plink in pdf_links:
                    if plink['href'].find('minutes') > 1:
                        n += 1
                        print("Minutes!")
                        f.write("Link " + str(n) + ": " + str(plink['href']) + "\n")
f.close()
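Since the stated end goal is to download the PDFs afterwards, the "Link N: url" lines this script writes have to be turned back into bare URLs first. A minimal sketch of that step, assuming the file format above (the helper name and sample line are made up for illustration, and the actual download with requests.get is left as a comment):

```python
# Hypothetical helper: recover the bare URL from a "Link N: <url>" line
# in the format written by links.txt above.
def parse_link_line(line):
    # Split on the first ": " after the "Link N" prefix.
    prefix, sep, url = line.partition(": ")
    return url.strip()

sample = "Link 3: https://www.gmcameetings.co.uk/meetings/example-minutes.pdf\n"
print(parse_link_line(sample))

# Downloading would then be, for each recovered url:
#     pdf_bytes = requests.get(url).content
#     with open(local_path, 'wb') as out:
#         out.write(pdf_bytes)
```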
Answer 1 (score: 1)
Just use a plain text file, like this, and write whatever output you need into it:
with open('Test.txt', 'w') as file:
    file.write('Testing output')
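The same pattern extends naturally to a list of collected links: write one per line, then read them back later. A small write-then-read sketch (the URLs are placeholders, not real links from the site):

```python
# Placeholder links standing in for the scraped results.
links = [
    "https://www.gmcameetings.co.uk/meetings/a-minutes.pdf",
    "https://www.gmcameetings.co.uk/meetings/b-minutes.pdf",
]

# Write one link per line.
with open('Test.txt', 'w') as file:
    for url in links:
        file.write(url + '\n')

# Reading them back is the mirror image.
with open('Test.txt') as file:
    saved = [line.strip() for line in file]

print(saved)
```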
Answer 2 (score: 1)
Open the file in write mode before the for loop, write each link inside the loop, and add a newline after each one:
with open('Linkfile.txt', 'w') as f:
    for link in meeting_links:
        if link['href'].find('/meetings/') > 1:
            r2 = requests.get(link['href'])
            print("link1")
            page2 = r2.text
            soup2 = bs(page2, 'lxml')
            date_links = soup2.find_all('a', href=True)
            for dlink in date_links:
                if dlink['href'].find('/meetings/') > 1:
                    r3 = requests.get(dlink['href'])
                    print("link2")
                    page3 = r3.text
                    soup3 = bs(page3, 'lxml')
                    pdf_links = soup3.find_all('a', href=True)
                    for plink in pdf_links:
                        if plink['href'].find('minutes') > 1:
                            print(plink['href'])
                            f.write(plink['href'])
                            f.write('\n')
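Once the links are saved one per line, each URL still needs a local filename for the eventual download step. A hedged sketch using only the standard library (the helper name and example URL are illustrative assumptions, not part of the answer above):

```python
import os
from urllib.parse import urlparse

# Hypothetical helper: derive a local filename from a PDF link by
# taking the last path component of the URL.
def local_name(url):
    return os.path.basename(urlparse(url).path)

print(local_name("https://www.gmcameetings.co.uk/meetings/2019-minutes.pdf"))
```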
Answer 3 (score: 1)
with open('filename.txt', 'a') as fp:
    for link in meeting_links:
        fp.write(link['href'] + '\n')
Answer 4 (score: 1)
We can use Python's context manager, which opens the file (acquires the resource) and, once the block is done, also closes it (releases the resource):
with open('links.txt', 'w') as file:
    file.write('required content')
We can also give the file whatever extension we need, e.g. links.txt, links.csv, etc.
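For the links.csv variant mentioned above, the standard-library csv module can write the links with an index column, which is easier to load into a spreadsheet later. A sketch with placeholder URLs (not real links from the site):

```python
import csv

# Placeholder links standing in for the scraped results.
links = [
    "https://www.gmcameetings.co.uk/meetings/a-minutes.pdf",
    "https://www.gmcameetings.co.uk/meetings/b-minutes.pdf",
]

# Write a header row, then one numbered row per link.
with open('links.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['n', 'url'])
    for n, url in enumerate(links, 1):
        writer.writerow([n, url])

# Read the rows back to check the round trip.
with open('links.csv', newline='') as file:
    rows = list(csv.reader(file))

print(rows)
```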