How do I save a list of results to a file?

Date: 2019-07-15 11:12:03

Tags: python pdf web-scraping

I have already scraped the list of PDF links I want from the website https://www.gmcameetings.co.uk. These are all the minutes of the local council meetings. Now I need to save all the results to a file so that I can then download and read all the PDFs. How do I save them?

Here is my code:

import requests
import urllib.request
import time 
from bs4 import BeautifulSoup as bs

url = "https://www.gmcameetings.co.uk/"

r = requests.get(url)
page = r.text
soup = bs(page,'lxml')

# folder where the PDFs will eventually be saved (not used yet)
folder_location = r'E:\Internship\WORK'

meeting_links = soup.find_all('a', href=True)

for link in meeting_links:
    if link['href'].find('/meetings/')>1:
        r2 = requests.get(link['href'])
        print(link['href'])
        page2 = r2.text
        soup2 = bs(page2, 'lxml')
        date_links = soup2.find_all('a', href=True)
        for dlink in date_links:
            if dlink['href'].find('/meetings/')>1:
                r3 = requests.get(dlink['href'])
                print(dlink['href'])
                page3 = r3.text
                soup3 = bs(page3, 'lxml')
                pdf_links = soup3.find_all('a', href=True)
                for plink in pdf_links:
                    if plink['href'].find('minutes')>1:
                        print("Minutes!")

I need a file containing all the links, which I can then read the PDFs from. Sorry, I'm completely new to coding, so I'm a bit lost.

5 answers:

Answer 0: (score: 2)

import requests
from bs4 import BeautifulSoup as bs

url = "https://www.gmcameetings.co.uk/"

r = requests.get(url)
page = r.text
soup = bs(page,'lxml')

# open a text file to collect the links, plus a counter for numbering them
f = open(r"E:\Internship\WORK\links.txt", "w+")
n = 0

meeting_links = soup.find_all('a', href=True)

for link in meeting_links:
    if link['href'].find('/meetings/')>1:
        r2 = requests.get(link['href'])
        print(link['href'])
        page2 = r2.text
        soup2 = bs(page2, 'lxml')
        date_links = soup2.find_all('a', href=True)
        for dlink in date_links:
            if dlink['href'].find('/meetings/')>1:
                r3 = requests.get(dlink['href'])
                print(dlink['href'])
                page3 = r3.text
                soup3 = bs(page3, 'lxml')
                pdf_links = soup3.find_all('a', href=True)
                for plink in pdf_links:
                    if plink['href'].find('minutes')>1:
                        n += 1
                        print("Minutes!")
                        # record each minutes link on its own numbered line
                        f.write("Link " + str(n) + ": " + str(plink['href']) + "\n")
f.close()
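
Once the links are collected in links.txt, the PDFs themselves still have to be downloaded. Below is a minimal follow-up sketch, not part of the original answer: it assumes the "Link n: url" line format written above, the same E:\Internship\WORK folder, and that the last part of each URL works as a local file name.

import os
import requests

folder_location = r'E:\Internship\WORK'

with open(r'E:\Internship\WORK\links.txt') as f:
    for line in f:
        # each line written above looks like "Link 1: <url>"
        if ': ' not in line:
            continue
        url = line.split(': ', 1)[1].strip()
        # use the last part of the URL as the local file name (an assumption)
        filename = os.path.join(folder_location, url.rstrip('/').split('/')[-1])
        response = requests.get(url)
        with open(filename, 'wb') as pdf_file:
            pdf_file.write(response.content)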

Answer 1: (score: 1)

Just use a plain text file like this and then direct whatever output you need into it:

with open('Test.txt', 'w') as file:
    file.write('Testing output')

Answer 2: (score: 1)

Open the file in write mode before the for loop, write the link on each iteration, and add a newline after each one.

with open('Linkfile.txt', 'w') as f:
    for link in meeting_links:
        if link['href'].find('/meetings/')>1:
            r2 = requests.get(link['href'])
            print("link1")
            page2 = r2.text
            soup2 = bs(page2, 'lxml')
            date_links = soup2.find_all('a', href=True)
            for dlink in date_links:
                if dlink['href'].find('/meetings/')>1:
                    r3 = requests.get(dlink['href'])
                    print("link2")
                    page3 = r3.text
                    soup3 = bs(page3, 'lxml')
                    pdf_links = soup3.find_all('a', href=True)
                    for plink in pdf_links:
                        if plink['href'].find('minutes')>1:
                            print(plink['href'])
                            f.write(plink['href'])
                            f.write('\n')
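
Reading the saved links back into a Python list later (for the download step) is then a short snippet; this assumes the Linkfile.txt name and the one-URL-per-line format used above:

# read the saved links back into a list for the download step
with open('Linkfile.txt') as f:
    pdf_urls = [line.strip() for line in f if line.strip()]

print(len(pdf_urls), "links loaded")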

Answer 3: (score: 1)

for link in meeting_links:
    # append mode adds to the end of the file on each pass;
    # write the href string plus a newline, not the whole tag
    with open('filename.txt', 'a') as fp:
        fp.write(link['href'] + '\n')

Answer 4: (score: 1)

We can use Python's context manager to open the file (acquire the resource), and once the block has finished executing it also closes the file (releases the resource) for us.

with open('links.txt', 'w') as file:
    file.write('required content')

We can also give the file whatever extension it needs, for example links.txt, links.csv, and so on.
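
For the CSV case, a small sketch using Python's csv module could look like the following; the pdf_urls list and the column names are placeholders for illustration, not part of the original answer.

import csv

# placeholder list standing in for the scraped minutes links
pdf_urls = ['https://www.gmcameetings.co.uk/example-minutes.pdf']

with open('links.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['index', 'url'])  # header row
    for i, url in enumerate(pdf_urls, start=1):
        writer.writerow([i, url])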