How to find all the "next" links with BeautifulSoup

Asked: 2017-03-28 17:15:01

Tags: python python-3.x web-scraping beautifulsoup

I currently scrape every page of a particular website by presetting a variable named number_of_pages. Presetting this variable worked until new pages were added that I didn't know about. For example, the code below assumes 3 pages, but the site now has 4.

base_url = 'https://securityadvisories.paloaltonetworks.com/Home/Index/?page='
number_of_pages = 3
# range() stops before its upper bound, so go to number_of_pages + 1
for i in range(1, number_of_pages + 1):
    url_to_scrape = base_url + str(i)

I would like to use BeautifulSoup to find all of the "next" links on the site. The code below finds the second URL, but not the third or fourth. How can I build a list of all the pages before scraping them?

import requests
from bs4 import BeautifulSoup

base_url = 'https://securityadvisories.paloaltonetworks.com/Home/Index/?page='
CrawlRequest = requests.get(base_url)
raw_html = CrawlRequest.text
linkSoupParser = BeautifulSoup(raw_html, 'html.parser')
page = linkSoupParser.find('div', {'class': 'pagination'})
for list_of_links in page.find('a', href=True, text='next'):
    nextURL = 'https://securityadvisories.paloaltonetworks.com' + list_of_links.parent['href']
print(nextURL)
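For context, find() only ever returns the first matching element, so iterating over its result walks that single element's children rather than a list of links. One way to build the full page list up front is to call find_all() on the pagination container instead. The sketch below is illustrative only and assumes the pagination div lists one anchor with an href per page, which may or may not match this site's markup:

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

base_url = 'https://securityadvisories.paloaltonetworks.com/Home/Index/?page='
raw_html = requests.get(base_url).text
soup = BeautifulSoup(raw_html, 'html.parser')

# Collect every link inside the pagination div; depending on the markup this
# may also pick up the "next" link itself or duplicates.
pagination = soup.find('div', {'class': 'pagination'})
page_urls = [urljoin(base_url, a['href'])
             for a in pagination.find_all('a', href=True)]
print(page_urls)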

1 Answer:

Answer 0 (score: 4):

There are several different ways to handle pagination. Here is one of them.

The idea is to start an infinite loop and break out of it once there is no "next" link:

from urllib.parse import urljoin

from bs4 import BeautifulSoup
import requests


with requests.Session() as session:
    page_number = 1
    url = 'https://securityadvisories.paloaltonetworks.com/Home/Index/?page='
    while True:
        print("Processing page: #{page_number}; url: {url}".format(page_number=page_number, url=url))
        response = session.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')

        # check if there is next page, break if not
        next_link = soup.find("a", text="next")
        if next_link is None:
            break

        url = urljoin(url, next_link["href"])
        page_number += 1

print("Done.")

If you run it, you will see the following messages:

Processing page: #1; url: https://securityadvisories.paloaltonetworks.com/Home/Index/?page=
Processing page: #2; url: https://securityadvisories.paloaltonetworks.com/Home/Index/?page=2
Processing page: #3; url: https://securityadvisories.paloaltonetworks.com/Home/Index/?page=3
Processing page: #4; url: https://securityadvisories.paloaltonetworks.com/Home/Index/?page=4
Done.

Note that we maintain the web-scraping session with requests.Session, both for performance (connection reuse) and to keep cookies across requests.
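As a rough illustration of that point (not part of the original answer), the snippet below simply fetches two of the same pages through one Session; whether this particular server actually sets any cookies is not guaranteed:

import requests

# With a Session, both requests share one connection pool and one cookie jar;
# two standalone requests.get() calls would each open a new connection and
# discard any cookies the server set on the previous response.
with requests.Session() as session:
    first = session.get('https://securityadvisories.paloaltonetworks.com/Home/Index/?page=')
    second = session.get('https://securityadvisories.paloaltonetworks.com/Home/Index/?page=2')
    print(first.status_code, second.status_code, session.cookies.get_dict())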