I am currently scraping all pages of a particular site by presetting a variable called number_of_pages. Hard-coding that variable worked until new pages I didn't know about were added. For example, the code below assumes 3 pages, but the site now has 4.
base_url = 'https://securityadvisories.paloaltonetworks.com/Home/Index/?page='
number_of_pages = 3
# range's end is exclusive, so go up to number_of_pages + 1 to cover every page
for i in range(1, number_of_pages + 1):
    url_to_scrape = base_url + str(i)
I would like to use BeautifulSoup to find every "next" link on the site instead. The code below finds the second URL, but not the third or fourth. How can I build a list of all the pages before scraping them?
import requests
from bs4 import BeautifulSoup

base_url = 'https://securityadvisories.paloaltonetworks.com/Home/Index/?page='
CrawlRequest = requests.get(base_url)
raw_html = CrawlRequest.text
linkSoupParser = BeautifulSoup(raw_html, 'html.parser')
page = linkSoupParser.find('div', {'class': 'pagination'})
for list_of_links in page.find('a', href=True, text='next'):
    nextURL = 'https://securityadvisories.paloaltonetworks.com' + list_of_links.parent['href']
    print(nextURL)
Answer 0 (score: 4)
There are several different ways to handle pagination; here is one of them.
The idea is to start an endless loop and break out of it once there is no "next" link:
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import requests

with requests.Session() as session:
    page_number = 1
    url = 'https://securityadvisories.paloaltonetworks.com/Home/Index/?page='
    while True:
        print("Processing page: #{page_number}; url: {url}".format(page_number=page_number, url=url))

        response = session.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')

        # check if there is a next page, break if not
        next_link = soup.find("a", text="next")
        if next_link is None:
            break

        url = urljoin(url, next_link["href"])
        page_number += 1

print("Done.")
If you execute it, you will see the following messages:
Processing page: #1; url: https://securityadvisories.paloaltonetworks.com/Home/Index/?page=
Processing page: #2; url: https://securityadvisories.paloaltonetworks.com/Home/Index/?page=2
Processing page: #3; url: https://securityadvisories.paloaltonetworks.com/Home/Index/?page=3
Processing page: #4; url: https://securityadvisories.paloaltonetworks.com/Home/Index/?page=4
Done.
Note that, to improve performance and persist cookies across requests, we are using requests.Session to maintain a web-scraping session.
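
If you would rather first build a list of every page URL and only scrape them afterwards, as the question asks, the same loop can collect the URLs instead of processing each page immediately. This is a minimal sketch under the same assumption (each page exposes an "a" element whose text is "next"); the collect_page_urls function name is just an illustration, not part of any library:

from urllib.parse import urljoin
from bs4 import BeautifulSoup
import requests

def collect_page_urls(start_url):
    """Follow 'next' links and return every page URL, without scraping yet."""
    urls = []
    with requests.Session() as session:
        url = start_url
        while True:
            urls.append(url)
            soup = BeautifulSoup(session.get(url).content, 'html.parser')
            next_link = soup.find("a", text="next")
            if next_link is None:
                break
            url = urljoin(url, next_link["href"])
    return urls

all_pages = collect_page_urls('https://securityadvisories.paloaltonetworks.com/Home/Index/?page=')
print(len(all_pages), "pages found")
# each URL in all_pages can now be fetched and parsed in a separate pass

This way the pagination discovery is separated from the per-page scraping, so the scraper no longer depends on a hard-coded number_of_pages.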