Iterating over multiple pages when web scraping a paginated site with Python

Time: 2021-07-02 22:37:34

Tags: python web-scraping beautifulsoup pagination

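I am trying to extract specific data from a site whose results usually span several pages. I can print everything I need from the first page, but I cannot do the same for the other pages. I searched online for solutions, and most of them loop over the pages by appending a page number to the URL.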

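However, on the site I am working with, the URL does not change when you navigate to a different page. So I am struggling to figure out which attribute takes the browser to the second page, because no clickable link is exposed.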

When I inspect what looks like the "next" button, I get the following:

<div class="pagination__PageNavItem-s1515b5x-2 clogRN"><span class="pagination__PageNavigation-s1515b5x-3 cKpakR">→</span></div>

This is how I get the information I need from the first page:

import requests
from bs4 import BeautifulSoup


url = 'https://www.flightstats.com/v2/flight-tracker/arrivals/LHR/?year=2021&month=7&date=3&hour=12?page=12323213' 
html_page = requests.get(url)
soup = BeautifulSoup(html_page.content, 'html.parser')

airline_text = soup.find_all('div', {"class": "table__Cell-s1x7nv9w-13 iZEpOT"})

for n, i in enumerate(airline_text, start=1):
    print(n, '->', i.get_text())

Is there a way to iterate over the remaining pages?

2 Answers:

Answer 0: (score: 1)

There is a script tag that contains the required information. I parsed it with a regular expression; the attribute called name holds the airline name.

import requests
import re
from pprint import pp


def main(url):
    params = {
        "year": "2021",
        "month": "7",
        "date": "3",
        "hour": "12"
    }
    r = requests.get(url, params=params)
    match = re.findall(r'"name":"(.*?)"', r.text)
    pp(match)


main('https://www.flightstats.com/v2/flight-tracker/arrivals/LHR/')

Output:

['London Heathrow Airport',
 'Qatar Airways',
 'British Airways',
 'American Airlines',
 'Aer Lingus',
 'Qatar Airways',
 'British Airways',
 'American Airlines',
 'JAL',
 'British Airways',
 'British Airways',
 'American Airlines',
 'Emirates',
 'Qantas',
 'British Airways',
 'Iberia',
 'British Airways',
 'American Airlines',
 'Iberia',
 'Qatar Airways',
 'Royal Jordanian',
 'Finnair',
 'Qatar Airways',
 'British Airways',
 'Qatar Airways',
 'Iberia',
 'American Airlines',
 'British Airways',
 'SWISS',
 'Air Canada',
 'United Airlines',
 'British Airways',
 'ANA',
 'Aegean Airlines',
 'United Airlines',
 'American Airlines',
 'Finnair',
 'Iberia',
 'Qatar Airways',
 'United Airlines',
 'British Airways',
 'Lufthansa',
 'Aer Lingus',
 'Air Canada',
 'British Airways',
 'Etihad Airways',
 'British Airways',
 'Qatar Airways',
 'American Airlines',
 'Iberia',
 'Qatar Airways',
 'Gulf Air',
 'Fiji Airways',
 'British Airways',
 'Finnair',
 'Alaska Airlines',
 'Royal Jordanian',
 'EL AL',
 'Royal Jordanian',
 'British Airways',
 'American Airlines',
 'Iberia',
 'Qatar Airways',
 'American Airlines',
 'Xiamen Airlines',
 'Iberia',
 'British Airways',
 'Qatar Airways',
 'British Airways',
 'American Airlines',
 'Iberia',
 'JAL',
 'JAL',
 'American Airlines',
 'British Airways',
 'British Airways',
 'United Airlines',
 'ANA',
 'Iberia',
 'Malaysia Airlines',
 'Qatar Airways',
 'Royal Jordanian',
 'American Airlines',
 'Finnair',
 'SWISS',
 'British Airways',
 'American Airlines',
 'Finnair',
 'Aer Lingus',
 'Iberia',
 'Kuwait Airways',
 'Xiamen Airlines',
 'Garuda Indonesia',
 'American Airlines',
 'British Airways',
 'Malaysia Airlines',
 'China Airlines',
 'KLM',
 'Gol',
 'Virgin Atlantic',
 'Delta Air Lines',
 'American Airlines',
 'Cathay Pacific',
 'British Airways',
 'British Airways',
 'JAL',
 'Qatar Airways',
 'Finnair',
 'Pakistan International Airlines',
 'United Airlines',
 'Air Canada',
 'EgyptAir',
 'TAP Air Portugal',
 'British Airways',
 'TAROM',
 'British Airways',
 'American Airlines',
 'Qatar Airways',
 'Delta Air Lines',
 'Iberia',
 'Air France',
 'British Airways',
 'Aeromexico',
 'KLM',
 'Virgin Atlantic',
 'Singapore Airlines',
 'British Airways',
 'JAL',
 'American Airlines',
 'Aer Lingus',
 'British Airways',
 'British Airways',
 'British Airways',
 'British Airways',
 'British Airways',
 'American Airlines',
 'British Airways',
 'Lufthansa',
 'American Airlines',
 'United Airlines',
 'Croatia Airlines',
 'Malaysia Airlines',
 'JAL',
 'Iberia',
 'Finnair',
 'Aegean Airlines',
 'Cathay Pacific',
 'British Airways',
 'British Airways',
 'American Airlines',
 'Finnair',
 'British Airways',
 'Malaysia Airlines',
 'American Airlines',
 'Cathay Pacific',
 'Emirates',
 'Saudia',
 'American Airlines',
 'Cathay Pacific',
 'LATAM Airlines',
 'British Airways',
 'British Airways',
 'Qatar Airways',
 'Cathay Pacific',
 'Iberia',
 'Gulf Air',
 'British Airways',
 'Finnair',
 'Qatar Airways',
 'Royal Jordanian',
 'Royal Jordanian',
 'American Airlines',
 'British Airways',
 'American Airlines',
 'Malaysia Airlines',
 'British Airways',
 'Iberia',
 'American Airlines',
 'Singapore Airlines',
 'American Airlines',
 'British Airways',
 'TAP Air Portugal',
 'Aegean Airlines',
 'British Airways',
 'Iberia',
 'Azores Airlines',
 'TAP Air Portugal',
 'TAP Air Portugal',
 'Singapore Airlines',
 'Air New Zealand',
 'Air Canada',
 'Virgin Atlantic',
 'SAS',
 'British Airways',
 'British Airways',
 'JAL',
 'Croatia Airlines',
 'Royal Air Maroc',
 'Finnair',
 'British Airways',
 'LATAM Airlines',
 'Malaysia Airlines',
 'British Airways',
 'American Airlines',
 'Finnair',
 'Aer Lingus',
 'Iberia',
 'Iberia',
 'Qatar Airways',
 'Aer Lingus',
 'Air Canada',
 'British Airways',
 'United Airlines',
 'Aeroflot',
 'AZAL Azerbaijan Airlines',
 'Etihad Airways',
 'Iberia',
 'Turkish Airlines',
 'British Airways',
 'American Airlines',
 'Qantas',
 'JAL',
 'American Airlines',
 'British Airways',
 'Delta Air Lines',
 'Alitalia',
 'British Airways',
 'LATAM Airlines',
 'KLM',
 'Garuda Indonesia',
 'Virgin Atlantic',
 'Qatar Airways',
 'Qantas',
 'Malaysia Airlines',
 'Gol',
 'JAL',
 'Iberia',
 'Aer Lingus',
 'China Southern Airlines',
 'Xiamen Airlines',
 'British Airways',
 'Delta Air Lines',
 'Alitalia',
 'Kenya Airways',
 'Delta Air Lines',
 'Virgin Atlantic',
 'American Airlines',
 'British Airways',
 'Biman Bangladesh Airlines',
 'ANA',
 'Kenya Airways',
 'Air France',
 'Aeromexico',
 'Gol',
 'Virgin Atlantic',
 'British Airways',
 'Qatar Airways',
 'British Airways',
 'Iberia',
 'British Airways',
 'Royal Air Maroc',
 'British Airways',
 'Iberia',
 'Qatar Airways',
 'American Airlines',
 'SriLankan Airlines',
 'JAL',
 'British Airways',
 'American Airlines',
 'Finnair',
 'Iberia',
 'British Airways',
 'JAL',
 'LATAM Airlines',
 'British Airways',
 'American Airlines',
 'Qatar Airways',
 'British Airways',
 'British Airways',
 'JAL',
 'British Airways',
 'JAL',
 'American Airlines',
 'SWISS',
 'Etihad Airways',
 'British Airways',
 'British Airways',
 'British Airways',
 'Aer Lingus',
 'Saudia',
 'Ethiopian Airlines',
 'TAP Air Portugal',
 'Singapore Airlines',
 'United Airlines',
 'Azores Airlines',
 'ANA',
 'EgyptAir',
 'EL AL',
 'Etihad Airways',
 'Korean Air',
 'Royal Air Maroc',
 'London Heathrow Airport'] 
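
Note that the pattern matches every "name" key in the response, not only airline names, which is why 'London Heathrow Airport' shows up at both ends of the list. As a minimal sketch, run on a made-up sample string rather than the real page (the sample and the helper name unique_names are hypothetical), the matches could be filtered and de-duplicated while keeping their order:

```python
import re

# Hypothetical sample standing in for r.text; the real page is much larger.
sample = (
    '{"airport":{"name":"London Heathrow Airport"},'
    '"flights":[{"carrier":{"name":"Qatar Airways"}},'
    '{"carrier":{"name":"British Airways"}},'
    '{"carrier":{"name":"Qatar Airways"}}]}'
)


def unique_names(text, exclude=()):
    """Return the "name" values found by the regex, de-duplicated in order."""
    names = re.findall(r'"name":"(.*?)"', text)
    # dict.fromkeys keeps the first occurrence of each name, in order.
    return [n for n in dict.fromkeys(names) if n not in exclude]


airlines = unique_names(sample, exclude={"London Heathrow Airport"})
print(airlines)
```

With the real response, `unique_names(r.text, exclude={...})` would be called in place of `re.findall` inside `main`.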

Answer 1: (score: 0)

The data is stored in a <script> tag inside the page. The following example shows how to extract it:

import re
import json
import requests


url = "https://www.flightstats.com/v2/flight-tracker/arrivals/LHR/?year=2021&month=7&date=3&hour=12?page=12323213"
html_page = requests.get(url).text

data = re.search(r"__NEXT_DATA__ = (.*)", html_page).group(1)
data = json.loads(data)

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

for f in data["props"]["initialState"]["flightTracker"]["route"]["flights"]:
    print(
        "{:<8} {:<8} {:<3} {:<5}".format(
            f["departureTime"]["time24"],
            f["arrivalTime"]["time24"],
            f["carrier"]["fs"],
            f["carrier"]["flightNumber"],
        )
    )

Prints:

07:10    12:10    QR  8866 
10:45    12:15    BA  827  
10:45    12:15    AA  6472 
10:45    12:15    EI  8327 
10:45    12:15    QR  5952 
11:00    12:20    BA  579  
11:00    12:20    AA  6838 
11:00    12:20    JL  7156 

...
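
The `(.*)` in the regex stops at the first line break, so it relies on the whole JSON object sitting on one line. A possible alternative, sketched here on a made-up and heavily trimmed sample page (the sample HTML and the helper name extract_next_data are assumptions, not the real page), is to find the `__NEXT_DATA__ = ` marker and let `json.JSONDecoder.raw_decode` parse exactly one JSON value starting there:

```python
import json

# Hypothetical, heavily trimmed stand-in for the downloaded page.
sample_html = """<script>
__NEXT_DATA__ = {"props": {"initialState": {"flightTracker": {"route": {"flights": [
  {"departureTime": {"time24": "07:10"}, "arrivalTime": {"time24": "12:10"},
   "carrier": {"fs": "QR", "flightNumber": "8866"}},
  {"departureTime": {"time24": "10:45"}, "arrivalTime": {"time24": "12:15"},
   "carrier": {"fs": "BA", "flightNumber": "827"}}
]}}}}};
module = {}
</script>"""


def extract_next_data(html):
    """Parse the JSON object assigned to __NEXT_DATA__ in the page source."""
    marker = "__NEXT_DATA__ = "
    # Index of the first character after the marker (expected to be "{").
    start = html.index(marker) + len(marker)
    # raw_decode parses one JSON value and ignores whatever follows it.
    data, _end = json.JSONDecoder().raw_decode(html, start)
    return data


data = extract_next_data(sample_html)
flights = data["props"]["initialState"]["flightTracker"]["route"]["flights"]
rows = [
    (
        f["departureTime"]["time24"],
        f["arrivalTime"]["time24"],
        f["carrier"]["fs"],
        f["carrier"]["flightNumber"],
    )
    for f in flights
]
for row in rows:
    print("{:<8} {:<8} {:<3} {:<5}".format(*row))
```

This assumes the marker is followed directly by the opening brace, as in the sample; `raw_decode` does not skip leading whitespace, so a different layout would need the start index adjusted.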