Question

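I need a concrete answer on how to actually iterate over and parse multiple pages of a website when the URL is known, but only partially. I have looked at many tutorials, but none of them show how to actually move to the next page; maybe I need to use regular expressions. I would appreciate some advice or a pointer on where to start looking. Let me give an example using the Python website, which we all know well: https://docs.python.org/3/tutorial/ On this page there is a "next" button that continues to the page https://docs.python.org/3/tutorial/appetite.html If you keep clicking "next", only the final /*.html part of the URL changes.

What I am hoping for is advice on how to iterate through all of those final /*.html pages and capture the HTML of each one.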

Answer 1

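Since the href values are all relative to the current URL, you cannot simply check whether the href attribute starts with https://docs.python.org/3/tutorial/. Note that these links carry the reference and internal classes, so use either of the following: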

soup.find_all("a", class_=["reference", "internal"])  # a list matches links that have either class
soup.select("a.reference.internal")  # CSS selector: the link must have both classes

Here is working sample code that extracts the href values from the page:

from urllib.parse import urljoin  # Python 3; in Python 2 this was "from urlparse import urljoin"

import requests
from bs4 import BeautifulSoup


base_url = "https://docs.python.org/3/tutorial/"
response = requests.get(base_url)
soup = BeautifulSoup(response.content, "html.parser")

for link in soup.select("a.reference.internal"):
    url = link["href"]
    absolute_url = urljoin(base_url, url)

    print(url, absolute_url)

Note that we have to use urljoin() to get absolute URLs so that we can follow them.
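To make the behavior of urljoin() concrete, here is a small standalone sketch. The href values are illustrative examples of the kind of relative links found on the tutorial index; they are assumptions written by hand, not output captured from the live page.

```python
from urllib.parse import urljoin

base_url = "https://docs.python.org/3/tutorial/"

# Illustrative href values (assumed, not scraped from the live page).
hrefs = [
    "appetite.html",
    "inputoutput.html#old-string-formatting",
    "../library/index.html",
]

# urljoin resolves each relative href against the base URL,
# the same way a browser resolves a link on that page.
absolute_urls = [urljoin(base_url, href) for href in hrefs]

for href, absolute in zip(hrefs, absolute_urls):
    print(href, "->", absolute)
```

Because the base URL ends with a slash, plain file names are resolved inside /3/tutorial/, while a leading ../ moves up one directory.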

Answer 2

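alecxe's answer is good, and it is basically the second half of this answer, but it yields duplicate pages. For example, the URLs https://docs.python.org/3/tutorial/inputoutput.html and https://docs.python.org/3/tutorial/inputoutput.html#old-string-formatting are actually the same page; the second one is just an anchor within that page.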
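One possible way to drop such duplicates is to strip the fragment (the part after #) before comparing URLs. The sketch below is not part of alecxe's code; unique_pages is a hypothetical helper name, and it relies only on urldefrag() from the standard library.

```python
from urllib.parse import urldefrag


def unique_pages(urls):
    """Return the URLs with fragments removed, keeping first-seen order."""
    seen = set()
    pages = []
    for url in urls:
        # urldefrag splits "page.html#anchor" into ("page.html", "anchor").
        page, _fragment = urldefrag(url)
        if page not in seen:
            seen.add(page)
            pages.append(page)
    return pages


urls = [
    "https://docs.python.org/3/tutorial/inputoutput.html",
    "https://docs.python.org/3/tutorial/inputoutput.html#old-string-formatting",
    "https://docs.python.org/3/tutorial/errors.html",
]
print(unique_pages(urls))
```

Applied to the absolute URLs collected from the index, this should leave one entry per HTML page.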

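If you want to do it the way you originally described, by finding the value of the "next" link's href and then navigating there, you can do it like this:

Select the h3 headings inside the div.related navigation bars, then search the parent of the last one with a regular expression for the link whose text contains "next" to get the actual href. Use urljoin() to join BASE_URL and the href into the absolute URL of the next page.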

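import re
from urllib.parse import urljoin  # Python 3 location of urljoin

import requests
from bs4 import BeautifulSoup


BASE_URL = "https://docs.python.org/3/tutorial/"


def get_next_url(url):
    """Return the absolute URL of the page's "next" link, or None if there is none."""
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    selected = soup.select("div.related h3")
    nav = selected[-1] if selected else None  # grab the last one with this CSS selector
    if nav is None:
        return None
    # search the navigation bar (the heading's parent) for the link whose text contains "next"
    link = nav.parent.find("a", string=re.compile("next"))
    if link is None:  # no "next" link, e.g. on the last page
        return None
    return urljoin(BASE_URL, link["href"])


next_url = get_next_url(BASE_URL)  # avoid shadowing the built-in next()
while next_url:
    print(next_url)  # do something with each page here
    next_url = get_next_url(next_url)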

Answer 3

Here is my version of a function that recursively walks the pages of the Python tutorial. It is shorter and, I think, clearer.

from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup as bs

base_url = 'https://docs.python.org/3/tutorial/'

def find_pages(url):
    """Loop over all pages in online Python tutorial."""
    # try to open the url
    try:
        page = urlopen(url).read()
    # quit if the page cannot be fetched
    except HTTPError:
        print("The end!")
        return

    # parse the page
    soup = bs(page, 'html.parser')

    # find the first link whose text is exactly 'next'
    next_link = soup.find('a', string='next')
    # quit if there's no Next link
    if next_link is None:
        print("The end!")
        return
    next_url = next_link.get('href')

    # do something meaningful with the scraped page here
    print(next_url)

    # recur with the newly obtained next page's url
    find_pages(base_url + next_url)

find_pages(base_url)

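The program can be broken down into the following parts:

Fetch the page's HTML with urllib (if you are working with BeautifulSoup, urllib is worth learning!)
Parse the page with BS
Find a link that contains the word 'next' (see the BS docs)
Do something with the page if needed (I just print the link name)
Repeat all of the steps above for the next page, until there is no next page

Code tested in Python 3. Happy hacking and learning!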
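The same steps can also be written as a loop instead of recursion. The following is only a sketch under stated assumptions: it swaps BeautifulSoup for the standard library's html.parser so that it runs without extra installs, the names NextLinkParser and follow_next are hypothetical, and the fake_site dictionary is a hand-written stand-in for real HTTP responses.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkParser(HTMLParser):
    """Record the href of the first <a> whose text is exactly 'next'."""

    def __init__(self):
        super().__init__()
        self.next_href = None
        self._current_href = None  # href of the <a> we are inside, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "a":
            self._current_href = None

    def handle_data(self, data):
        if self._current_href and self.next_href is None and data.strip() == "next":
            self.next_href = self._current_href


def follow_next(start_url, fetch, max_pages=100):
    """Yield page URLs by following 'next' links; `fetch` maps a URL to HTML text."""
    url, seen = start_url, set()
    # Stop when there is no next link, on a repeated URL, or at the page limit.
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        yield url
        parser = NextLinkParser()
        parser.feed(fetch(url))
        url = urljoin(url, parser.next_href) if parser.next_href else None


# Tiny in-memory "site" (assumed data) standing in for real HTTP responses.
fake_site = {
    "https://example.com/t/": '<a href="a.html">next</a>',
    "https://example.com/t/a.html": '<a href="index.html">previous</a> <a href="b.html">next</a>',
    "https://example.com/t/b.html": '<a href="a.html">previous</a>',
}
visited = list(follow_next("https://example.com/t/", fake_site.__getitem__))
print(visited)
```

For a real site, fetch could be something like lambda u: urlopen(u).read().decode("utf-8"). The seen set and max_pages limit are there so the loop cannot run forever if the links form a cycle.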

Answer 4

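You need to follow them one by one. Or you can take the links from the index. For example, the page https://docs.python.org/3/tutorial/ contains all the links you would go through by following the next button. So you can grab them all from this one place.

You have to decide which approach works best. That usually takes analyzing the link structure and giving it some thought.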
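As a rough sketch of the index approach, the code below collects links from a snippet of index-like HTML, keeps only those carrying both the reference and internal classes, and reduces them to one URL per page. The sample_html string is hand-written to resemble the tutorial index and is an assumption, as are the names IndexLinkParser and pages_from_index; it uses the standard library's html.parser rather than BeautifulSoup so that it runs without extra installs.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class IndexLinkParser(HTMLParser):
    """Collect hrefs of <a> tags that carry all of the required classes."""

    def __init__(self, required_classes):
        super().__init__()
        self.required = set(required_classes)
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        if tag == "a" and attrs.get("href") and self.required <= classes:
            self.hrefs.append(attrs["href"])


def pages_from_index(index_url, html):
    """Return absolute page URLs from the index, in order, without duplicates."""
    parser = IndexLinkParser(["reference", "internal"])
    parser.feed(html)
    pages = []
    for href in parser.hrefs:
        # Make the link absolute, then drop the #anchor part.
        page = urldefrag(urljoin(index_url, href)).url
        if page not in pages:
            pages.append(page)
    return pages


# Hand-written sample (assumed), not the real index page.
sample_html = """
<a class="reference internal" href="appetite.html">1. Whetting Your Appetite</a>
<a class="reference internal" href="interpreter.html">2. Using the Python Interpreter</a>
<a class="reference internal" href="interpreter.html#invoking-the-interpreter">2.1. Invoking the Interpreter</a>
<a class="reference external" href="https://www.python.org/">python.org</a>
"""
pages = pages_from_index("https://docs.python.org/3/tutorial/", sample_html)
print(pages)
```

Whether the index really lists every page in reading order is something to verify against the actual site before relying on this.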

Python BeautifulSoup4: web scraping multiple pages on one website
