Using BeautifulSoup

Posted: 2016-09-12 00:48:21

Tags: python beautifulsoup

I want to crawl through several pages of a website with Python and BeautifulSoup4. The page URLs differ only by a single number, so I can actually construct them with a statement like this:

theurl = "beginningofurl/" + str(counter) + "/endofurl.html"

The link I am testing with is the WorldOfQuotes "Nature" topic page used in the script below.

My Python script is this:

import urllib
import urllib.request
from bs4 import BeautifulSoup


def category_crawler():
    ''' This function will crawl through an entire category, regardless how many pages it consists of. '''

    pager = 1

    while pager < 11:  # hard-coded number of pages; this is what I want to get rid of
        theurl = "http://www.worldofquotes.com/topic/Nature/"+str(pager)+"/index.html"
        thepage = urllib.request.urlopen(theurl)
        soup = BeautifulSoup(thepage, "html.parser")

        for link in soup.findAll('blockquote'):
            sanitized = link.find('p').text.strip()
            spantext = link.find('a')
            writer = spantext.find('span').text
            print(sanitized)
            print(writer)
            print('---------------------------------------------------------')


        pager += 1

category_crawler()

So the question is: how do I replace the hard-coded number in the while loop with something that lets the script recognize on its own that it has gone past the last page, and then exit automatically?

3 Answers:

Answer 0 (score: 2):

The idea is to use an infinite loop and break out of it when there is no "right arrow" element on the page, which means you are on the last page. Simple and quite logical:

import requests
from bs4 import BeautifulSoup


page = 1
url = "http://www.worldofquotes.com/topic/Nature/{page}/index.html"
with requests.Session() as session:
    while True:
        response = session.get(url.format(page=page))
        soup = BeautifulSoup(response.content, "html.parser")

        # TODO: parse the page and collect the results

        if soup.find(class_="icon-arrow-right") is None:
            break  # last page

        page += 1
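
For completeness, here is a sketch of how the TODO above could be filled in, reusing the parsing logic from the question:

for quote in soup.find_all('blockquote'):
    text = quote.find('p').text.strip()          # the quote itself
    author = quote.find('a').find('span').text   # the author's name
    print(text)
    print(author)
    print('---------------------------------------------------------')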

Answer 1 (score: 0):

Try using requests (avoiding redirects) and check whether there are any new quotes on the page.

import requests
from bs4 import BeautifulSoup


def category_crawler():
    ''' This function will crawl through an entire category, regardless how many pages it consists of. '''

    pager = 1

    while pager < 11:
        theurl = "http://www.worldofquotes.com/topic/Art/"+str(pager)+"/index.html"
        thepage = requests.get(theurl, allow_redirects=False).text
        soup = BeautifulSoup(thepage, "html.parser")

        for link in soup.find_all('blockquote'):
            sanitized = link.find('p').text.strip()
            spantext = link.find('a')
            writer = spantext.find('span').text
            if not sanitized:
                break
            print(sanitized)
            print(writer)
            print('---------------------------------------------------------')


        pager += 1

category_crawler()
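
Since allow_redirects=False is used here, the hard-coded limit could also be dropped by breaking out as soon as the site answers with a redirect. This is only a sketch, assuming that requests for pages past the last one are redirected (which is why redirects are disabled in the first place); the request line inside the loop would become:

response = requests.get(theurl, allow_redirects=False)
if response.status_code in (301, 302):
    break  # redirected: we are past the last page
thepage = response.text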

Answer 2 (score: 0):

Here is my attempt.

Minor issue: put a try-except block around the request, in case a redirect leads you somewhere that does not exist.

Now, the main issue: how to avoid parsing content that has already been parsed. Keep a record of the URLs you have already parsed, and then check whether the actual URL urllib ends up reading (obtained with the geturl() method of thepage) has been read before. Works on my Mac OS X machine.

Note: from what I can see on the site there are 10 pages in total, but this approach needs no prior knowledge of the pages' HTML; it applies generally.

import urllib
import urllib.request
from bs4 import BeautifulSoup


def category_crawler():
    ''' This function will crawl through an entire category, regardless how many pages it consists of. '''
    urlarchive = []  # URLs that have already been fetched
    pager = 1
    while True:
        theurl = "http://www.worldofquotes.com/topic/Nature/"+str(pager)+"/index.html"
        thepage = None
        try:
            thepage = urllib.request.urlopen(theurl)
            if thepage.geturl() in urlarchive:
                # the resolved URL was already parsed, so we are past the last page
                break
            else:
                urlarchive.append(thepage.geturl())
                print(pager)
        except:
            break
        soup = BeautifulSoup(thepage, "html.parser")

        for link in soup.findAll('blockquote'):
            sanitized = link.find('p').text.strip()
            spantext = link.find('a')
            writer = spantext.find('span').text
            print(sanitized)
            print(writer)
            print('---------------------------------------------------------')


        pager += 1

category_crawler()
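
As a side note (not part of the original answer), the bare except above could be narrowed to the exception urllib raises for HTTP errors, so unrelated problems are not silently swallowed. A minimal sketch of how the request inside the loop could be written:

import urllib.error

try:
    thepage = urllib.request.urlopen(theurl)
except urllib.error.HTTPError:
    break  # e.g. a 404 once we run past the last page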