Using BeautifulSoup

Posted: 2016-09-12 00:48:21

Tags: python beautifulsoup

I want to crawl through several pages of a website with Python and BeautifulSoup4. The page URLs differ only by a single number, so I can actually construct them with a statement like this:

theurl = "beginningofurl/" + str(counter) + "/endofurl.html"

The link I am testing with is the WorldOfQuotes "Nature" topic page used in the script below.

My Python script is this:

import urllib
import urllib.request
from bs4 import BeautifulSoup


def category_crawler():
    ''' This function will crawl through an entire category, regardless how many pages it consists of. '''

    pager = 1

    while pager < 11:  # hard-coded number of pages; this is what I want to get rid of
        theurl = "http://www.worldofquotes.com/topic/Nature/"+str(pager)+"/index.html"
        thepage = urllib.request.urlopen(theurl)
        soup = BeautifulSoup(thepage, "html.parser")

        for link in soup.findAll('blockquote'):
            sanitized = link.find('p').text.strip()
            spantext = link.find('a')
            writer = spantext.find('span').text
            print(sanitized)
            print(writer)
            print('---------------------------------------------------------')


        pager += 1

category_crawler()

So the question is: how do I replace the hard-coded number in the while loop with something that lets the script recognize on its own that it has gone past the last page, and then exit automatically?

3 Answers:

Answer 0 (score: 2):

The idea is to use an infinite loop and break out of it when there is no "right arrow" element on the page, which means you are on the last page. Simple and quite logical:

import requests
from bs4 import BeautifulSoup


page = 1
url = "http://www.worldofquotes.com/topic/Nature/{page}/index.html"
with requests.Session() as session:
    while True:
        response = session.get(url.format(page=page))
        soup = BeautifulSoup(response.content, "html.parser")

        # TODO: parse the page and collect the results

        if soup.find(class_="icon-arrow-right") is None:
            break  # last page

        page += 1
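
For completeness, here is a sketch of how the TODO above could be filled in, reusing the parsing logic from the question:

for quote in soup.find_all('blockquote'):
    text = quote.find('p').text.strip()          # the quote itself
    author = quote.find('a').find('span').text   # the author's name
    print(text)
    print(author)
    print('---------------------------------------------------------')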

Answer 1 (score: 0):

Try using requests (avoiding redirects) and check whether there are any new quotes on the page.

import requests
from bs4 import BeautifulSoup


def category_crawler():
    ''' This function will crawl through an entire category, regardless how many pages it consists of. '''

    pager = 1

    while pager < 11:
        theurl = "http://www.worldofquotes.com/topic/Art/"+str(pager)+"/index.html"
        thepage = requests.get(theurl, allow_redirects=False).text
        soup = BeautifulSoup(thepage, "html.parser")

        for link in soup.find_all('blockquote'):
            sanitized = link.find('p').text.strip()
            spantext = link.find('a')
            writer = spantext.find('span').text
            if not sanitized:
                break
            print(sanitized)
            print(writer)
            print('---------------------------------------------------------')


        pager += 1

category_crawler()
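
Since allow_redirects=False is used here, the hard-coded limit could also be dropped by breaking out as soon as the site answers with a redirect. This is only a sketch, assuming that requests for pages past the last one are redirected (which is why redirects are disabled in the first place); the request line inside the loop would become:

response = requests.get(theurl, allow_redirects=False)
if response.status_code in (301, 302):
    break  # redirected: we are past the last page
thepage = response.text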

Answer 2 (score: 0):

Here is my attempt.

Minor issue: put a try-except block around the request, in case a redirect leads you somewhere that does not exist.

Now, the main issue: how to avoid parsing content that has already been parsed. Keep a record of the URLs you have already parsed, and then check whether the actual URL urllib ends up reading (obtained with the geturl() method of thepage) has been read before. Works on my Mac OS X machine.

Note: from what I can see on the site there are 10 pages in total, but this approach needs no prior knowledge of the pages' HTML; it applies generally.

import urllib
import urllib.request
from bs4 import BeautifulSoup


def category_crawler():
    ''' This function will crawl through an entire category, regardless how many pages it consists of. '''
    urlarchive = []  # URLs that have already been fetched
    pager = 1
    while True:
        theurl = "http://www.worldofquotes.com/topic/Nature/"+str(pager)+"/index.html"
        thepage = None
        try:
            thepage = urllib.request.urlopen(theurl)
            if thepage.geturl() in urlarchive:
                # the resolved URL was already parsed, so we are past the last page
                break
            else:
                urlarchive.append(thepage.geturl())
                print(pager)
        except:
            break
        soup = BeautifulSoup(thepage, "html.parser")

        for link in soup.findAll('blockquote'):
            sanitized = link.find('p').text.strip()
            spantext = link.find('a')
            writer = spantext.find('span').text
            print(sanitized)
            print(writer)
            print('---------------------------------------------------------')


        pager += 1

category_crawler()
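
As a side note (not part of the original answer), the bare except above could be narrowed to the exception urllib raises for HTTP errors, so unrelated problems are not silently swallowed. A minimal sketch of how the request inside the loop could be written:

import urllib.error

try:
    thepage = urllib.request.urlopen(theurl)
except urllib.error.HTTPError:
    break  # e.g. a 404 once we run past the last page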