How to iterate over links when the total number of pages is unknown?

Asked: 2013-08-28 09:55:40

Tags: python-2.7 beautifulsoup

I want to get all the application links on every page. The problem is that the total number of pages is not the same in every category. I have this code:

import urllib
from bs4 import BeautifulSoup

url = 'http://www.brothersoft.com/windows/mp3_audio/'
pageUrl = urllib.urlopen(url)
soup = BeautifulSoup(pageUrl)

# Walk every category link in the left-hand menu.
for a in soup.select('div.coLeft.cate.mBottom dd a[href]'):
    suburl = 'http://www.brothersoft.com' + a['href'].encode('utf-8', 'replace')
    print suburl

    # Hard-coded page count: only correct for categories with exactly 27 pages.
    for page in range(1, 27 + 1):
        content = urllib.urlopen(suburl + '{}.html'.format(page))
        soup = BeautifulSoup(content)
        for a in soup.select('div.freeText dl a[href]'):
            print 'http://www.brothersoft.com' + a['href'].encode('utf-8', 'replace')

But this only collects application links from 27 pages in each category. What if a category has fewer, or more, than 27 pages?

1 Answer:

Answer 0 (score: 1)

You can extract how many programs a category has and divide that by 20 (the number of programs per page). For example, if you open the URL http://www.brothersoft.com/windows/photo_image/font_tools/2.html, then:

import re
import urllib
from bs4 import BeautifulSoup

url = 'http://www.brothersoft.com/windows/photo_image/font_tools/2.html'
pageUrl = urllib.urlopen(url)
soup = BeautifulSoup(pageUrl)

# The pager block's text reads something like "1-20 of 347 results".
pages = soup.find("div", {"class": "freemenu coLeft Menubox"})
page = pages.text
# 20 programs per page, so total pages = total programs / 20 + 1.
print int(re.search(r'of (\d+) ', page).group(1)) / 20 + 1

The output will be:

18

For the URL http://www.brothersoft.com/windows/photo_image/cad_software/6.html the output is 108.
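The page-count arithmetic can be checked in isolation. In this minimal sketch, `pager_text` is a made-up stand-in for the string the `freemenu` div contains (the real value comes from `pages.text` above):

```python
import re

# Made-up stand-in for the pager text scraped from the category page;
# the real div reads something like "1-20 of 347 results".
pager_text = "1-20 of 347 results"

total = int(re.search(r'of (\d+)', pager_text).group(1))  # 347 programs
pages = total // 20 + 1  # 20 programs per page -> 18 pages
```

One caveat: when the total is an exact multiple of 20, `total // 20 + 1` counts one page too many; ceiling division, `-(-total // 20)`, would be exact in every case.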

So you need to open some page of each category where that count is shown, scrape the number, and then you can run your loop. It could look like this:

import re
import urllib
from bs4 import BeautifulSoup

url = 'http://www.brothersoft.com/windows/photo_image/'
pageUrl = urllib.urlopen(url)
soup = BeautifulSoup(pageUrl)

for a in soup.select('div.coLeft.cate.mBottom dd a[href]'):
    suburl = 'http://www.brothersoft.com' + a['href'].encode('utf-8', 'replace')
    print suburl

    # Open one page of the category and read the total from its pager text.
    content = urllib.urlopen(suburl + '2.html')
    soup1 = BeautifulSoup(content)
    pages = soup1.find("div", {"class": "freemenu coLeft Menubox"})
    page = pages.text
    allPages = int(re.search(r'of (\d+) ', page).group(1)) / 20 + 1
    print allPages

    # Now iterate over exactly that many pages.
    for page in range(1, allPages + 1):
        content = urllib.urlopen(suburl + '{}.html'.format(page))
        soup = BeautifulSoup(content)
        for a in soup.select('div.freeText dl a[href]'):
            print 'http://www.brothersoft.com' + a['href'].encode('utf-8', 'replace')
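If you would rather not depend on the pager text at all, another option is to keep fetching pages until one comes back with no application links. The helper below is only a sketch: `fetch_links` is a hypothetical callable you would implement yourself with `urllib` and BeautifulSoup, exactly as in the inner loop above, returning the list of links found on a given page number:

```python
from itertools import count

def iter_app_links(fetch_links, max_pages=1000):
    """Yield links page by page, stopping at the first page with no links.

    fetch_links(page) must return the list of links on that page
    (an empty list once we are past the last real page).
    """
    for page in count(1):
        if page > max_pages:  # safety cap against a site that never runs out
            break
        links = fetch_links(page)
        if not links:
            break
        for link in links:
            yield link
```

For example, `list(iter_app_links(lambda p: scrape_page(suburl, p)))` would collect every link in a category, where `scrape_page` is your own wrapper around `urllib.urlopen` and `soup.select`. This trades one extra HTTP request (the first empty page) for not having to parse the pager at all.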