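I want to get all the application links on every page of each category. The problem is that the total number of pages differs from category to category. I have this code: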
    import urllib
    from bs4 import BeautifulSoup

    url = 'http://www.brothersoft.com/windows/mp3_audio/'
    pageUrl = urllib.urlopen(url)
    soup = BeautifulSoup(pageUrl)

    # Each link in the category box points to a sub-category
    for a in soup.select('div.coLeft.cate.mBottom dd a[href]'):
        print 'http://www.brothersoft.com' + a['href'].encode('utf-8', 'replace')
        suburl = 'http://www.brothersoft.com' + a['href'].encode('utf-8', 'replace')
        # The page count is hard-coded to 27 here
        for page in range(1, 27 + 1):
            content = urllib.urlopen(suburl + '{}.html'.format(page))
            soup = BeautifulSoup(content)
            for a in soup.select('div.freeText dl a[href]'):
                print 'http://www.brothersoft.com' + a['href'].encode('utf-8', 'replace')
But this only gets the application links from 27 pages in each category. What if another category has fewer than 27 pages, or more?
Answer 0 (score: 1)
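You can extract the total number of programs in a category and divide it by 20, the number of programs listed per page. For example, if you open the URL http://www.brothersoft.com/windows/photo_image/font_tools/2.html, then: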
    import re
    import urllib
    from bs4 import BeautifulSoup

    url = 'http://www.brothersoft.com/windows/photo_image/font_tools/2.html'
    pageUrl = urllib.urlopen(url)
    soup = BeautifulSoup(pageUrl)
    # The pager block contains text of the form "... of <total> ..."
    pages = soup.find("div", {"class": "freemenu coLeft Menubox"})
    page = pages.text
    # 20 programs per page; "/" is integer division in Python 2
    print int(re.search(r'of ([\d]+) ', page).group(1)) / 20 + 1
The output will be:

    18
For http://www.brothersoft.com/windows/photo_image/cad_software/6.html the output is 108.
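The page-count arithmetic can be isolated into a small helper, which makes it easier to check. This is only a sketch: `page_count`, `ITEMS_PER_PAGE` and the sample strings are made up for illustration, and it is written for Python 3, where `/` is true division and `//` is needed. It uses ceiling division, because the `/ 20 + 1` formula reports one page too many whenever the total is an exact multiple of 20.

```python
import re

ITEMS_PER_PAGE = 20  # listing size taken from the answer above


def page_count(menu_text, per_page=ITEMS_PER_PAGE):
    """Return the number of listing pages implied by text such as '... of 350 ...'."""
    match = re.search(r'of (\d+)', menu_text)
    if match is None:
        raise ValueError('no "of <total>" fragment found in: %r' % menu_text)
    total = int(match.group(1))
    # Ceiling division: 340 items fill exactly 17 pages, 341 need 18.
    return -(-total // per_page)


# Hypothetical sample strings; the site's real wording may differ.
print(page_count('1-20 of 350 programs'))
print(page_count('1-20 of 340 programs'))
```

With the sample strings above, the first call gives 18 and the second 17, whereas `340 / 20 + 1` in the Python 2 code would give 18.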
So you need to open a page that shows how many programs there are, scrape that number, and then run your loop. It could look like this:
    import re
    import urllib
    from bs4 import BeautifulSoup

    url = 'http://www.brothersoft.com/windows/photo_image/'
    pageUrl = urllib.urlopen(url)
    soup = BeautifulSoup(pageUrl)

    for a in soup.select('div.coLeft.cate.mBottom dd a[href]'):
        suburl = 'http://www.brothersoft.com' + a['href'].encode('utf-8', 'replace')
        print suburl
        # Open one listing page of the sub-category to read the total
        content = urllib.urlopen(suburl + '2.html')
        soup1 = BeautifulSoup(content)
        pages = soup1.find("div", {"class": "freemenu coLeft Menubox"})
        page = pages.text
        # 20 programs per page; "/" is integer division in Python 2
        allPages = int(re.search(r'of ([\d]+) ', page).group(1)) / 20 + 1
        print allPages
        # Walk every listing page and print its application links
        for page in range(1, allPages + 1):
            content = urllib.urlopen(suburl + '{}.html'.format(page))
            soup = BeautifulSoup(content)
            for a in soup.select('div.freeText dl a[href]'):
                print 'http://www.brothersoft.com' + a['href'].encode('utf-8', 'replace')
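If the total cannot be scraped reliably, a different technique from the one above is to keep requesting `1.html`, `2.html`, ... until a page yields no links. The sketch below (Python 3) shows only that loop. `collect_links`, `fetch_links` and `fake_fetch` are hypothetical names. It also assumes that a page past the end of a category produces no application links, which I have not verified for this site: it might instead return an error or repeat the last page.

```python
def collect_links(suburl, fetch_links, max_pages=500):
    """Gather links from suburl + '1.html', '2.html', ... until a page is empty."""
    links = []
    for page in range(1, max_pages + 1):  # max_pages is a safety cap
        page_links = fetch_links('{}{}.html'.format(suburl, page))
        if not page_links:  # empty list or None: assume we ran past the last page
            break
        links.extend(page_links)
    return links


# Offline stand-in for a real fetcher: pretend the category has 3 pages.
def fake_fetch(url):
    fake_site = {
        'http://example.com/cat/1.html': ['/a', '/b'],
        'http://example.com/cat/2.html': ['/c'],
        'http://example.com/cat/3.html': ['/d'],
    }
    return fake_site.get(url, [])


print(collect_links('http://example.com/cat/', fake_fetch))
```

In real use, `fetch_links` would wrap the `urlopen` call and the `soup.select('div.freeText dl a[href]')` lookup from the code above and return the `href` values for that page.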