I want to crawl several pages of a website using Python and BeautifulSoup4. The pages' URLs differ only by a number, so I can actually build them with a statement like this:
theurl = "beginningofurl/" + str(counter) + "/endofurl.html"
The link I am testing is:
My Python script is this:
import urllib
import urllib.request
from bs4 import BeautifulSoup


def category_crawler():
    ''' This function will crawl through an entire category, regardless how many pages it consists of. '''
    pager = 1
    while pager < 11:
        theurl = "http://www.worldofquotes.com/topic/Nature/" + str(pager) + "/index.html"
        thepage = urllib.request.urlopen(theurl)
        soup = BeautifulSoup(thepage, "html.parser")
        for link in soup.findAll('blockquote'):
            sanitized = link.find('p').text.strip()
            spantext = link.find('a')
            writer = spantext.find('span').text
            print(sanitized)
            print(writer)
            print('---------------------------------------------------------')
        pager += 1

category_crawler()
So the question is: how do I replace the hard-coded number in the while loop with a solution that lets the script recognize on its own that it has passed the last page, and then exit automatically?
Answer 0 (score: 2)
The idea is to use an infinite loop and break out of it when there is no "right arrow" element on the page, which means you are on the last page. Simple and quite logical:
import requests
from bs4 import BeautifulSoup

page = 1
url = "http://www.worldofquotes.com/topic/Nature/{page}/index.html"

with requests.Session() as session:
    while True:
        response = session.get(url.format(page=page))
        soup = BeautifulSoup(response.content, "html.parser")

        # TODO: parse the page and collect the results

        if soup.find(class_="icon-arrow-right") is None:
            break  # last page

        page += 1
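For completeness, the TODO could be filled with the blockquote parsing the question already uses; a minimal sketch, assuming the markup targeted by the question's code (a p tag and an a > span inside each blockquote):

        # drop-in for the TODO above, inside the while loop
        for quote in soup.find_all('blockquote'):
            text = quote.find('p').text.strip()
            writer = quote.find('a').find('span').text
            print(text)
            print(writer)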
Answer 1 (score: 0)
Try requests (to avoid following redirects) and check whether there are any new quotes.
import requests
from bs4 import BeautifulSoup


def category_crawler():
    ''' This function will crawl through an entire category, regardless how many pages it consists of. '''
    pager = 1
    while True:
        theurl = "http://www.worldofquotes.com/topic/Art/" + str(pager) + "/index.html"
        # allow_redirects=False: a page number past the last page answers with a
        # redirect instead of content, so its body contains no blockquotes
        thepage = requests.get(theurl, allow_redirects=False).text
        soup = BeautifulSoup(thepage, "html.parser")
        quotes = soup.find_all('blockquote')
        if not quotes:  # no new quotes, so we are past the last page
            break
        for link in quotes:
            sanitized = link.find('p').text.strip()
            spantext = link.find('a')
            writer = spantext.find('span').text
            print(sanitized)
            print(writer)
            print('---------------------------------------------------------')
        pager += 1

category_crawler()
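The stop condition above leans on the site answering out-of-range page numbers with a redirect when allow_redirects=False is set. A quick, hypothetical probe (the /999/ page number is made up) to confirm that behaviour before relying on it:

import requests

theurl = "http://www.worldofquotes.com/topic/Art/999/index.html"  # hypothetical out-of-range page
r = requests.get(theurl, allow_redirects=False)
print(r.status_code, r.is_redirect)  # a 3xx status here confirms the redirect behaviour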
Answer 2 (score: 0)
Here is my attempt.
Minor issue: put a try-except block in the code, in case a redirect leads you somewhere that does not exist.
Now, the main issue: how to avoid parsing content that has already been parsed. Keep a record of the URLs you have already parsed, then detect whether the actual URL urllib is reading (obtained with the geturl() method of thepage) has already been read. Works on my Mac OS X machine.
Note: judging from what I can see on the website there are 10 pages in total, and this approach does not require prior knowledge of the pages' HTML; it works in general.
import urllib
import urllib.request
from bs4 import BeautifulSoup


def category_crawler():
    ''' This function will crawl through an entire category, regardless how many pages it consists of. '''
    urlarchive = []
    pager = 1
    while True:
        theurl = "http://www.worldofquotes.com/topic/Nature/" + str(pager) + "/index.html"
        thepage = None
        try:
            thepage = urllib.request.urlopen(theurl)
            # past the last page the site redirects, so geturl() returns a URL
            # that has already been seen -- that is the stop condition
            if thepage.geturl() in urlarchive:
                break
            else:
                urlarchive.append(thepage.geturl())
                print(pager)
        except:
            # the redirect led somewhere that does not exist
            break
        soup = BeautifulSoup(thepage, "html.parser")
        for link in soup.findAll('blockquote'):
            sanitized = link.find('p').text.strip()
            spantext = link.find('a')
            writer = spantext.find('span').text
            print(sanitized)
            print(writer)
            print('---------------------------------------------------------')
        pager += 1

category_crawler()
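A possible simplification of the same idea (a hypothetical variant, not part of the answer above): since urlopen follows redirects, comparing geturl() with the URL that was requested already tells you the page number is past the end, with no archive needed:

import urllib.request

theurl = "http://www.worldofquotes.com/topic/Nature/999/index.html"  # hypothetical out-of-range page
thepage = urllib.request.urlopen(theurl)
if thepage.geturl() != theurl:
    print("redirected to", thepage.geturl(), "-- past the last page")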