I'm learning by trial and error, hacking away at scripts like this one.
Can anyone tell me how to loop through all the pages of a website and then print information for each URL?
url = "http://example.com"
urls = [url] # Stack of urls to scrape
visited = [url] # Record of scraped urls
htmltext = urllib.urlopen(urls[0]).read()
# While stack of urls is greater than 0, keep scraping for links
while len(urls) > 0:
try:
htmltext = urllib.urlopen(urls[0]).read()
# Except for visited urls
except:
print urls[0]
# Get and Print Information
soup = BeautifulSoup(htmltext, "lxml")
urls.pop(0)
info = soup.findAll(['title', 'h1', 'h2', 'p'])
for script in soup("script"):
soup.script.extract()
print info
# Number of URLs in stack
print len(urls)
# Append Incomplete Tags
for tag in soup.findAll('a',href=True):
tag['href'] = urlparse.urljoin(url,tag['href'])
if url in tag['href'] and tag['href'] not in visited:
urls.append(tag['href'])
visited.append(tag['href'])
Answer 0 (score: 0):
Comments:

visited is mentioned twice: in one instance it is used as a list of urls, in the other as a list of HTML elements. That is rather naughty.

In the while loop, each url is requested from its website and read into htmltext inside a try/except. Note that each time through the loop, the previous contents of htmltext are overwritten and lost. BeautifulSoup has to be called each time htmltext becomes available, and the resulting soup processed before soup is created again.

Keeping to your style, I would write this kind of code in the following way.
import requests
import bs4

urls = ['url_1', 'url_2', 'url_3', 'url_4', 'url_5', 'url_6', 'url_7', 'url_8', 'url_9', 'url_10']

while urls:
    url = urls.pop(0)
    print(url)
    try:
        htmltext = requests.get(url).content
    except:
        print('*** attempt to open ' + url + ' failed')
        continue
    soup = bs4.BeautifulSoup(htmltext, 'lxml')
    title = soup.find('title')
    print(title)
The pop method of the urls list removes items from it, so we no longer need to keep a separate record of visited urls. As items are popped off urls, the list gets shorter and eventually becomes empty; while urls simply asks whether urls is still non-empty.
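For completeness, here is a minimal sketch (not part of the original answer) that combines this loop with the link-following from the question, so that every page reachable on the same site is eventually visited and its title/h1/h2/p content printed. The start URL, the timeout, and the use of a set for visited are illustrative assumptions, not something given in the original posts.

import requests
import bs4
from urllib.parse import urljoin

start = 'http://example.com'   # assumed start page, as in the question
urls = [start]                 # queue of urls still to scrape
visited = {start}              # urls already queued, so nothing is fetched twice

while urls:
    url = urls.pop(0)
    try:
        htmltext = requests.get(url, timeout=10).content
    except requests.exceptions.RequestException:
        print('*** attempt to open ' + url + ' failed')
        continue

    soup = bs4.BeautifulSoup(htmltext, 'lxml')

    # Remove <script> tags, then print the information the question asked for.
    for script in soup('script'):
        script.extract()
    print(url)
    print(soup.find_all(['title', 'h1', 'h2', 'p']))

    # Queue every same-site link that has not been seen yet.
    for tag in soup.find_all('a', href=True):
        href = urljoin(url, tag['href'])
        if start in href and href not in visited:
            urls.append(href)
            visited.add(href)

A set is used for visited so the membership test stays cheap; if the crawl grows large, replacing the urls list with collections.deque would also make removing items from the front O(1).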