I'm learning by trial and error, hacking away at scripts like this one.
Can anyone tell me how to loop through all the pages of a website and then print information for each URL?
url = "http://example.com"
urls = [url] # Stack of urls to scrape
visited = [url] # Record of scraped urls
htmltext = urllib.urlopen(urls[0]).read()
# While stack of urls is greater than 0, keep scraping for links
while len(urls) > 0:
try:
htmltext = urllib.urlopen(urls[0]).read()
# Except for visited urls
except:
print urls[0]
# Get and Print Information
soup = BeautifulSoup(htmltext, "lxml")
urls.pop(0)
info = soup.findAll(['title', 'h1', 'h2', 'p'])
for script in soup("script"):
soup.script.extract()
print info
# Number of URLs in stack
print len(urls)
# Append Incomplete Tags
for tag in soup.findAll('a',href=True):
tag['href'] = urlparse.urljoin(url,tag['href'])
if url in tag['href'] and tag['href'] not in visited:
urls.append(tag['href'])
visited.append(tag['href'])
Answer 0 (score: 0):
Comments:

visited is mentioned twice: in one instance it is used as a list of urls, in the other as a list of HTML elements. That is rather naughty.

In the while loop, each url is requested from its website and read into htmltext inside a try/except. Note that each time through the loop, the previous contents of htmltext are overwritten and lost. BeautifulSoup has to be called each time htmltext becomes available, and the resulting soup processed before soup is created again.

Keeping to your style, I would write this kind of code in the following way.
import requests
import bs4

urls = ['url_1', 'url_2', 'url_3', 'url_4', 'url_5', 'url_6', 'url_7', 'url_8', 'url_9', 'url_10']

while urls:
    url = urls.pop(0)
    print(url)
    try:
        htmltext = requests.get(url).content
    except:
        print('*** attempt to open ' + url + ' failed')
        continue
    soup = bs4.BeautifulSoup(htmltext, 'lxml')
    title = soup.find('title')
    print(title)
The pop method of the urls list removes items from it, so we no longer need to keep a separate record of visited urls. As items are popped off urls, the list gets shorter and eventually becomes empty; while urls simply asks whether urls is still non-empty.
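For completeness, here is a minimal sketch (not part of the original answer) that combines this loop with the link-following from the question, so that every page reachable on the same site is eventually visited and its title/h1/h2/p content printed. The start URL, the timeout, and the use of a set for visited are illustrative assumptions, not something given in the original posts.

import requests
import bs4
from urllib.parse import urljoin

start = 'http://example.com'   # assumed start page, as in the question
urls = [start]                 # queue of urls still to scrape
visited = {start}              # urls already queued, so nothing is fetched twice

while urls:
    url = urls.pop(0)
    try:
        htmltext = requests.get(url, timeout=10).content
    except requests.exceptions.RequestException:
        print('*** attempt to open ' + url + ' failed')
        continue

    soup = bs4.BeautifulSoup(htmltext, 'lxml')

    # Remove <script> tags, then print the information the question asked for.
    for script in soup('script'):
        script.extract()
    print(url)
    print(soup.find_all(['title', 'h1', 'h2', 'p']))

    # Queue every same-site link that has not been seen yet.
    for tag in soup.find_all('a', href=True):
        href = urljoin(url, tag['href'])
        if start in href and href not in visited:
            urls.append(href)
            visited.add(href)

A set is used for visited so the membership test stays cheap; if the crawl grows large, replacing the urls list with collections.deque would also make removing items from the front O(1).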