Question

I have written a web crawler that gives me the URL and link text of every page linked from a starting page, like this:

import urllib
from bs4 import BeautifulSoup
import urlparse
import mechanize

url = ["http://adbnews.com/area51"]


for u in url:
    br = mechanize.Browser()
    urls = [u]
    visited = [u]
    i = 0
    while i<len(urls):
        try:
            br.open(urls[0])
            urls.pop(0)

            for link in br.links():

                levelLinks = []
                linkText = [] 

                newurl = urlparse.urljoin(link.base_url, link.url)
                b1 = urlparse.urlparse(newurl).hostname
                b2 = urlparse.urlparse(newurl).path
                newurl = "http://"+b1+b2
                linkTxt = link.text
                linkText.append(linkTxt)
                levelLinks.append(newurl)


                if newurl not in visited and urlparse.urlparse(u).hostname in newurl:
                    urls.append(newurl)
                    visited.append(newurl)
                    #print newurl

                    #get Mechanize Links
                    for l,lt in zip(levelLinks,linkText):
                        print newurl,"\n",lt,"\n"


        except:
            urls.pop(0)

The output it produces is:

http://www.adbnews.com/area51/contact.html 
CONTACT 

http://www.adbnews.com/area51/about.html 
ABOUT 

http://www.adbnews.com/area51/index.html 
INDEX 

http://www.adbnews.com/area51/1st/ 
FIRST LEVEL! 

http://www.adbnews.com/area51/1st/bling.html 
BLING 

http://www.adbnews.com/area51/1st/index.html 
INDEX 

http://adbnews.com/area51/2nd/ 
2ND LEVEL

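I would like to add a counter that limits how deep the crawler goes.

I tried adding, for example, steps = 3 and changing while i<len(urls) to while i<steps:

But that only reaches the first level, even though the number says 3...

Any suggestions are welcome.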

Answer 1

If you want to search down to a certain "depth", consider using a recursive function instead of just appending to a list of URLs.

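def crawl(url, depth):
    # scan_page is a placeholder you supply: fetch the page, return (title, links)
    title, links = scan_page(url)
    # Only follow links while we are above the depth limit of 3
    if depth < 3:
        for link in links:
            print(crawl(link, depth + 1))
    return url + "\n" + title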
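To make the outline above concrete, here is a minimal runnable sketch in Python 3 (the question's code is Python 2). It is not the asker's crawler: `get_links` is a hypothetical callable that returns the links found on a page, and `SITE` is a made-up in-memory site used so the example runs without network access. In a real crawler `get_links` would wrap something like the mechanize calls from the question.

```python
def crawl(url, get_links, max_depth=3, depth=1, visited=None):
    """Return (depth, url) pairs for pages reachable within max_depth levels."""
    if visited is None:
        visited = set()
    visited.add(url)
    found = [(depth, url)]
    # Only follow links while we are still above the depth limit.
    if depth < max_depth:
        for link in get_links(url):
            if link not in visited:
                found.extend(
                    crawl(link, get_links, max_depth, depth + 1, visited)
                )
    return found


# Hypothetical stand-in for a real site: page -> links on that page.
SITE = {
    "/": ["/about", "/1st/"],
    "/about": ["/"],
    "/1st/": ["/1st/bling", "/"],
    "/1st/bling": ["/1st/deep"],
    "/1st/deep": [],
}


def fake_get_links(url):
    # Assumption: unknown pages simply have no links.
    return SITE.get(url, [])


for level, page in crawl("/", fake_get_links, max_depth=3):
    print(level, page)
```

The start page counts as level 1, so with max_depth=3 the page "/1st/deep" (level 4) is never visited. One caveat of the recursive, depth-first order: a page first reached through a long path is marked as visited at that deeper level, so its own links may be skipped even if a shorter path to it exists.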

This makes it easier to control the recursive search, and it is also faster and uses fewer resources :)
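If you would rather keep the list-based loop from the question, an alternative is to store each URL's depth next to it in the queue. A plain loop counter counts pages processed rather than link levels, which may be why the steps attempt stopped early. The sketch below is Python 3 and makes the same assumption as before: `get_links` is a hypothetical callable returning the links on a page, and the `links` dictionary is invented test data, not a real site.

```python
from collections import deque


def crawl_bfs(start, get_links, max_depth=3):
    """Breadth-first crawl; returns (depth, url) pairs in visiting order."""
    queue = deque([(start, 1)])  # each entry carries its own depth
    visited = {start}
    found = []
    while queue:
        url, depth = queue.popleft()
        found.append((depth, url))
        if depth >= max_depth:
            continue  # pages at the limit are reported but not expanded
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return found


# Hypothetical link graph used only to exercise the function.
links = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}

for level, page in crawl_bfs("a", lambda u: links.get(u, []), max_depth=3):
    print(level, page)
```

Because the queue is processed level by level, every page is reached by its shortest path, so the depth recorded for it is the smallest possible one.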

Adding a counter to my Python web crawler
