我已经创建了一个网络抓取工具,它为链接中的所有网站提供链接和文本,如下所示:
import urllib
from bs4 import BeautifulSoup
import urlparse
import mechanize
url = ["http://adbnews.com/area51"]
for u in url:
br = mechanize.Browser()
urls = [u]
visited = [u]
i = 0
while i<len(urls):
try:
br.open(urls[0])
urls.pop(0)
for link in br.links():
levelLinks = []
linkText = []
newurl = urlparse.urljoin(link.base_url, link.url)
b1 = urlparse.urlparse(newurl).hostname
b2 = urlparse.urlparse(newurl).path
newurl = "http://"+b1+b2
linkTxt = link.text
linkText.append(linkTxt)
levelLinks.append(newurl)
if newurl not in visited and urlparse.urlparse(u).hostname in newurl:
urls.append(newurl)
visited.append(newurl)
#print newurl
#get Mechanize Links
for l,lt in zip(levelLinks,linkText):
print newurl,"\n",lt,"\n"
except:
urls.pop(0)
它得到的结果是:
http://www.adbnews.com/area51/contact.html
CONTACT
http://www.adbnews.com/area51/about.html
ABOUT
http://www.adbnews.com/area51/index.html
INDEX
http://www.adbnews.com/area51/1st/
FIRST LEVEL!
http://www.adbnews.com/area51/1st/bling.html
BLING
http://www.adbnews.com/area51/1st/index.html
INDEX
http://adbnews.com/area51/2nd/
2ND LEVEL
我想添加一个可能会限制爬虫深度的反击的计数器。
我尝试添加例如steps = 3
并更改while i<len(urls)
中的while i<steps:
但这只会达到第一级甚至数字说3 ...
欢迎任何建议
答案 0 :(得分:0)
如果您想搜索某个“深度”,请考虑使用递归函数,而不是仅添加URL列表。
def crawl(url, depth):
if depth <= 3:
#Scan page, grab links, title
for link in links:
print crawl(link, depth + 1)
return url +"\n"+ title
这样可以更轻松地控制递归搜索,并且速度更快,资源更少:)