I am new to Python. In the code below I have a crawler that follows newly discovered links. After recursing from the root link, the program seems to stop after printing only a few links, even though it should keep going for quite a while. I am catching and printing exceptions, yet the program terminates successfully, so I am not sure why it stops.
    from urllib import urlopen
    from bs4 import BeautifulSoup

    def crawl(url, seen):
        try:
            if any(url in s for s in seen):
                return 0
            html = urlopen(url).read()
            soup = BeautifulSoup(html)
            for tag in soup.findAll('a', href=True):
                str = tag['href']
                if 'http' in str:
                    print tag['href']
                    seen.append(str)
                    print "--------------"
                    crawl(str, seen)
        except Exception, e:
            print e
            return 0

    def main ():
        print "$ = " , crawl("http://news.google.ca", [])

    if __name__ == "__main__":
        main()
Answer 0 (score: 1)
    for tag in soup.findAll('a', href=True):
        str = tag['href']
        if 'http' in str:
            print tag['href']
            seen.append(str)       # you put the newly found url into *seen*
            print "--------------"
            crawl(str, seen)       # then you try to crawl it
However, inside crawl:

    if any(url in s for s in seen):   # you don't crawl urls that are already in *seen*
        return 0
You should append the url to seen when you actually crawl it, not when you first find it.
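A minimal sketch of that fix, assuming you keep the question's recursive structure: the url is added to seen at the moment it is actually crawled, right before fetching it (the variable name href is mine, used instead of str so the built-in is not shadowed; everything else follows the original code):

    from urllib import urlopen
    from bs4 import BeautifulSoup

    def crawl(url, seen):
        try:
            if any(url in s for s in seen):
                return 0
            seen.append(url)                  # mark the url as crawled now, while we fetch it
            html = urlopen(url).read()
            soup = BeautifulSoup(html)
            for tag in soup.findAll('a', href=True):
                href = tag['href']
                if 'http' in href:
                    print href
                    print "--------------"
                    crawl(href, seen)         # the recursive call no longer rejects its own url
        except Exception, e:
            print e
            return 0

Note that this is still a depth-first crawl; the next answer argues for a breadth-first approach instead.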
Answer 1 (score: 0)
    try:
        if any(url in s for s in seen):
            return 0

Then

    seen.append(str)
    print "--------------"
    crawl(str, seen)
You append str to seen and then call crawl with str and seen as arguments, so the very first check inside crawl finds that url is already in seen and returns 0. Of course your code stops early; you designed it that way.
A better approach is to crawl a page, add all the links you find to a list of pages still to be crawled, and then keep crawling the links in that list. In short, you should crawl breadth-first rather than depth-first.

Something like this should work:
    from urllib import urlopen
    from bs4 import BeautifulSoup

    def crawl(url, seen, to_crawl):
        html = urlopen(url).read()
        soup = BeautifulSoup(html)
        seen.append(url)
        for tag in soup.findAll('a', href=True):
            str = tag['href']
            if 'http' in str:
                if str not in seen and str not in to_crawl:   # only queue links we have not met yet
                    to_crawl.append(str)
                    print tag['href']
                    print "--------------"
        if to_crawl:                                   # stop cleanly once nothing is left to visit
            crawl(to_crawl.pop(0), seen, to_crawl)     # pop from the front so the crawl is breadth-first

    def main ():
        print "$ = " , crawl("http://news.google.ca", [], [])

    if __name__ == "__main__":
        main()
You will probably want to put a limit on the maximum number of URLs it crawls, though.
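One possible way to add such a limit (my own sketch, not code from either answer: the max_pages cap, the deque-based queue, and the seen set are assumptions layered on the same Python 2 / BeautifulSoup setup used above) is to make the breadth-first loop explicit and stop once enough pages have been visited:

    from collections import deque
    from urllib import urlopen
    from bs4 import BeautifulSoup

    def crawl(start_url, max_pages=50):        # max_pages is an illustrative cap
        seen = set()                           # pages already fetched
        to_crawl = deque([start_url])          # frontier of pages still to visit
        while to_crawl and len(seen) < max_pages:
            url = to_crawl.popleft()           # take from the front: breadth-first order
            if url in seen:
                continue
            try:
                html = urlopen(url).read()
            except Exception, e:               # skip pages that fail to load instead of dying
                print e
                continue
            seen.add(url)
            soup = BeautifulSoup(html)
            for tag in soup.findAll('a', href=True):
                href = tag['href']
                if 'http' in href and href not in seen:
                    to_crawl.append(href)
                    print href
            print "--------------"
        return len(seen)

    if __name__ == "__main__":
        print "$ = ", crawl("http://news.google.ca")

Using an explicit queue instead of recursion also sidesteps Python's recursion limit when the list of pages grows large.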