假设我有一个网站:
现在使用时:
from bs4 import BeautifulSoup
def visit(url, recursion=0):
links = getpage_and_retrieve_ahref_links(url) # using beautifulsoup
for link in links:
if recursion < 10: # limit recursion to 10
visit(link, recursion+1)
visit('http://example.com/home')
在访问/ page2之前,它将访问/ home,/ page1,/ page2368,/ page999990,/ page999991,...,/ page999999。简而言之,它正在进行深度优先遍历而不是(我想要的)广度优先遍历。
如何修改以前的代码以进行广度优先访问,即首先使用visit
进行所有calls recursion=1
,然后使用visit
进行recursion=2
次调用等。吗
应按此顺序访问/ home,/ page1,/ page2,/ page3,/ page2368,/ page41,/ page999990等。
答案 0 :(得分:1)
您可以使用双端队列(Python中的deque
)通过在n级链接之后附加n + 1级别的链接来进行广度优先搜索。
from collections import deque
def bfs_visit(url, max_level=10):
queue = deque([url])
level = 0
while queue and level < max_level:
url = queue.popleft()
visit_no_recur(url) # only visits the page
links = get_links(url) # get links, maybe parse the result of last statement
queue.append(links)
level += 1
bfs_visit('http://example.com/home')
在给定的示例中,队列将如下所示:
['/home']
=> popleft /home (i.e. the next page to be visited is /home)
=> add new links on right
['/page1', '/page2', '/page3']
=> popleft /page1
=> add new links on right
['/page2', '/page3', '/page2368']
=> popleft /page2
=> add new links on right
['/page3', '/page2368', '/page41']
=> popleft /page3
['/page2368', '/page41']
=> popleft /page2368
=> add new links on right
['/page41', '/page999990', '/page999991', ..., '/page999999']
...