Breadth-first link crawling

Date: 2018-01-08 11:17:58

Tags: python recursion beautifulsoup depth-first-search breadth-first-search

Suppose I have a website:

Currently, when I use:

from bs4 import BeautifulSoup

def visit(url, recursion=0):
    links = getpage_and_retrieve_ahref_links(url)   # using beautifulsoup
    for link in links:
        if recursion < 10:         # limit recursion to 10 
            visit(link, recursion+1)

visit('http://example.com/home')

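it visits /home, /page1, /page2368, /page999990, /page999991, ..., /page999999 before it visits /page2. In short, it performs a depth-first traversal rather than the breadth-first traversal I want.

How can I modify the code above to visit breadth-first, i.e. first make all the visit calls with recursion=1, then all the visit calls with recursion=2, and so on?

The pages should be visited in this order: /home, /page1, /page2, /page3, /page2368, /page41, /page999990, etc.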

1 answer:

Answer 0 (score: 1)

You can use a double-ended queue (deque in Python) to do a breadth-first search by appending the links of level n+1 after the links of level n.

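from collections import deque

def bfs_visit(url, max_level=10):
    queue = deque([url])
    level = 0  # depth of the pages currently at the front of the queue

    while queue and level < max_level:
        # Everything in the queue right now belongs to the current level,
        # so process exactly that many pages before moving one level deeper.
        for _ in range(len(queue)):
            url = queue.popleft()
            visit_no_recur(url)  # only visits the page

            links = get_links(url)  # get links, maybe parse the result of the last statement
            queue.extend(links)     # extend, not append: add each link, not the list itself
        level += 1

bfs_visit('http://example.com/home')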
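
The answer leaves get_links undefined. As a minimal sketch of the parsing half only, assuming the page's HTML has already been fetched, the following collects the href of every a tag and resolves it against the page URL. It uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra dependencies; LinkCollector and extract_links are hypothetical names, not part of the original answer.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:  # skip anchors without an href
                # Resolve relative links against the page they came from.
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return the links found in html, in document order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links
```

With BeautifulSoup, the collection step would be roughly soup.find_all('a', href=True); the urljoin step stays the same.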

In the given example, the queue will look like this:

['/home']  
    => popleft /home (i.e. the next page to be visited is /home)
    => add new links on right
['/page1', '/page2', '/page3'] 
    => popleft /page1
    => add new links on right
['/page2', '/page3', '/page2368']
    => popleft /page2
    => add new links on right
['/page3', '/page2368', '/page41']
    => popleft /page3
['/page2368', '/page41']
    => popleft /page2368
    => add new links on right
['/page41', '/page999990', '/page999991', ..., '/page999999']
    ...
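
To check the order without touching the network, the trace above can be replayed against an in-memory link graph. This is only a sketch: SITE is a hypothetical graph reconstructed from the trace (the pages elided by "..." are left out), and bfs_order is an illustrative name. It also keeps a seen set, which the answer does not, so that pages linking back to each other are not queued twice.

```python
from collections import deque

# Hypothetical link graph reconstructed from the trace above;
# the pages elided by "..." are not included.
SITE = {
    '/home': ['/page1', '/page2', '/page3'],
    '/page1': ['/page2368'],
    '/page2': ['/page41'],
    '/page3': [],
    '/page2368': ['/page999990', '/page999991', '/page999999'],
}


def bfs_order(start, max_level=10, links=SITE):
    """Return pages in breadth-first order, at most max_level levels deep."""
    queue = deque([start])
    seen = {start}  # pages already queued, to avoid revisiting
    order = []
    level = 0
    while queue and level < max_level:
        # Process every page of the current level before going deeper.
        for _ in range(len(queue)):
            page = queue.popleft()
            order.append(page)
            for link in links.get(page, []):
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
        level += 1
    return order


print(bfs_order('/home'))
```

For this graph the result is /home, /page1, /page2, /page3, /page2368, /page41, /page999990, /page999991, /page999999, which is the order asked for in the question.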