I'm writing a program that stores a large tree of web links in Python using dictionaries. Basically, you start with a root URL and build a dictionary from the URLs found in the root's HTML. In the next step, I want to fetch the page for each of those URLs and collect the links on those pages. In the end, I want one dictionary containing all the links and the relationships between them.
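For example, the kind of nested structure I'm after might look like this (the URLs here are made up):

url_tree = {
    'http://example.com': {
        'http://example.com/a': {
            'http://example.com/a/1': {},
        },
        'http://example.com/b': {},
    },
}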
This is what I have for the first two depth levels:
for link in soup.find_all('a'):
    url = link.get('href')
    url_tree[siteurl][url]

# Get page source
for link in soup.find_all('a'):
    url = link.get('href')
    url_tree[siteurl][secondurl][url]
This system works, but as you can tell, if I want a dictionary N layers deep, that turns into a lot of code blocks. Is there a way to add more layers automatically? Any help is appreciated!
Answer 0 (score: 0)
This can be done with a recursive function.
Here's a basic example that will crawl all the URLs found in a page one by one, then crawl all the URLs found in each of those pages one by one, and so on. It also prints every URL it finds.
# A minimal sketch: fetching the page with `requests` and parsing it
# with BeautifulSoup is an assumption here; swap in your own code.
import requests
from bs4 import BeautifulSoup

def recursive_fetch(url_to_fetch):
    # get page source from url_to_fetch and make a new soup
    page = requests.get(url_to_fetch)
    soup = BeautifulSoup(page.text, 'html.parser')
    for link in soup.find_all('a'):
        url = link.get('href')
        print(url)
        # run the function recursively
        # for the current url
        recursive_fetch(url)

# Usage
recursive_fetch(root_url)
Since you want a dictionary tree of all the URLs found, the code above isn't much help, but it's a start.

This is where it gets really complex, because now you also need to keep track of the parent of the current URL being crawled, that URL's parent, that parent's parent, and so on.

See what I mean? It gets very complex, very fast. Below is the code that does all of it. I've written comments in the code to explain it as best I can, but you'll need to really understand how recursive functions work to follow it properly.

First, let's look at another function that will be very helpful for getting a URL's parent from the tree:
def get_parent(tree, parent_list):
    """How it works:

    Let's say the `tree` looks like this:

        tree = {
            'root-url': {
                'link-1': {
                    'link-1-a': {...}
                }
            }
        }

    and `parent_list` looks like this:

        parent_list = ['root-url', 'link-1', 'link-1-a']

    This function will chain the values in the list and
    perform a dict lookup like this:

        tree['root-url']['link-1']['link-1-a']
    """
    first, rest = parent_list[0], parent_list[1:]
    try:
        if tree[first] and rest:
            # if tree[first] and rest are both non-empty,
            # run the function recursively
            return get_parent(tree[first], rest)
        else:
            return tree[first]
    except KeyError:
        # this is required for creating the
        # root_url dict in the tree
        # because it doesn't exist yet
        tree[first] = {}
        return tree[first]
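To make the lookup behavior concrete, here's a small usage sketch (the tree and URLs are made up for illustration):

# hypothetical example tree, just for illustration
sample_tree = {
    'root-url': {
        'link-1': {
            'link-1-a': {},
        },
    },
}

parent = get_parent(sample_tree, ['root-url', 'link-1'])
print(parent)  # {'link-1-a': {}}

# a new dict can now be attached under that parent:
parent['link-1-b'] = {}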
The recursive_fetch function will look like this:
url_tree = {}  # dict to store the url tree

def recursive_fetch(fetch_url, parents=None):
    """
    `parents` is a list of parents of the current url
    Example:
        parents = ['root-url', 'link-1', ... 'parent-link']
    """
    parents = parents or []
    parents.append(fetch_url)

    # get page source from fetch_url and make a new soup object
    # (uses the same `requests`/BeautifulSoup assumption
    # as the basic example above)
    page = requests.get(fetch_url)
    soup = BeautifulSoup(page.text, 'html.parser')

    for link in soup.find_all('a'):
        url = link.get('href')

        if parents:
            parent = get_parent(url_tree, parents)
        else:
            parent = None

        if parent is not None:
            # this will run when parent is not None,
            # i.e. even if parent is an empty dict {}
            # create a new dict for the current url
            # inside the parent dict
            parent[url] = {}
        else:
            # this url has no parent,
            # insert it directly in the url_tree
            url_tree[url] = {}

        # now crawl the current url
        recursive_fetch(url, parents)

    # Next is the most important block of code.
    # Whenever one level of recursion completes,
    # it pops the last parent from the
    # `parents` list so that in the
    # next recursion the parents are correct.
    # Without this block, the url_tree wouldn't
    # look as expected.
    # It took me many hours to figure this out.
    try:
        parents.pop(-1)
    except IndexError:
        pass
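Finally, a usage sketch under the same assumptions (root_url is a hypothetical starting point; the json dump is just a convenient way to inspect the nested result):

import json

root_url = 'http://example.com'  # hypothetical starting point
recursive_fetch(root_url)

# inspect the resulting tree of links
print(json.dumps(url_tree, indent=2))

Note that, as written, nothing stops the crawler from revisiting pages it has already seen, so for real sites you'd want to add a visited set or a depth limit on top of this.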