I'm writing a program that stores a large tree of web links in Python using dictionaries. Basically, you start with a root URL and build a dictionary from the URLs found in the root's HTML. In the next step, I want to fetch the page for each of those URLs and collect the links on those pages. In the end, I want one dictionary containing all the links and the relationships between them.
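For example, the kind of nested structure I'm after might look like this (the URLs here are made up):

url_tree = {
    'http://example.com': {
        'http://example.com/a': {
            'http://example.com/a/1': {},
        },
        'http://example.com/b': {},
    },
}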
This is what I have for the first two depth levels:
for link in soup.find_all('a'):
    url = link.get('href')
    url_tree[siteurl][url]

# Get page source
for link in soup.find_all('a'):
    url = link.get('href')
    url_tree[siteurl][secondurl][url]
This system works, but as you can tell, if I want a dictionary N layers deep, that turns into a lot of code blocks. Is there a way to add more layers automatically? Any help is appreciated!
Answer 0 (score: 0)
This can be done with a recursive function.
Here's a basic example that will crawl all the URLs found in a page one by one, then crawl all the URLs found in each of those pages one by one, and so on. It also prints every URL it finds.
# A minimal sketch: fetching the page with `requests` and parsing it
# with BeautifulSoup is an assumption here; swap in your own code.
import requests
from bs4 import BeautifulSoup

def recursive_fetch(url_to_fetch):
    # get page source from url_to_fetch and make a new soup
    page = requests.get(url_to_fetch)
    soup = BeautifulSoup(page.text, 'html.parser')
    for link in soup.find_all('a'):
        url = link.get('href')
        print(url)
        # run the function recursively
        # for the current url
        recursive_fetch(url)

# Usage
recursive_fetch(root_url)
Since you want a dictionary tree of all the URLs found, the code above isn't much help, but it's a start.

This is where it gets really complex, because now you also need to keep track of the parent of the current URL being crawled, that URL's parent, that parent's parent, and so on.

See what I mean? It gets very complex, very fast. Below is the code that does all of it. I've written comments in the code to explain it as best I can, but you'll need to really understand how recursive functions work to follow it properly.

First, let's look at another function that will be very helpful for getting a URL's parent from the tree:
def get_parent(tree, parent_list):
    """How it works:

    Let's say the `tree` looks like this:

        tree = {
            'root-url': {
                'link-1': {
                    'link-1-a': {...}
                }
            }
        }

    and `parent_list` looks like this:

        parent_list = ['root-url', 'link-1', 'link-1-a']

    This function will chain the values in the list and
    perform a dict lookup like this:

        tree['root-url']['link-1']['link-1-a']
    """
    first, rest = parent_list[0], parent_list[1:]
    try:
        if tree[first] and rest:
            # if tree[first] and rest are both non-empty,
            # run the function recursively
            return get_parent(tree[first], rest)
        else:
            return tree[first]
    except KeyError:
        # this is required for creating the
        # root_url dict in the tree
        # because it doesn't exist yet
        tree[first] = {}
        return tree[first]
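To make the lookup behavior concrete, here's a small usage sketch (the tree and URLs are made up for illustration):

# hypothetical example tree, just for illustration
sample_tree = {
    'root-url': {
        'link-1': {
            'link-1-a': {},
        },
    },
}

parent = get_parent(sample_tree, ['root-url', 'link-1'])
print(parent)  # {'link-1-a': {}}

# a new dict can now be attached under that parent:
parent['link-1-b'] = {}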
The recursive_fetch function will look like this:
url_tree = {}  # dict to store the url tree

def recursive_fetch(fetch_url, parents=None):
    """
    `parents` is a list of parents of the current url
    Example:
        parents = ['root-url', 'link-1', ... 'parent-link']
    """
    parents = parents or []
    parents.append(fetch_url)

    # get page source from fetch_url and make a new soup object
    # (uses the same `requests`/BeautifulSoup assumption
    # as the basic example above)
    page = requests.get(fetch_url)
    soup = BeautifulSoup(page.text, 'html.parser')

    for link in soup.find_all('a'):
        url = link.get('href')

        if parents:
            parent = get_parent(url_tree, parents)
        else:
            parent = None

        if parent is not None:
            # this will run when parent is not None,
            # i.e. even if parent is an empty dict {}
            # create a new dict for the current url
            # inside the parent dict
            parent[url] = {}
        else:
            # this url has no parent,
            # insert it directly in the url_tree
            url_tree[url] = {}

        # now crawl the current url
        recursive_fetch(url, parents)

    # Next is the most important block of code.
    # Whenever one level of recursion completes,
    # it pops the last parent from the
    # `parents` list so that in the
    # next recursion the parents are correct.
    # Without this block, the url_tree wouldn't
    # look as expected.
    # It took me many hours to figure this out.
    try:
        parents.pop(-1)
    except IndexError:
        pass
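Finally, a usage sketch under the same assumptions (root_url is a hypothetical starting point; the json dump is just a convenient way to inspect the nested result):

import json

root_url = 'http://example.com'  # hypothetical starting point
recursive_fetch(root_url)

# inspect the resulting tree of links
print(json.dumps(url_tree, indent=2))

Note that, as written, nothing stops the crawler from revisiting pages it has already seen, so for real sites you'd want to add a visited set or a depth limit on top of this.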