这是一段简单的代码,用于查找具有特定ID的元素。作为例子,我接受了随机的Wiki文章。
要测试的代码:
# coding: utf8
from bs4 import BeautifulSoup, Tag
import requests
import time
import sys
TAG_NAME = "li"
def find_with_index(index, id):
if id and id in index:
return index[id]
return None
page_text = requests.get("https://en.wikipedia.org/wiki/United_States").text
page = BeautifulSoup(page_text, 'lxml')
if page:
print("Page was downloaded and parsed")
else:
print("Something wrong.")
sys.exit()
all_ids = set()
for child in page.recursiveChildGenerator():
if type(child) is Tag and child.has_attr("id") and child.name == TAG_NAME:
all_ids.add(child.attrs["id"])
print(str(len(all_ids)) + " ids in total")
bs_find_start = time.clock()
[page.find(TAG_NAME, {"id": id}) for id in all_ids]
bs_find_end = time.clock()
index_find_start = time.clock()
simple_index = {li.attrs["id"]: li for li in page.find_all(TAG_NAME) if li.has_attr("id")}
[find_with_index(simple_index, id) for id in all_ids]
index_find_end = time.clock()
print("Spent on bs.find: " + str(bs_find_end - bs_find_start))
print("Spent on indexed find: " + str(index_find_end - index_find_start))
我有这个输出:
Spent on bs.find: 122.81345616345673
Spent on indexed find: 0.027779648046461602
以下是问题: 这在性能方面是绝对的灾难。这是否意味着BS内部没有任何索引,并且无论我需要找到什么,都会一遍又一遍地遍历整个DOM树来执行查找操作?或者我不完全了解如何有效地执行查找操作?当有很多查找操作(100+)时,这可能是一个严重的瓶颈,我不能说找到问题是非常明显的。