BeautifulSoup4 performance

Time: 2016-03-14 12:43:29

Tags: python beautifulsoup lxml bs4

Here is a simple piece of code that looks up elements by a specific ID. As an example, I took a random Wikipedia article.

Code to test:

# coding: utf8

from bs4 import BeautifulSoup, Tag
import requests
import time
import sys

TAG_NAME = "li"


def find_with_index(index, id):
    if id and id in index:
        return index[id]
    return None

page_text = requests.get("https://en.wikipedia.org/wiki/United_States").text
page = BeautifulSoup(page_text, 'lxml')
if page:
    print("Page was downloaded and parsed")
else:
    print("Something wrong.")
    sys.exit()

all_ids = set()
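# Collect the ids of all TAG_NAME elements in one pass over the tree.
# `descendants` replaces the deprecated recursiveChildGenerator().
for child in page.descendants:
    if isinstance(child, Tag) and child.has_attr("id") and child.name == TAG_NAME:
        all_ids.add(child.attrs["id"])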

print(str(len(all_ids)) + " ids in total")

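# time.clock() was removed in Python 3.8; time.perf_counter() is its replacement.
bs_find_start = time.perf_counter()
[page.find(TAG_NAME, {"id": id}) for id in all_ids]
bs_find_end = time.perf_counter()

# The indexed timing includes the cost of building the dictionary.
index_find_start = time.perf_counter()
simple_index = {li.attrs["id"]: li for li in page.find_all(TAG_NAME) if li.has_attr("id")}
[find_with_index(simple_index, id) for id in all_ids]
index_find_end = time.perf_counter()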

print("Spent on bs.find: " + str(bs_find_end - bs_find_start))
print("Spent on indexed find: " + str(index_find_end - index_find_start))

I got this output (times in seconds):

Spent on bs.find: 122.81345616345673
Spent on indexed find: 0.027779648046461602

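Here is the question: in terms of performance, this is an absolute disaster. Does it mean that BS keeps no index internally and walks the whole DOM tree again and again for every lookup, whatever I need to find? Or do I not fully understand how to perform lookups efficiently? With many lookups (100+) this can become a serious bottleneck, and I can't say the cause is easy to spot.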
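
For comparison, here is a minimal sketch of the "build the index once, then look up" approach that the timings above favor. It rests on the assumption, which matches the measurements but is not confirmed anywhere in this post, that find() scans the tree on every call. The sample markup and the helper name build_id_index are hypothetical, and the stdlib "html.parser" backend is used only so the sketch runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page.
SAMPLE_HTML = """
<ul>
  <li id="first">one</li>
  <li>no id here</li>
  <li id="second">two</li>
</ul>
"""


def build_id_index(soup, tag_name):
    """Walk the tree once and map each id to the first tag that carries it."""
    index = {}
    # id=True matches any tag_name element that has an id attribute at all.
    for tag in soup.find_all(tag_name, id=True):
        # setdefault keeps the first occurrence, like find() would return.
        index.setdefault(tag["id"], tag)
    return index


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
li_by_id = build_id_index(soup, "li")

# Each lookup is now a dict access instead of a tree traversal.
print(li_by_id["second"].get_text())
print(li_by_id.get("missing"))
```

The index is only valid while the tree is unchanged; if elements are added, removed, or given new ids, it has to be rebuilt.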

0 answers:

No answers yet