Unexpected BeautifulSoup match

Asked: 2017-09-15 13:32:19

Tags: python beautifulsoup

I have some simple code...

from bs4 import BeautifulSoup, SoupStrainer

text = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<div></div>
<div class='detail'></div>
<div></div>
<div class='detail'></div>
<div></div>"""

for div in BeautifulSoup(text, 'lxml', parse_only = SoupStrainer('div', attrs = { 'class': 'detail' })):
    print(div)

...which I would expect to print the two divs with the 'detail' class. Instead, for some reason, I get the two divs and the doctype:

html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
<div class="detail"></div>
<div class="detail"></div>

What is going on here, and how can I avoid matching the doctype?

Edit

Here is one way I found to filter it out:

from bs4 import BeautifulSoup, SoupStrainer, Doctype
...
for div in BeautifulSoup(text, 'lxml', parse_only = SoupStrainer('div', attrs = { 'class': 'detail' })):
    if type(div) is Doctype:
        continue  # skip the Doctype node

I am still interested in how to avoid having to filter out the doctype at all when using SoupStrainer.

The reason I want to use SoupStrainer instead of find_all is that SoupStrainer is almost twice as fast, which over just 1000 parsed pages adds up to a difference of about 30 seconds:

def soup_strainer(text):
    [div for div in BeautifulSoup(text, 'lxml', parse_only = SoupStrainer('div', attrs = { 'class': 'detail' })) if type(div) is not Doctype]

def find_all(text):
    [div for div in BeautifulSoup(text, 'lxml').find_all('div', { 'class': 'detail' })]

from timeit import timeit    
print( timeit('soup_strainer(text)', number = 1000, globals = globals()) ) # 38.091634516923584
print( timeit('find_all(text)', number = 1000, globals = globals()) ) # 65.1686057066947
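One idea I have not benchmarked yet (the helper name below is just for illustration): keep the SoupStrainer for the parsing speed, but call find_all on the strained soup instead of iterating over it directly. Since find_all only returns matching Tag objects, the doctype should never appear in the result:

from bs4 import BeautifulSoup, SoupStrainer

def strained_find_all(text):
    # parse_only still limits parsing to the matching divs; find_all then
    # returns only Tag objects, so the Doctype is left out without an
    # explicit type check
    return BeautifulSoup(
        text, 'lxml',
        parse_only = SoupStrainer('div', attrs = { 'class': 'detail' })
    ).find_all('div', { 'class': 'detail' })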

1 Answer:

Answer 0 (score: 1)

I don't think you need SoupStrainer for this task. The built-in findAll method should do what you want instead. Here is the code I tested, and it seems to work fine:

[div for div in BeautifulSoup(text, 'lxml').findAll('div', {'class':'detail'})]

This builds the list of div elements you are looking for, with the DOCTYPE excluded.
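For reference, a minimal runnable version of this approach (assuming the text variable from the question; find_all is the bs4-style name for findAll, and class_= is the keyword form of the class filter):

from bs4 import BeautifulSoup

# find_all only ever returns Tag objects, so the DOCTYPE never shows up
for div in BeautifulSoup(text, 'lxml').find_all('div', class_='detail'):
    print(div)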

Hope this helps.