I have some simple code...
from bs4 import BeautifulSoup, SoupStrainer
text = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<div></div>
<div class='detail'></div>
<div></div>
<div class='detail'></div>
<div></div>"""
for div in BeautifulSoup(text, 'lxml', parse_only=SoupStrainer('div', attrs={'class': 'detail'})):
    print(div)
...and I'd expect it to print the two divs with the 'detail' class. Instead, for some reason, I get the two divs plus the doctype:
html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
<div class="detail"></div>
<div class="detail"></div>
What is going on here, and how can I avoid matching the doctype?
Edit
Here is one way I found to filter it out:
from bs4 import BeautifulSoup, SoupStrainer, Doctype
...
for div in BeautifulSoup(text, 'lxml', parse_only=SoupStrainer('div', attrs={'class': 'detail'})):
    if type(div) is Doctype:
        continue
I'm still interested in learning how to avoid having to filter out the doctype when using SoupStrainer. The reason I want to use SoupStrainer instead of find_all is that it is almost twice as fast, which already adds up to a difference of about 30 seconds over just 1000 parsed pages:
def soup_strainer(text):
    [div for div in BeautifulSoup(text, 'lxml', parse_only=SoupStrainer('div', attrs={'class': 'detail'})) if type(div) is not Doctype]

def find_all(text):
    [div for div in BeautifulSoup(text, 'lxml').find_all('div', {'class': 'detail'})]

from timeit import timeit

print(timeit('soup_strainer(text)', number=1000, globals=globals()))  # 38.091634516923584
print(timeit('find_all(text)', number=1000, globals=globals()))       # 65.1686057066947
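One middle ground worth noting (an assumption on my part, not something from the original post): you can keep parse_only for the speed benefit and then call find_all on the strained soup. find_all only ever returns Tag objects, so the Doctype node never shows up and no per-item type check is needed. A minimal sketch, using the stdlib 'html.parser' here for portability (the same approach should apply with 'lxml'):

```python
from bs4 import BeautifulSoup, SoupStrainer

html = "<!DOCTYPE html><div></div><div class='detail'></div><div class='detail'></div>"
only_detail = SoupStrainer('div', attrs={'class': 'detail'})

# parse_only still limits what gets built into the tree; find_all then
# returns only Tag objects, so the Doctype is excluded automatically
divs = BeautifulSoup(html, 'html.parser', parse_only=only_detail).find_all('div')
```

Here `divs` contains just the two matching div tags, with no Doctype entry to skip.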
Answer 0 (score: 1)
I don't think you need to use SoupStrainer for this task. Instead, the built-in findAll method should do what you want. Here is the code I tested, and it seems to work fine:
[div for div in BeautifulSoup(text, 'lxml').findAll('div', {'class':'detail'})]
This creates the list of div elements you're looking for, excluding the DOCTYPE. Hope this helps.
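A side note on naming (not part of the original answer): findAll is the legacy camelCase alias kept around for BeautifulSoup 3 compatibility; in bs4 the same call is usually written with find_all, optionally using the class_ shortcut for matching the HTML class attribute:

```python
from bs4 import BeautifulSoup

html = "<!DOCTYPE html><div></div><div class='detail'></div><div class='detail'></div>"

# class_ is bs4's keyword argument for the HTML class attribute
# (plain 'class' would clash with the Python keyword)
divs = BeautifulSoup(html, 'html.parser').find_all('div', class_='detail')
```

Both spellings return the same list; find_all is simply the preferred modern name.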