How do I iterate over all tags that have a specific attribute with a specific value? For example, suppose we only need data1, data2, etc.
<html>
<body>
<invalid html here/>
<dont care> ... </dont care>
<invalid html here too/>
<interesting attrib1="naah, it is not this"> ... </interesting tag>
<interesting attrib1="yes, this is what we want">
<group>
<line>
data
</line>
</group>
<group>
<line>
data1
<line>
</group>
<group>
<line>
data2
<line>
</group>
</interesting>
</body>
</html>
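I tried BeautifulSoup, but it could not parse the file. lxml's parser, however, seems to work:
from io import StringIO
from lxml import etree

# get_sanitized_data and SITE are my own helper and constant (not shown here)
broken_html = get_sanitized_data(SITE)
parser = etree.HTMLParser()
tree = etree.parse(StringIO(broken_html), parser)
result = etree.tostring(tree.getroot(), pretty_print=True, method="html")
print(result)
I'm not familiar with its API, and I can't figure out how to use getiterator or xpath.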
Answer 0 (score: 3)
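Here is one way, using lxml and the XPath 'descendant::*[@attrib1="yes, this is what we want"]'. The XPath tells lxml to look through all descendants of the current node and return those whose attrib1 attribute equals "yes, this is what we want".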
import io
import lxml.html as lh

content='''
<html>
<body>
<invalid html here/>
<dont care> ... </dont care>
<invalid html here too/>
<interesting attrib1="naah, it is not this"> ... </interesting tag>
<interesting attrib1="yes, this is what we want">
<group>
<line>
data
</line>
</group>
<group>
<line>
data1
<line>
</group>
<group>
<line>
data2
<line>
</group>
</interesting>
</body>
</html>
'''
# io.StringIO replaces the Python 2-only cStringIO module
doc=lh.parse(io.StringIO(content))
# select every descendant whose attrib1 attribute has the wanted value
tags=doc.xpath('descendant::*[@attrib1="yes, this is what we want"]')
print(tags)
# [<Element interesting at b767e14c>]
for tag in tags:
    # encoding="unicode" returns str rather than bytes on Python 3
    print(lh.tostring(tag, encoding="unicode"))
# <interesting attrib1="yes, this is what we want"><group><line>
# data
# </line></group><group><line>
# data1
# <line></line></line></group><group><line>
# data2
# <line></line></line></group></interesting>
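The question also mentions getiterator; iter() is the current name for it. Below is a minimal sketch of that route, not the method used above: it walks the tree with iter(), keeps elements whose attribute matches, and collects the non-empty text of their line descendants. The names line_texts and SAMPLE are hypothetical, and SAMPLE is a simplified, well-formed stand-in for the page, parsed with the standard library's xml.etree.ElementTree only so that the sketch has no dependencies.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page in the question.
SAMPLE = """
<html><body>
<interesting attrib1="naah, it is not this"><group><line>skip me</line></group></interesting>
<interesting attrib1="yes, this is what we want">
<group><line>data1</line></group>
<group><line>data2</line></group>
</interesting>
</body></html>
"""


def line_texts(root, attr, value, child_tag="line"):
    """Yield the stripped text of child_tag descendants of every
    element under root whose attribute attr equals value."""
    for elem in root.iter():
        # Skip elements that lack the attribute or carry another value.
        if elem.get(attr) != value:
            continue
        for child in elem.iter(child_tag):
            text = (child.text or "").strip()
            # Unclosed tags in broken HTML can leave empty elements behind.
            if text:
                yield text


root = ET.fromstring(SAMPLE)
found = list(line_texts(root, "attrib1", "yes, this is what we want"))
print(found)
```

Since lxml elements also provide iter(), get() and text, the same function should accept the root that lxml's HTML parser returns for the real page, e.g. tree.getroot() in the question's code.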