Question

I'm using lxml.html to scrape HTML documents. There is one thing I can do in BeautifulSoup but can't work out how to do with lxml.html. Here it is:

from bs4 import BeautifulSoup
import re

doc = ['<html>',
'<h2> some text </h2>',
'<p> some more text </p>',
'<table> <tr> <td> A table</td> </tr> </table>',
'<h2> some special text </h2>',
'<p> some more text </p>',
'<table> <tr> <td> The table I want </td> </tr> </table>',
'</html>']
soup = BeautifulSoup(''.join(doc), 'html.parser')
# find the text node containing "special", then the first table after it
print(soup.find(string=re.compile("special")).find_next('table'))

I tried this with cssselect, but without success. Any ideas on how to find it using the methods in lxml.html?

Many thanks, d

Answer 1

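You can use regular expressions in lxml XPath through the EXSLT syntax. For example, given your document, the expressions below match the element whose text matches the regular expression spe.*al and then select the sibling nodes that follow it: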

import lxml.html

# doc is the list of HTML fragments from the question
DOC = ''.join(doc)

NS = 'http://exslt.org/regular-expressions'
tree = lxml.html.fromstring(DOC)

# select sibling table nodes after matching node
path = "//*[re:test(text(), 'spe.*al')]/following-sibling::table"
print(tree.xpath(path, namespaces={'re': NS}))

# select all sibling nodes after matching node
path = "//*[re:test(text(), 'spe.*al')]/following-sibling::*"
print(tree.xpath(path, namespaces={'re': NS}))

Output:

[<Element table at 7fe21acd3f58>]
[<Element p at 7f76ac2c3f58>, <Element table at 7f76ac2e6050>]
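
If you want something closer to BeautifulSoup's findNext, which looks at everything after the match in document order rather than only at siblings, the following:: axis may be a better fit than following-sibling::. Below is a minimal, self-contained sketch of that variation. The helper name find_table_after and the constant RE_NS are made up for illustration, and passing the pattern as an XPath variable ($pattern) is my choice here rather than part of the answer above.

```python
import lxml.html

# The HTML from the question, joined into a single string.
DOC = ''.join([
    '<html>',
    '<h2> some text </h2>',
    '<p> some more text </p>',
    '<table> <tr> <td> A table</td> </tr> </table>',
    '<h2> some special text </h2>',
    '<p> some more text </p>',
    '<table> <tr> <td> The table I want </td> </tr> </table>',
    '</html>',
])

# EXSLT regular-expressions namespace, bound to the "re" prefix below.
RE_NS = 'http://exslt.org/regular-expressions'


def find_table_after(html, pattern):
    """Return the first <table> after an element whose text matches pattern.

    Hypothetical helper: returns None when nothing matches.
    """
    tree = lxml.html.fromstring(html)
    # following::table[1] is the first table after the matched element in
    # document order, so it does not have to be a direct sibling.
    # $pattern is an XPath variable filled in from the keyword argument.
    matches = tree.xpath(
        "//*[re:test(text(), $pattern)]/following::table[1]",
        namespaces={'re': RE_NS},
        pattern=pattern,
    )
    return matches[0] if matches else None


table = find_table_after(DOC, 'spe.*al')
if table is not None:
    print(table.text_content().strip())
```

Passing the pattern as a variable avoids having to escape quotes inside the XPath string. If you only ever need siblings, swapping following:: back to following-sibling:: gives the behaviour shown in the answer.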

Locating elements with lxml.html vs BeautifulSoup
