Question

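I am trying to get links from a page with XPath. The problem is that I only want the links inside a table, but if I apply the XPath expression to the whole page, I also capture links I don't want.

For example: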

tree = lxml.html.parse(some_response)
links = tree.xpath("//a[contains(@href, 'http://www.example.com/filter/')]")

The problem is that this applies the expression to the whole document. I located the element I want, for example:

tree = lxml.html.parse(some_response)
root = tree.getroot()
table = root[1][5]  # for example
links = table.xpath("//a[contains(@href, 'http://www.example.com/filter/')]")

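But this also seems to run the query against the whole document, because I am still capturing links outside the table. This page says: "When xpath() is used on an Element, the XPath expression is evaluated against the element (if relative) or against the root tree (if absolute):". So I am using an absolute expression and need to make it relative? Is that right?

Basically, how can I filter only the elements that exist inside this table?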

Answer 1

Your XPath starts with a slash (/), so it is absolute. Add a dot (.) in front to make it relative to the current element, i.e.

links = table.xpath(".//a[contains(@href, 'http://www.example.com/filter/')]")
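To see the difference side by side, here is a minimal sketch. The HTML snippet and the variable names (HTML, expr, absolute_hrefs, relative_hrefs) are made up for illustration and are not from the question; it assumes lxml is installed.

```python
import lxml.html

# Hypothetical sample page: one matching link outside the table, one inside it.
HTML = """
<html><body>
  <a href="http://www.example.com/filter/outside">outside</a>
  <table id="my_table_id">
    <tr><td><a href="http://www.example.com/filter/inside">inside</a></td></tr>
  </table>
</body></html>
"""

root = lxml.html.fromstring(HTML)
table = root.xpath("//table")[0]

expr = "a[contains(@href, 'http://www.example.com/filter/')]"

# Absolute: "//" starts from the document root, even when called on `table`.
absolute_hrefs = [a.get("href") for a in table.xpath("//" + expr)]

# Relative: ".//" searches only the descendants of `table`.
relative_hrefs = [a.get("href") for a in table.xpath(".//" + expr)]

print(absolute_hrefs)  # both links
print(relative_hrefs)  # only the link inside the table
```

The only change between the two queries is the leading dot, which is what anchors the search at the element the method is called on.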

Answer 2

Another option is to ask directly for the elements inside the table. For example:

tree = lxml.html.parse(some_response)
links = tree.xpath("//table[**criteria**]//a[contains(@href, 'http://www.example.com/filter/')]")

The **criteria** part is needed if the page contains many tables. Some possible criteria are filtering by the table's id or class. For example:

links = tree.xpath("//table[@id='my_table_id']//a[contains(@href, 'http://www.example.com/filter/')]")
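A small self-contained sketch of this approach, covering both an id and a class criterion. The sample HTML, the "other_table" id, and the "links" class name are hypothetical and only serve the illustration; it assumes lxml is installed.

```python
import lxml.html

# Hypothetical page with two tables; only the second one is wanted.
HTML = """
<html><body>
  <table id="other_table">
    <tr><td><a href="http://www.example.com/filter/a">a</a></td></tr>
  </table>
  <table id="my_table_id" class="data links">
    <tr>
      <td><a href="http://www.example.com/filter/b">b</a></td>
      <td><a href="http://www.example.com/other/c">c</a></td>
    </tr>
  </table>
</body></html>
"""

tree = lxml.html.fromstring(HTML)
link_filter = "a[contains(@href, 'http://www.example.com/filter/')]"

# Criterion 1: select the table by its id.
by_id = tree.xpath("//table[@id='my_table_id']//" + link_filter)

# Criterion 2: select the table by one of its classes. Padding the class
# attribute with spaces avoids matching names such as "links-old".
by_class = tree.xpath(
    "//table[contains(concat(' ', normalize-space(@class), ' '), ' links ')]//"
    + link_filter
)

hrefs_by_id = [a.get("href") for a in by_id]
hrefs_by_class = [a.get("href") for a in by_class]
print(hrefs_by_id)
print(hrefs_by_class)
```

Both queries skip the link in the first table and the non-matching link in the second, so no separate step to locate the table element is needed.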

Python: using XPath locally on a specific element
