Question

I am trying to find a relative (not absolute) XPath that lets me import the first table after the text 'SPLIT TIMES'. Here is my code:

from lxml import html
import requests

ResultsPage = requests.get('https://www.iaaf.org/competitions/iaaf-world-championships/iaaf-world-championships-london-2017-5151/results/men/10000-metres/final/result')
ResultsTree = html.fromstring(ResultsPage.content)
ResultsTable = ResultsTree.xpath(("""//*[text()[contains(normalize-space(), "SPLIT TIMES")]]"""))

print(ResultsTable)

I am looking for an XPath that homes in on the 'SPLIT TIMES' table found at https://www.iaaf.org/competitions/iaaf-world-championships/iaaf-world-championships-london-2017-5151/results/men/10000-metres/final/result.

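I would appreciate it if the XPath could be as versatile as possible. For example, the requirement might change so that I need the first table after text containing '10,000 METERS MEN' (same URL as above). Or I might need the first table after text containing 'MEDAL TABLE' (a different URL): https://www.iaaf.org/competitions/iaaf-world-championships/iaaf-world-championships-london-2017-5151/medaltable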

Answer 1

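Your code fails because the site you are trying to scrape uses protection that rejects the request (the User-Agent header is missing, as mentioned in another answer):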

The request could not be satisfied. Request blocked. Generated by cloudfront (CloudFront)

I was able to get around this by using the cloudflare-scrape library.

You can install it with pip:

pip install cfscrape
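
If the block really is caused only by the missing User-Agent header, plain requests with a browser-like header may be enough. The sketch below is an assumption, not something verified against the live site: the header value and the make_session helper are hypothetical names chosen for illustration.

```python
import requests

# Hypothetical browser-like value; any realistic User-Agent string could be used instead.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    )
}


def make_session():
    """Return a requests session that sends HEADERS with every request."""
    session = requests.Session()
    session.headers.update(HEADERS)
    return session


# Usage (needs network access, so it is left commented out):
# page = make_session().get(url)
# tree = html.fromstring(page.content)
```

A session is used so the header is set once and reused for every page that is fetched.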

Here is working XPath code that does what you want. The trick is to use the "following" axis, as described in the documentation: https://www.w3.org/TR/xpath/#axes.

import cfscrape
from lxml import html

scraper = cfscrape.create_scraper()
page = scraper.get('https://www.iaaf.org/competitions/iaaf-world-championships/iaaf-world-championships-london-2017-5151/results/men/10000-metres/final/result')
tree = html.fromstring(page.content)
table = tree.xpath(".//h2[contains(text(), 'Split times')][1]/following::table[1]")
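
To try the same expression without network access, here is a small offline sketch. The HTML snippet is made up and only mimics the assumed heading-then-table layout; the ids and cell values are placeholders, not data from the site.

```python
from lxml import html

# Made-up markup: a heading followed (not immediately) by the table we want.
SAMPLE = """
<div>
  <table id="results"><tr><td>1</td><td>Runner A</td></tr></table>
  <h2>Split times</h2>
  <p>intro text</p>
  <table id="splits">
    <tr><th>Split</th><th>Time</th></tr>
    <tr><td>1000m</td><td>0:00.00</td></tr>
  </table>
  <table id="other"><tr><td>x</td></tr></table>
</div>
"""

tree = html.fromstring(SAMPLE)

# following::table[1] picks the nearest table that starts after the heading
# in document order, skipping the <p> in between.
tables = tree.xpath(".//h2[contains(text(), 'Split times')][1]/following::table[1]")
first_table = tables[0]

# Turn the table into a list of rows, each a list of cell strings.
rows = [
    [cell.text_content().strip() for cell in tr.xpath("./th | ./td")]
    for tr in first_table.xpath(".//tr")
]
print(first_table.get("id"), rows)
```

Note that xpath() always returns a list, so the table element itself is tables[0].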

Answer 2

You can use the following axis in XPath, as shown below.

relative_string = "Split times"

ResultsTable = ResultsTree.xpath("//*[text()[contains(normalize-space(), '"+relative_string+"')]]/following::table")
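
The expression above returns every table after the matching text, and it is case-sensitive. A possible way to wrap it in a reusable helper is sketched below; first_table_after is a hypothetical name. It assumes ASCII-only case folding via translate() and passes the search text as an XPath variable (supported by lxml through keyword arguments) instead of concatenating strings.

```python
from lxml import html

UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
LOWER = "abcdefghijklmnopqrstuvwxyz"


def first_table_after(tree, text):
    """Return the first <table> after the first element whose own text
    contains `text` (ASCII case-insensitive), or None if nothing matches."""
    expression = (
        # First element, in document order, with a matching text node...
        "(//*[text()[contains("
        "translate(normalize-space(.), $upper, $lower), $needle)]])[1]"
        # ...then the nearest table that starts after it.
        "/following::table[1]"
    )
    found = tree.xpath(expression, upper=UPPER, lower=LOWER, needle=text.lower())
    return found[0] if found else None


# Usage with the tree from the question:
# splits = first_table_after(ResultsTree, "SPLIT TIMES")
# medals = first_table_after(MedalTree, "MEDAL TABLE")  # MedalTree: hypothetical tree of the medal page
```

Only the text argument changes between the 'SPLIT TIMES', '10,000 METERS MEN' and 'MEDAL TABLE' cases, which keeps the XPath itself untouched.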
