Question

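I'm trying to get all the links from a website using XPath. The URL format is very specific but dynamic.

The URL format I want is "/static_word/random-string-with-dashes/random_number" (3 segments: the first is static, the second is a random string, the third is a random number). Can you help me accomplish this?

I tried to do it with a regular expression, but it didn't work.

Here is my code: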

from lxml import html
import ssl
import requests
ssl._create_default_https_context = ssl._create_unverified_context
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
myRequest = requests.get("https://somesecureurl.com/", headers=headers)
webpage = html.fromstring(myRequest.content)
theLinks = webpage.xpath("//a[contains(@href,'^/static_word/[A-Za-z0-9_-]/[0-9]$')]")

print(theLinks)

Answer 1

You can use matches() to test the string against a regular expression:

//a[matches(@href,'^/static_word/[A-Za-z0-9_-]+/[0-9]+$')]

However, matches() is an XPath 2.0 function, and AFAIK lxml doesn't support XPath 2.0 features.
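As a side note, lxml does expose EXSLT regular expressions inside XPath 1.0 through the `http://exslt.org/regular-expressions` namespace, so a regex test is still possible without matches(). The following is only a minimal sketch: the sample HTML and the variable names are made up for illustration, and the pattern is the one from the matches() expression above.

```python
from lxml import html

# EXSLT regular-expressions namespace, which lxml supports in XPath 1.0.
EXSLT_RE = "http://exslt.org/regular-expressions"

# Hypothetical sample page, used only to illustrate the expression.
sample = """
<html><body>
<a href="/static_word/some-random-string/12345">match</a>
<a href="/static_word/some-random-string/abc">last segment is not a number</a>
<a href="/other/some-random-string/12345">wrong prefix</a>
<a href="/static_word/a/b/1">too many segments</a>
</body></html>
"""

doc = html.fromstring(sample)

# re:test() returns true when the pattern matches the attribute value.
links = doc.xpath(
    r"//a[re:test(@href, '^/static_word/[A-Za-z0-9_-]+/[0-9]+$')]",
    namespaces={"re": EXSLT_RE},
)

hrefs = [a.get("href") for a in links]
print(hrefs)
```

Note that the `re` prefix has to be bound through the `namespaces` argument, otherwise lxml cannot resolve `re:test`.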

In plain XPath 1.0, you can try this:

//a[starts-with(@href, '/static_word/') and 
    (string-length(@href)-string-length(translate(@href, '/', '')))=3 and
    number(substring-after(substring-after(@href, '/static_word/'), '/'))>=0]

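The predicates above check the following:

starts-with(@href, "/static_word/") - the @href of the a node starts with the substring '/static_word/'
(string-length(@href)-string-length(translate(@href, '/', '')))=3 - @href contains exactly 3 slashes
number(substring-after(substring-after(@href, '/static_word/'), '/'))>=0 - the last segment converts to a non-negative number (note that this also accepts decimals such as 1.5)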
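To see how the three predicates behave together, here is a small sketch that runs the expression through lxml. The sample HTML and the names `sample` and `XPATH_1_0` are assumptions made for the demo and are not part of the original question.

```python
from lxml import html

# Hypothetical sample page covering the interesting cases.
sample = """
<html><body>
<a href="/static_word/some-random-string/12345">match</a>
<a href="/static_word/some-random-string/abc">last segment is not a number</a>
<a href="/other/some-random-string/12345">wrong prefix</a>
<a href="/static_word/a/b/1">four slashes</a>
<a href="/static_word/some-random-string/1.5">decimal last segment</a>
</body></html>
"""

# The XPath 1.0 expression from the answer, split across lines for readability.
XPATH_1_0 = (
    "//a[starts-with(@href, '/static_word/') and "
    "(string-length(@href) - string-length(translate(@href, '/', ''))) = 3 and "
    "number(substring-after(substring-after(@href, '/static_word/'), '/')) >= 0]"
)

doc = html.fromstring(sample)
hrefs = [a.get("href") for a in doc.xpath(XPATH_1_0)]
print(hrefs)
```

The "abc" link is rejected because number('abc') is NaN and NaN >= 0 is false. The "1.5" link is accepted, which shows that the number() check is looser than the [0-9]+ regex.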

This looks awful, but it should work :)
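If the XPath gets too unwieldy, another option is to select every a node that has an href and do the filtering in Python with the re module. This is only a sketch of that alternative, not something taken from the original answer; `HREF_PATTERN` and `matching_links` are hypothetical names.

```python
import re
from lxml import html

# Three segments: static word, dashed/underscored string, digits only.
HREF_PATTERN = re.compile(r"/static_word/[A-Za-z0-9_-]+/[0-9]+")


def matching_links(page_source):
    """Return the <a> elements whose href fully matches HREF_PATTERN."""
    doc = html.fromstring(page_source)
    return [
        a for a in doc.xpath("//a[@href]")
        if HREF_PATTERN.fullmatch(a.get("href"))
    ]


# Hypothetical sample page, used only to illustrate the helper.
sample = """
<html><body>
<a href="/static_word/some-random-string/12345">match</a>
<a href="/static_word/some-random-string/1.5">decimal last segment</a>
<a href="/static_word/a/b/1">too many segments</a>
</body></html>
"""

found = [a.get("href") for a in matching_links(sample)]
print(found)
```

Using fullmatch() means the pattern needs no ^ and $ anchors, and the strict [0-9]+ segment rejects decimals that the number() check would let through.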
