Question

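It's been a while since I've used regular expressions, and I feel like this should be simple to figure out.

I have a web page with a lot of links like the one in string_to_match in the code below. I'd like to grab just the number in each link, for example the "58" in string_to_match. For the life of me, I can't figure it out.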

import re
string_to_match = '<a href="/ncf/teams/roster?teamId=58">Roster</a>'
re.findall('<a href="/ncf/teams/roster?teamId=(/d+)">Roster</a>',string_to_match)

Answer 1

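Instead of matching the whole tag with a regular expression, you can use HTML parsing (with the BeautifulSoup parser) to locate the desired link and extract its href attribute value, then parse the URL. In this case, we'll use a regular expression only for that last step: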

import re
from bs4 import BeautifulSoup

data = """
<body>
    <a href="/ncf/teams/roster?teamId=58">Roster</a>
</body>
"""

soup = BeautifulSoup(data, "html.parser")
link = soup.find("a", text="Roster")["href"]

print(re.search(r"teamId=(\d+)", link).group(1))

Prints 58.
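
The URL parsing step could also be done without any regular expression. Below is a minimal sketch using the standard library's urllib.parse, assuming the href carries teamId as a query parameter. The helper name team_id_from_href is made up for illustration and is not part of the answer above.

```python
from urllib.parse import urlparse, parse_qs


def team_id_from_href(href):
    """Return the teamId query parameter of href as a string, or None.

    Hypothetical helper, named here only for illustration.
    """
    # urlparse splits the URL; .query holds everything after the "?"
    query = urlparse(href).query
    # parse_qs maps each parameter name to a list of its values
    values = parse_qs(query).get("teamId")
    return values[0] if values else None


print(team_id_from_href("/ncf/teams/roster?teamId=58"))
```

This keeps working if more query parameters are added to the link later or if their order changes, which a hand-written pattern may not.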

Answer 2

I'd suggest using BeautifulSoup or lxml; it's worth the learning curve.

...but if you still want to use a regular expression:

re.findall(r'href="[^"]*teamId=(\d+)', string_to_match)
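
For reference, the pattern in the question appears to fail for two reasons: the unescaped ? makes the preceding "r" optional instead of matching a literal question mark, and /d+ matches a slash followed by the letter d rather than digits. Here is a sketch of a corrected version of that full-tag pattern, with a second link (teamId=102) invented purely to show that findall returns every match:

```python
import re

# Two sample links; the second one (teamId=102) is invented for illustration.
html = (
    '<a href="/ncf/teams/roster?teamId=58">Roster</a>'
    '<a href="/ncf/teams/roster?teamId=102">Roster</a>'
)

# \? matches a literal question mark; \d+ matches one or more digits.
# The raw string (r'...') keeps the backslashes intact for the regex engine.
pattern = r'<a href="/ncf/teams/roster\?teamId=(\d+)">Roster</a>'

team_ids = re.findall(pattern, html)
print(team_ids)
```

The shorter href-only pattern above is more forgiving, since it does not depend on the exact link text or path.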

Matching a URL in HTML with a regular expression
