Question

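I am trying to search an HTML page in Python.

I need to find certain content inside the page and return it only when it matches.

My links can look like #one, and I tried to pull them out with a small regular expression.

#two

The problem is that it matches everything inside, and I can't work out how to search only for what is inside the href attribute.

I hope you understand, and I hope you can help me solve this.

An example of what I want to find: href="http://example.com/page/subpage/unik-id-12345"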

Answer 1

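import re

# Sample tag fragment containing the link.
s = 'href="http://example.com/page/subpage/unik-id-12345">'
# The non-greedy group captures everything between href=" and the closing ">.
res = re.search(r'href="(.+?)">', s).group(1)
print(res)
# Output: http://example.com/page/subpage/unik-id-12345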

By the way, it is better to use a dedicated library, such as lxml, for parsing HTML.

Answer 2

import re
# Non-greedy (.*?) stops at the first "> instead of running on to the last one in the input.
regex = re.compile('<href="(.*?)">')
url = '<href="https://stackoverflow.com/">'
m = regex.search(url)

You can then read the groups:

>>> m.group(0)
'<href="https://stackoverflow.com/">'
>>> m.group(1)
'https://stackoverflow.com/'

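PS: If you are trying to do web scraping, it will be easier to use a library designed for that, such as BeautifulSoup. You can easily find tutorials online on how to use it.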

Answer 3

Do you know regex101.com? It is a great tool for tuning your regular expressions.

If I understand your question correctly, you are matching href="http://example.com/page/subpage/unik-id-12345"> but you only want http://example.com/page/subpage/unik-id-12345

One approach is to grab only http(s)://, followed by anything that is not a quote: http(s?):\/\/[^"]*
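A minimal sketch of that pattern in Python, assuming a made-up HTML snippet with two links. The names `html` and `url_pattern` are only for illustration, and the pattern is written as `https?` without a capturing group so that `re.findall` returns whole URLs.

```python
import re

# Hypothetical sample markup, not taken from the asker's page.
html = (
    '<a href="http://example.com/page/subpage/unik-id-12345">one</a> '
    '<a href="https://example.com/other">two</a>'
)

# "http" or "https", then "://", then everything up to the next double quote.
url_pattern = re.compile(r'https?://[^"]*')

urls = url_pattern.findall(html)
print(urls)
# ['http://example.com/page/subpage/unik-id-12345', 'https://example.com/other']
```

Note that with the capturing form `http(s?)`, `re.findall` would return only the contents of the group (an empty string or `s`) instead of the URL, so `re.search(...).group(0)` or `re.finditer` would be needed there. This pattern also matches any quoted URL, not only those in href attributes.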

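If you have several links and only want the ones inside href attributes, you can probably keep your own regular expression and then apply one more operation to extract just the URL (for example match.split("\"")[1]).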
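The two-step idea could look roughly like this. It is a sketch under assumptions: the sample markup is invented, and the pattern `href="[^"]*"` stands in for the asker's own regular expression, which is not shown in the question.

```python
import re

# Hypothetical sample: one link in an href, one URL in a src attribute.
html = (
    '<a href="http://example.com/page/subpage/unik-id-12345">one</a> '
    '<img src="http://example.com/logo.png">'
)

# Step 1: match each complete href="..." attribute.
href_attrs = re.findall(r'href="[^"]*"', html)

# Step 2: splitting on the double quote gives ['href=', URL, ''],
# so index 1 is the URL itself.
links = [attr.split('"')[1] for attr in href_attrs]
print(links)
# ['http://example.com/page/subpage/unik-id-12345']
```

The src URL is skipped because only href attributes are matched in the first step.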

Or you can use an HTML parser such as BeautifulSoup.
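For comparison, here is a sketch of the parser route. BeautifulSoup and lxml are third-party packages, so this example swaps in the standard library's `html.parser` module instead; the class name `HrefCollector` and the sample markup are invented for illustration.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


# Hypothetical sample markup.
html = '<p>See <a href="http://example.com/page/subpage/unik-id-12345">this page</a>.</p>'

collector = HrefCollector()
collector.feed(html)
print(collector.hrefs)
# ['http://example.com/page/subpage/unik-id-12345']
```

A parser handles attribute order, single quotes, and extra whitespace in tags, which the regular expressions above do not.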

Regex question: need to match part of a URL
