Question

I want to use a Python regular expression to extract the substring between two different characters, > and <.

Here is my sample string:

<h4 id="Foobar:">Foobar:</h4>
<h1 id="Monty">Python<a href="https://..."></a></h1>

My current regular expression is \>(.*)\< and it matches:

Foobar
Python<a href="https://..."></a>

My regex matches the first example correctly, but not the second. I want it to return "Python". What am I missing?

Answer 1

Use the expression:

(?<=>)[^<:]+(?=:?<)

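(?<=>) is a positive lookbehind for >.
[^<:]+ matches one or more characters other than < or :.
(?=:?<) is a positive lookahead for an optional colon : followed by <.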

You can try the expression online here.

In Python:

import re
first_string = '<h4 id="Foobar:">Foobar:</h4>'
second_string = '<h1 id="Monty">Python<a href="https://..."></a></h1>'

print(re.findall(r'(?<=>)[^<:]+(?=:?<)',first_string)[0])
print(re.findall(r'(?<=>)[^<:]+(?=:?<)',second_string)[0])

Prints:

Foobar
Python

Alternatively, you can use the expression:

(?<=>)[a-zA-Z]+(?=\W*<)

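(?<=>) is a positive lookbehind for >.
[a-zA-Z]+ matches one or more lowercase or uppercase letters.
(?=\W*<) is a positive lookahead for any number of non-word characters followed by <.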

You can test this expression here.

print(re.findall(r'(?<=>)[a-zA-Z]+(?=\W*<)',first_string)[0])
print(re.findall(r'(?<=>)[a-zA-Z]+(?=\W*<)',second_string)[0])

Prints:

Foobar
Python
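
One caveat with the snippets above: re.findall returns an empty list when nothing matches, so indexing with [0] raises IndexError. A minimal sketch of a safer wrapper follows; the names TEXT_AFTER_TAG and first_text are hypothetical and chosen only for illustration, and the pattern is the first expression from this answer.

```python
import re

# Hypothetical constant name; the pattern is the first expression above.
TEXT_AFTER_TAG = re.compile(r'(?<=>)[^<:]+(?=:?<)')

def first_text(fragment):
    """Return the first matched text, or None when nothing matches."""
    match = TEXT_AFTER_TAG.search(fragment)
    return match.group(0) if match else None

print(first_text('<h4 id="Foobar:">Foobar:</h4>'))  # Foobar
print(first_text('<p></p>'))                        # None
```

Returning None instead of raising lets the caller decide how to handle input that has no text between tags.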

Answer 2

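You are missing the greediness of the * quantifier: combined with ., it matches as many characters as possible. To switch the quantifier to non-greedy mode, add ?: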

\>(.*?)\<

You can read more in the *?, +?, ?? section of the documentation.
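
To see the difference on the second sample string from the question, here is a small sketch comparing the two quantifiers. The variable names greedy and lazy are chosen only for illustration, and the backslashes before > and < are dropped because neither character needs escaping in Python's re module.

```python
import re

second_string = '<h1 id="Monty">Python<a href="https://..."></a></h1>'

# Greedy: .* runs from the first > all the way to the last <.
greedy = re.search(r'>(.*)<', second_string).group(1)

# Non-greedy: .*? stops at the first < that follows the first >.
lazy = re.search(r'>(.*?)<', second_string).group(1)

print(greedy)  # Python<a href="https://..."></a>
print(lazy)    # Python
```

Note that re.findall with the non-greedy pattern also returns empty strings for adjacent tags such as ></, so re.search (or taking the first element) is the simpler choice when only the first piece of text is wanted.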

