Question

I have the following text: lorem ipsum lorem ipsum. I need to split it into four groups (using a regular expression):

lorem
ipsum
lorem ipsum
lorem ipsum

I think I should do it like this:

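<p></p> - for a paragraph
<p\s></p\s*> - for a space after <p and any amount of whitespace before >
<p\s.*></p\s*> - for any characters before the > of the opening tag (for a class, etc.)
<p\s.*>.*</p\s*> - for any content inside the paragraph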

But now, if I have Lorem Ipsum, I get a single group ['Lorem Ipsum']. I understand why, but I don't know how to improve it, because I need two groups ['Lorem', 'Ipsum']. Do you have any ideas?

PS: I'm using Python and the re module.

Answer 1

In re, .* is greedy, which means it matches as much text as possible. Add ? to make the quantifier non-greedy:

 <p\s.*?>.*?</p\s*?>
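
A minimal sketch of the difference, assuming the input is a run of <p> elements on a single line. The sample string and the variable names are made up for illustration. The pattern above requires at least one whitespace character after <p, so this sketch uses a slightly relaxed opening-tag pattern, <p\b[^>]*>, which also accepts a bare <p>; that variant is an assumption of this example, not part of the answer.

```python
import re

# Hypothetical sample input: four paragraphs on one line, one with an attribute.
html = "<p>lorem</p><p>ipsum</p><p class='x'>lorem ipsum</p><p>lorem ipsum</p>"

# Greedy: .* runs on to the last closing tag, so everything collapses into one match.
greedy = re.findall(r"<p\b[^>]*>(.*)</p\s*>", html)

# Non-greedy: .*? stops at the first closing tag, giving one match per paragraph.
lazy = re.findall(r"<p\b[^>]*>(.*?)</p\s*>", html)

print(len(greedy))  # 1
print(lazy)         # ['lorem', 'ipsum', 'lorem ipsum', 'lorem ipsum']
```

If paragraphs may span several lines, passing re.DOTALL lets . match newlines as well.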

Here is what the documentation says:

*?, +?, ??

The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. 
Sometimes this behaviour isn’t desired; if the RE <.*> is matched against 
'<H1>title</H1>', it will match the entire string, and not just '<H1>'. Adding '?' 
after the qualifier makes it perform the match in non-greedy or minimal fashion; as
few characters as possible will be matched. Using .*? in the previous expression will 
match only '<H1>'.
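
The example in the quoted passage can be reproduced directly. This is only a small check of the behaviour described there; the variable names are illustrative.

```python
import re

text = "<H1>title</H1>"

# Greedy: <.*> extends to the last '>' in the string.
greedy = re.match(r"<.*>", text).group(0)

# Non-greedy: <.*?> stops at the first '>' it can.
lazy = re.match(r"<.*?>", text).group(0)

print(greedy)  # <H1>title</H1>
print(lazy)    # <H1>
```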

The documentation is available here:

https://docs.python.org/2/library/re.html

Regular expression for (nested) HTML tags
