Question

I have the following string:

data-event-title="Yuichi Sugita* vs Adrian Mannarino">
                              <span class="odds-container">
                                                             <b class="odds">1/12</b>
                                                                     </a>

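I want to capture Yuichi Sugita and 1/12. To do that, I wrote the following regular expression: ata-event-title="(.+)".+ class="odds">(.+)< It has two capture groups in parentheses (each works fine when I use it on its own), but the .+ between them does not behave as expected.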

Any suggestions are appreciated.

Answer 1

If you want to capture the text inside data-event-title="" and the 1/12, use the regular expression data\-event\-title\=\"(.+?)\"[^\0]*class\=\"odds\".*\>(.+?)\<
https://regex101.com/r/4loeLv/1

Or

If you only want the first person's name inside data-event-title="", use data\-event\-title\=\"(.+?) vs.*?\"[^\0]*class\=\"odds\".*\>(.+?)\<
https://regex101.com/r/4loeLv/2
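As a rough check, here is a minimal sketch that runs both patterns with Python's re module. The answer does not say which regex flavor it targets, so using Python is an assumption; the variable names are illustrative, and the sample string has its indentation shortened.

```python
import re

# Sample input from the question (indentation shortened; the line breaks are what matter).
html = '''data-event-title="Yuichi Sugita* vs Adrian Mannarino">
    <span class="odds-container">
        <b class="odds">1/12</b>
    </a>'''

# First pattern: the full attribute value plus the odds.
# [^\0]* matches any character except NUL, so it also crosses line breaks.
full_title = re.compile(r'data\-event\-title\=\"(.+?)\"[^\0]*class\=\"odds\".*\>(.+?)\<')

# Second pattern: only the text before " vs", plus the odds.
first_name = re.compile(r'data\-event\-title\=\"(.+?) vs.*?\"[^\0]*class\=\"odds\".*\>(.+?)\<')

m_full = full_title.search(html)
m_first = first_name.search(html)

print(m_full.groups())   # ('Yuichi Sugita* vs Adrian Mannarino', '1/12')
print(m_first.groups())  # ('Yuichi Sugita*', '1/12')
```

Note that the first group of the second pattern still ends with the asterisk, because the asterisk comes before " vs" in the attribute value.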

Answer 2

Your dots are "greedy", so they capture as much as they can (which is not what you want in this case).

You could change the capture-group quantifiers to "lazy", but using a negated character class (syntax [^character]) for the capture groups is more efficient.
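The difference between the three variants can be sketched on a one-line string. The sample line and variable names are made up for illustration.

```python
import re

# Hypothetical one-line input with two quoted attribute values.
line = 'title="A vs B" class="odds"'

# Greedy: .+ runs to the end of the line, then backtracks to the LAST quote.
greedy = re.search(r'title="(.+)"', line).group(1)

# Lazy: .+? grows one character at a time and stops at the FIRST quote.
lazy = re.search(r'title="(.+?)"', line).group(1)

# Negated character class: [^"]+ cannot cross a quote, so no backtracking is needed.
negated = re.search(r'title="([^"]+)"', line).group(1)

print(greedy)   # A vs B" class="odds
print(lazy)     # A vs B
print(negated)  # A vs B
```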

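The dot between the two capture groups can safely stay "greedy", because it still has to give way to class="odds"> (if the input contained several occurrences, a greedy dot would settle on the last one).

Assuming your sample input really contains newline characters as shown, your dot will stop at a newline unless you use the s flag in your pattern. Use this: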

r"data-event-title=\"([^*]+).*class=\"odds\">([^<]+)"s

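This will capture:

The substring that follows data-event-title=" and ends before the first occurrence of *.
The substring that follows the matched class="odds"> and ends before the next <.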

Here is a Python regex pattern demo.

If you want the full data-event-title attribute value instead, this captures Yuichi Sugita* vs Adrian Mannarino:

r"data-event-title=\"([^\"]+).*class=\"odds\">([^<]+)"s
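The r"..."s notation above is how the pattern and flag are displayed, not valid Python on its own. In actual Python code the s flag is passed as re.DOTALL. A minimal sketch, with illustrative variable names and the sample string's indentation shortened:

```python
import re

# Sample input from the question (indentation shortened; the line breaks are what matter).
html = '''data-event-title="Yuichi Sugita* vs Adrian Mannarino">
    <span class="odds-container">
        <b class="odds">1/12</b>
    </a>'''

# re.DOTALL is the "s" flag: it lets . match newline characters too.
name_only = re.search(r'data-event-title="([^*]+).*class="odds">([^<]+)', html, flags=re.DOTALL)
full_value = re.search(r'data-event-title="([^"]+).*class="odds">([^<]+)', html, flags=re.DOTALL)

# Without the flag, .* cannot cross the line breaks, so there is no match.
no_flag = re.search(r'data-event-title="([^*]+).*class="odds">([^<]+)', html)

print(name_only.groups())   # ('Yuichi Sugita', '1/12')
print(full_value.groups())  # ('Yuichi Sugita* vs Adrian Mannarino', '1/12')
print(no_flag)              # None
```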

Answer 3

I would use alternation, written with the pipe or vertical bar symbol (|). Read more here.

This regex does what you want:

>(.*)<|data-event-title="([^*]*.).*"

See the saved regex here: regex101
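With alternation, each match fills only one of the two groups, so the pattern has to be applied repeatedly. A minimal sketch using Python's re.finditer (Python is an assumption here; the answer does not name a flavor, and the variable names are illustrative):

```python
import re

# Sample input from the question (indentation shortened; the line breaks are what matter).
html = '''data-event-title="Yuichi Sugita* vs Adrian Mannarino">
    <span class="odds-container">
        <b class="odds">1/12</b>
    </a>'''

pattern = re.compile(r'>(.*)<|data-event-title="([^*]*.).*"')

captures = []
for m in pattern.finditer(html):
    # Exactly one alternative matched, so exactly one group is not None.
    captures.append(m.group(1) if m.group(1) is not None else m.group(2))

print(captures)  # ['Yuichi Sugita*', '1/12']
```

The name comes back with its trailing asterisk, because the . after [^*]* consumes it inside the group.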

How to capture two substrings from HTML text?
