Question

I have a simple HTML string, and I want to extract the content between two HTML tags.

My source string is:

"Hello <b>world</b> test"

I want to extract: "world"

How can I do this?

Answer 1

Assuming you don't mean any tag but one specific tag (<b> in this case), and assuming your HTML is well-formed and therefore contains no nested <b> tags:

(?s)<b[^>]*>((?:(?!</b>).)*)</b>

The result will be in group 1.

Explanation

(?s)       # Allow the dot to match newlines (hope you're not using JavaScript)
<b[^>]*>   # Match opening <b> tag
(          # Capture the following:
 (?:       #  Match (and don't capture)...
  (?!      #   (as long as we're not at the start of
    </b>   #    the string </b>
  )        #   )
  .        #  any character.
 )*        #  Repeat any number of times
)          # End of capturing group.
</b>       # Match closing </b> tag
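
For anyone who wants to try the pattern, here is a minimal sketch in Python. The answer does not name a language, so the use of Python's `re` module and the helper name `extract_bold` are assumptions made for illustration only.

```python
import re

# Pattern copied from the answer above; the leading (?s) lets "." match newlines.
B_TAG = re.compile(r"(?s)<b[^>]*>((?:(?!</b>).)*)</b>")

def extract_bold(html):
    """Return the contents of every non-nested <b>...</b> pair (group 1)."""
    # With exactly one capturing group, findall returns that group's text.
    return B_TAG.findall(html)

print(extract_bold("Hello <b>world</b> test"))  # ['world']
```

One caveat: `<b[^>]*>` also matches any tag whose name merely starts with "b", such as `<br>`, so it is only safe on input where that cannot happen.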

Answer 2

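While this may be possible in very simple cases, I strongly advise against it. Regular expressions are not powerful enough to parse HTML. Use a proper HTML parsing library.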
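
As a rough illustration of the parser route, here is a sketch using `html.parser` from the Python standard library. The answer does not recommend a particular library, so this choice, the class name `BoldTextExtractor`, and the helper `extract_bold_text` are assumptions, not the answerer's method.

```python
from html.parser import HTMLParser

class BoldTextExtractor(HTMLParser):
    """Collects the text inside each outermost <b>...</b> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <b> tags are currently open
        self.results = []   # one string per outermost <b> element
        self._buf = []      # text pieces of the element being read

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            if self.depth == 0:
                self._buf = []
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "b" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self._buf))

    def handle_data(self, data):
        if self.depth > 0:
            self._buf.append(data)

def extract_bold_text(html):
    parser = BoldTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.results

print(extract_bold_text("Hello <b>world</b> test"))  # ['world']
```

Unlike the regex approach, this sketch tolerates attributes on the tag and nested <b> elements, because the parser tracks tag depth instead of matching text.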

Answer 3

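I don't know which language you're using, so here is a VB.NET example:

The pattern would be "hello (.*) test"

The Regex.Matches function takes your input and the pattern and returns a collection of matches. Each match contains groups: group 0 is the whole match, "hello world test", and group 1 is the text inside the parentheses, "world".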

System.Text.RegularExpressions.Regex.Matches("hello world test", "hello (.+) test").Item(0).Groups(1)

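As Dervall said, regex may not be powerful enough for what you want. You would probably need to modify the pattern considerably to make it work with HTML, for example to account for whitespace (spaces, tabs, and newlines).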
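
For readers not on .NET, the same group idea can be sketched in Python. The function names `capture_groups` and `capture_loose` and the whitespace-tolerant pattern are hypothetical, added only to illustrate the point about spaces, tabs, and newlines.

```python
import re

def capture_groups(text):
    """Return (group 0, group 1) for the pattern "hello (.+) test"."""
    m = re.search(r"hello (.+) test", text)
    if m is None:
        return None
    return m.group(0), m.group(1)

# Hypothetical variant: \s+ accepts any run of spaces, tabs, or newlines,
# and re.DOTALL lets "." cross line breaks inside the captured text.
LOOSE = re.compile(r"hello\s+(.+?)\s+test", re.DOTALL)

def capture_loose(text):
    m = LOOSE.search(text)
    return m.group(1) if m else None

print(capture_groups("hello world test"))       # ('hello world test', 'world')
print(capture_loose("hello\n\tworld  \n test"))  # world
```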

Answer 4

I would use the following expression, which also verifies that the closing tag matches the opening tag.

(?<=<(b)>)[^>]+(?=</\1>)

A more "digestible" version is:

(?<=<(b)>)[^>]+(?=</b>)
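
A quick way to check the lookaround version is a small Python sketch. The answer does not say which regex engine it targets, so Python's `re` module and the helper name `extract_between_b` are assumptions here.

```python
import re

# Pattern copied from the answer: the lookbehind captures the tag name in
# group 1, and the lookahead reuses it through the backreference \1.
BETWEEN = re.compile(r"(?<=<(b)>)[^>]+(?=</\1>)")

def extract_between_b(html):
    # The lookarounds consume nothing, so the wanted text is the whole match
    # (group 0). findall would return group 1, the tag name, hence finditer.
    return [m.group(0) for m in BETWEEN.finditer(html)]

print(extract_between_b("Hello <b>world</b> test"))  # ['world']
```

Python's `re` only accepts fixed-width lookbehinds, so under this engine the pattern cannot be extended to opening tags that carry attributes.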
