Question

I am using the following code to extract the content of a div in this format: <div id="post-contents"></div>

string findtext2 = @"<div[^>]*\sid=""post-contents""[^>]*>(.*?)</div>";
string myregex2 = txt;
MatchCollection doregex2 = Regex.Matches(myregex2, findtext2);
string matches2 = "";
foreach (Match match2 in doregex2)
{
    matches2 = (matches2 + (match2.ToString()));
}
return matches2;

However, I run into errors caused by HTML tags. In practice the div contains other HTML tags, like this:

<div id="post-contents"><p dir="ltr">HI HI HI</p></div>

Could you help me: how can I get <p dir="ltr">HI HI HI</p>?

Thanks

Answer 1

Your regular expression works fine for the case described: https://regex101.com/r/jbDN1U/1. However, a regex cannot handle a case like this:

<div id="post-contents"><div dir="ltr">HI HI HI</div></div>

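Here the regex cannot tell which closing </div> belongs to the outer element: the lazy (.*?) stops at the first one it finds, which in this case closes the inner div. As mentioned in the comments, consider using an XML parser.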
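To make the difference concrete, here is a minimal sketch that runs both approaches on the two inputs from this thread. The class and method names (PostContentsDemo, ExtractWithRegex, ExtractWithParser) are made up for illustration. The parser variant uses System.Xml.Linq and therefore assumes the input is well-formed XML with no default namespace. Real-world HTML often is not, in which case a dedicated HTML parsing library would be needed instead.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class PostContentsDemo
{
    // Regex approach: capture group 1 holds whatever sits between the
    // opening tag and the FIRST closing </div> (the quantifier is lazy).
    public static string ExtractWithRegex(string html)
    {
        const string pattern = @"<div[^>]*\sid=""post-contents""[^>]*>(.*?)</div>";
        Match m = Regex.Match(html, pattern, RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value : "";
    }

    // Parser approach (assumes well-formed XML): find the div by id and
    // concatenate its child nodes, so nested divs stay intact.
    public static string ExtractWithParser(string xml)
    {
        XDocument doc = XDocument.Parse(xml); // throws XmlException on malformed input
        XElement div = doc.Descendants("div")
            .FirstOrDefault(e => (string)e.Attribute("id") == "post-contents");
        if (div == null) return "";
        return string.Concat(
            div.Nodes().Select(n => n.ToString(SaveOptions.DisableFormatting)));
    }

    public static void Main()
    {
        string flat = "<div id=\"post-contents\"><p dir=\"ltr\">HI HI HI</p></div>";
        string nested = "<div id=\"post-contents\"><div dir=\"ltr\">HI HI HI</div></div>";

        Console.WriteLine(ExtractWithRegex(flat));    // <p dir="ltr">HI HI HI</p>
        Console.WriteLine(ExtractWithRegex(nested));  // <div dir="ltr">HI HI HI   (cut short)
        Console.WriteLine(ExtractWithParser(nested)); // <div dir="ltr">HI HI HI</div>
    }
}
```

Note that the regex variant reads match.Groups[1].Value rather than match.ToString(): the latter returns the whole match, including the surrounding <div id="post-contents"> tags.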

Extracting HTML content in a div using regex
