Question

I have some computer-generated text, shown below (I adjusted the whitespace to make it easier on the eyes).

<li class="activitybit forum_post">
    <div class="avatar">
            <img src="image.php?s=64ca7b4cc0fa2850f6c763105eee901b&amp;u=37080&amp;dateline=1396817868&amp;type=thumb" alt="killathi's Avatar" />
    </div>
    <div class="content hasavatar">
        <div class="datetime">
             <span class="date">Today,&nbsp;<span class="time">07:14 PM</span></span>
        </div>
        <div class="title">
                <a href="member.php?37080-killathi&amp;s=64ca7b4cc0fa2850f6c763105eee901b">killathi</a> replied to a thread  <a href="showthread.php?1016907-doodles!-Maybe-I-won-t-have-lines-in-it-this-time!!!-MUAHAHHAHAHAAHAH&amp;s=64ca7b4cc0fa2850f6c763105eee901b">doodles! Maybe I won't have lines in it this time!!! MUAHAHHAHAHAAHAH</a> in <a href="forumdisplay.php?208-Fan-Creations&amp;s=64ca7b4cc0fa2850f6c763105eee901b">Fan Creations</a>
        </div>
        <div class="excerpt">I'll hold this one here for now I guess, not really sure where to go with it lol</div>     
        <div class="fulllink"><a href="showthread.php?1016907-doodles!-Maybe-I-won-t-have-lines-in-it-this-time!!!-MUAHAHHAHAHAAHAH&amp;s=64ca7b4cc0fa2850f6c763105eee901b&amp;p=9844450#post9844450">see more</a></div>

    </div>
    <div class="views">77 replies | 3407 view(s)</div>
</li>

I used the regex (?:<div class=\"title\">)((?:[\s\S]*?))(?:</div>) and, in the first capturing (non-ignored) group, extracted the following:

<a href="member.php?37080-killathi&amp;s=64ca7b4cc0fa2850f6c763105eee901b">killathi</a> replied to a thread  <a href="showthread.php?1016907-doodles!-Maybe-I-won-t-have-lines-in-it-this-time!!!-MUAHAHHAHAHAAHAH&amp;s=64ca7b4cc0fa2850f6c763105eee901b">doodles! Maybe I won't have lines in it this time!!! MUAHAHHAHAHAAHAH</a> in <a href="forumdisplay.php?208-Fan-Creations&amp;s=64ca7b4cc0fa2850f6c763105eee901b">Fan Creations</a>

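However, I would like to know whether it is possible (and if so, how) to use a regex to exclude everything inside the angle brackets.

I know I need to do something in ((?:[\s\S]*?)), but I am not sure what. (It is safe to assume that all of the text follows this format.)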

Answer 1

To remove everything inside angle brackets (the brackets included), just use this regex:

<[^>]*>

Like this:

// Requires: using System.Text.RegularExpressions;
string output = Regex.Replace(input, "<[^>]*>", "");

Here are the docs.
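A capture group's value is always one contiguous slice of the input, so the tags cannot be left out of the group itself. One way to combine the asker's pattern with the replacement above is to do it in two steps: capture first, then strip. The following is a minimal sketch under that assumption; the HTML string is a shortened, hypothetical stand-in for the markup in the question.

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // Shortened, hypothetical stand-in for the forum markup in the question.
        string html =
            "<div class=\"title\">\n" +
            "  <a href=\"member.php?37080-killathi\">killathi</a> replied to a thread " +
            "<a href=\"showthread.php?1016907\">doodles!</a> in " +
            "<a href=\"forumdisplay.php?208-Fan-Creations\">Fan Creations</a>\n" +
            "</div>";

        // Step 1: the asker's pattern; group 1 holds the inner markup of the title div.
        Match m = Regex.Match(html, "(?:<div class=\"title\">)((?:[\\s\\S]*?))(?:</div>)");
        if (!m.Success)
        {
            Console.WriteLine("no title div found");
            return;
        }

        // Step 2: delete every <...> run from the captured text, then trim the edges.
        string text = Regex.Replace(m.Groups[1].Value, "<[^>]*>", "").Trim();

        Console.WriteLine(text); // killathi replied to a thread doodles! in Fan Creations
    }
}
```

The Trim() call is an addition of this sketch; it only removes the whitespace that surrounds the links inside the div.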

Answer 2

I suggest you use this library: HTML Agility Pack

You can extract the text simply, like this:

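using HtmlAgilityPack;

// Parse the HTML fragment.
var doc = new HtmlDocument();
doc.LoadHtml(yourHtml);

// Select the title div; SelectSingleNode returns null if nothing matches.
var node = doc.DocumentNode.SelectSingleNode("//div[@class='title']");
// InnerText is the text content of the node with the tags dropped.
string result = node.InnerText;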
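A slightly more defensive variant of the snippet above might look like the sketch below. It assumes the HtmlAgilityPack NuGet package is referenced and uses a shortened, hypothetical HTML string. Two things are added that the answer does not mention: a null check, because SelectSingleNode returns null when no node matches, and a call to HtmlEntity.DeEntitize, on the assumption that you want entities such as &amp;nbsp; decoded in the extracted text.

```csharp
using System;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack" (assumed to be installed)

class Program
{
    static void Main()
    {
        // Shortened, hypothetical stand-in for the markup in the question.
        string yourHtml =
            "<div class=\"title\">" +
            "<a href=\"member.php?37080-killathi\">killathi</a> replied to a thread" +
            "</div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(yourHtml);

        // Exact match on the class attribute, as in the answer above.
        var node = doc.DocumentNode.SelectSingleNode("//div[@class='title']");
        if (node == null)
        {
            Console.WriteLine("no title div found");
            return;
        }

        // Decode HTML entities in the text and trim the surrounding whitespace.
        string result = HtmlEntity.DeEntitize(node.InnerText).Trim();
        Console.WriteLine(result);
    }
}
```

Note that @class='title' only matches when the attribute is exactly "title"; a div such as class="content hasavatar" would need a contains-style XPath test instead.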

Answer 3

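I think Regex.Replace could do this, but in the general case it is very hard to manipulate HTML with regex. Here is a fiddle that demonstrates the use of (<.+?>). It works for your example, but I make no guarantees!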
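One concrete case where (<.+?>) and <[^>]*> behave differently is a tag that is split across lines, because '.' does not match a newline in .NET unless RegexOptions.Singleline is set. The sketch below illustrates this with a hypothetical input that is not taken from the question.

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // Hypothetical input: the opening tag is split across two lines.
        string input = "<a\n href=\"member.php\">killathi</a> replied";

        // '.' does not match '\n' by default, so only </a> is removed here.
        string lazyDot = Regex.Replace(input, "<.+?>", "");

        // With Singleline, '.' also matches '\n' and both tags are removed.
        string lazyDotSingleline = Regex.Replace(input, "<.+?>", "", RegexOptions.Singleline);

        // A negated class matches any character except '>', newlines included.
        string negatedClass = Regex.Replace(input, "<[^>]*>", "");

        // Newlines are escaped only to keep each result on one output line.
        Console.WriteLine(lazyDot.Replace("\n", "\\n"));           // <a\n href="member.php">killathi replied
        Console.WriteLine(lazyDotSingleline.Replace("\n", "\\n")); // killathi replied
        Console.WriteLine(negatedClass.Replace("\n", "\\n"));      // killathi replied
    }
}
```

Both patterns stop at the first '>' they see, so an attribute value that itself contains '>' would still be cut in the wrong place. That is one reason the answers above hedge about regex on general HTML.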

Regex: how do I exclude one group inside another group?
