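Regex: how do I exclude one group inside another group?

Asked: 2014-05-06 13:45:12

Tags: c# regex

I have some computer-generated text like the following (I adjusted the whitespace to make it easier on the eyes).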

<li class="activitybit forum_post">
    <div class="avatar">
            <img src="image.php?s=64ca7b4cc0fa2850f6c763105eee901b&amp;u=37080&amp;dateline=1396817868&amp;type=thumb" alt="killathi's Avatar" />
    </div>
    <div class="content hasavatar">
        <div class="datetime">
             <span class="date">Today,&nbsp;<span class="time">07:14 PM</span></span>
        </div>
        <div class="title">
                <a href="member.php?37080-killathi&amp;s=64ca7b4cc0fa2850f6c763105eee901b">killathi</a> replied to a thread  <a href="showthread.php?1016907-doodles!-Maybe-I-won-t-have-lines-in-it-this-time!!!-MUAHAHHAHAHAAHAH&amp;s=64ca7b4cc0fa2850f6c763105eee901b">doodles! Maybe I won't have lines in it this time!!! MUAHAHHAHAHAAHAH</a> in <a href="forumdisplay.php?208-Fan-Creations&amp;s=64ca7b4cc0fa2850f6c763105eee901b">Fan Creations</a>
        </div>
        <div class="excerpt">I'll hold this one here for now I guess, not really sure where to go with it lol</div>     
        <div class="fulllink"><a href="showthread.php?1016907-doodles!-Maybe-I-won-t-have-lines-in-it-this-time!!!-MUAHAHHAHAHAAHAH&amp;s=64ca7b4cc0fa2850f6c763105eee901b&amp;p=9844450#post9844450">see more</a></div>

    </div>
    <div class="views">77 replies | 3407 view(s)</div>
</li>

I used the regex (?:<div class=\"title\">)((?:[\s\S]*?))(?:</div>) and got the following in the first capturing group:

<a href="member.php?37080-killathi&amp;s=64ca7b4cc0fa2850f6c763105eee901b">killathi</a> replied to a thread  <a href="showthread.php?1016907-doodles!-Maybe-I-won-t-have-lines-in-it-this-time!!!-MUAHAHHAHAHAAHAH&amp;s=64ca7b4cc0fa2850f6c763105eee901b">doodles! Maybe I won't have lines in it this time!!! MUAHAHHAHAHAAHAH</a> in <a href="forumdisplay.php?208-Fan-Creations&amp;s=64ca7b4cc0fa2850f6c763105eee901b">Fan Creations</a>
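For reference, the extraction step can be sketched in C# roughly as follows. This is only a minimal sketch: the sample markup is shortened, and the names TitleExtraction and ExtractTitle are hypothetical, not taken from the real code.

```csharp
using System;
using System.Text.RegularExpressions;

static class TitleExtraction
{
    // Hypothetical helper: returns the contents of the first
    // <div class="title"> element, or null if there is none.
    public static string ExtractTitle(string html)
    {
        // Same pattern as above, written as a verbatim string ("" is one quote).
        // Group 1 lazily captures everything up to the first </div>.
        Match m = Regex.Match(html, @"(?:<div class=""title"">)((?:[\s\S]*?))(?:</div>)");
        return m.Success ? m.Groups[1].Value.Trim() : null;
    }

    static void Main()
    {
        // Shortened stand-in for the generated markup (assumed sample).
        string html =
            "<div class=\"title\">\n" +
            "    <a href=\"member.php?37080-killathi\">killathi</a> replied to a thread\n" +
            "</div>";

        Console.WriteLine(ExtractTitle(html));
    }
}
```

The captured group still contains the <a ...> tags, which is the part I want to get rid of.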

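However, I would like to know whether it is possible (and if so, how) to use a regex to exclude everything inside angle brackets.

I know I need to change something in ((?:[\s\S]*?)), but I'm not sure what. (It is safe to assume that all the text follows this format.)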

3 answers:

Answer 0 (score: 2)

To strip everything inside angle brackets (along with the brackets themselves), just use this regex:

<[^>]*>

Like this:

string output = Regex.Replace(input, "<[^>]*>", "");

See the Regex.Replace documentation for details.
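Applied to the fragment captured from the title div, this might look like the sketch below. It goes a little beyond the one-liner: it also decodes HTML entities with WebUtility.HtmlDecode and collapses runs of whitespace, which is an assumption about the desired output. The names TitleText and StripTags are hypothetical, and the URLs in the sample are shortened.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

static class TitleText
{
    // Hypothetical helper: removes tags, decodes entities such as &amp;,
    // then collapses whitespace runs into single spaces.
    public static string StripTags(string fragment)
    {
        string noTags = Regex.Replace(fragment, "<[^>]*>", "");
        string decoded = WebUtility.HtmlDecode(noTags);
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }

    static void Main()
    {
        // Shortened version of the captured title fragment (assumed sample).
        string fragment =
            "<a href=\"member.php?37080-killathi\">killathi</a> replied to a thread  " +
            "<a href=\"showthread.php?1016907\">doodles!</a> in " +
            "<a href=\"forumdisplay.php?208-Fan-Creations\">Fan Creations</a>";

        // Prints: killathi replied to a thread doodles! in Fan Creations
        Console.WriteLine(StripTags(fragment));
    }
}
```

Doing the replacement as a second step keeps the original extraction pattern unchanged, which is probably simpler than trying to skip the tags inside the capturing group itself.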

Answer 1 (score: 2)

I suggest you use this library: HTML Agility Pack

You can extract the text simply, like this:

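// Requires the HtmlAgilityPack package (using HtmlAgilityPack;).
var doc = new HtmlDocument();
doc.LoadHtml(yourHtml);

// SelectSingleNode returns null when nothing matches the XPath.
var node = doc.DocumentNode.SelectSingleNode("//div[@class='title']");
string result = node == null ? null : node.InnerText;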

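Answer 2 (score: 1)

I think Regex.Replace could do this, but in the general case manipulating HTML with regex is very hard. Here is a fiddle that demonstrates the use of (<.+?>). It works for your example, but I make no guarantees!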
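One case where (<.+?>) could fall short, sketched here under the assumption that a tag might be split across lines: by default . does not match a newline in .NET, so such a tag is skipped, whereas the negated class <[^>]*> still matches it. The names TagPatternComparison, StripDot and StripNegated are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class TagPatternComparison
{
    // Lazy dot: needs at least one character and stops at newlines
    // unless RegexOptions.Singleline is set.
    public static string StripDot(string s)
    {
        return Regex.Replace(s, "<.+?>", "");
    }

    // Negated class: matches any run of characters other than '>',
    // newlines included.
    public static string StripNegated(string s)
    {
        return Regex.Replace(s, "<[^>]*>", "");
    }

    static void Main()
    {
        // Assumed sample: an opening tag broken across two lines.
        string multiLine = "<a\nhref=\"x\">link</a>";

        Console.WriteLine(StripDot(multiLine));     // opening tag is left in place
        Console.WriteLine(StripNegated(multiLine)); // prints: link
    }
}
```

For single-line tags like the ones in the question, both patterns give the same result.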