Question

Using a regular expression, I want to get the text between multiple DIV tags. For example, given the following:

<div>first html tag</div>
<div>another tag</div>

Output:

first html tag
another tag

The regex pattern I am using only matches my last div tag and misses the first one. Code:

    static void Main(string[] args)
    {
        string input = "<div>This is a test</div><div class=\"something\">This is ANOTHER test</div>";
        string pattern = "(<div.*>)(.*)(<\\/div>)";

        MatchCollection matches = Regex.Matches(input, pattern);
        Console.WriteLine("Matches found: {0}", matches.Count);

        if (matches.Count > 0)
            foreach (Match m in matches)
                Console.WriteLine("Inner DIV: {0}", m.Groups[2]);

        Console.ReadLine();
    }

Output:

Matches found: 1
Inner DIV: This is ANOTHER test

Answer 1

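Replace your pattern with a non-greedy (lazy) one. The greedy .* in <div.*> runs to the last > that still lets the rest of the pattern match, so the first div is swallowed by the opening-tag group; .*? stops at the earliest position that works.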

static void Main(string[] args)
{
    string input = "<div>This is a test</div><div class=\"something\">This is ANOTHER test</div>";
    string pattern = "<div.*?>(.*?)<\\/div>";

    MatchCollection matches = Regex.Matches(input, pattern);
    Console.WriteLine("Matches found: {0}", matches.Count);

    if (matches.Count > 0)
        foreach (Match m in matches)
            Console.WriteLine("Inner DIV: {0}", m.Groups[1]);

    Console.ReadLine();
}
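
To see the difference side by side, here is a minimal sketch that runs both patterns over the question's input. The class name GreedyVsLazy and the helper InnerTexts are made up for this illustration; only Regex.Matches and Match.Groups come from .NET.

```csharp
using System;
using System.Text.RegularExpressions;

static class GreedyVsLazy
{
    // Hypothetical helper: returns the given capture group from every match.
    public static string[] InnerTexts(string input, string pattern, int group)
    {
        MatchCollection matches = Regex.Matches(input, pattern);
        string[] result = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
            result[i] = matches[i].Groups[group].Value;
        return result;
    }

    static void Main()
    {
        string input = "<div>This is a test</div><div class=\"something\">This is ANOTHER test</div>";

        // Greedy: the ".*" inside the opening tag runs past the first div.
        string[] greedy = InnerTexts(input, "(<div.*>)(.*)(</div>)", 2);
        // Lazy: ".*?" stops at the earliest position that lets the match succeed.
        string[] lazy = InnerTexts(input, "<div.*?>(.*?)</div>", 1);

        Console.WriteLine("Greedy: {0} match(es)", greedy.Length);
        Console.WriteLine("Lazy:   {0} match(es)", lazy.Length);
        foreach (string s in lazy)
            Console.WriteLine("Inner DIV: {0}", s);
    }
}
```

The greedy pattern yields a single match whose second group is only the last div's text, while the lazy pattern yields one match per div.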

Answer 2

Since the other answers do not mention HTML tags with attributes, here is my solution for that case:

// <TAG(.*?)>(.*?)</TAG>
// Example
var regex = new System.Text.RegularExpressions.Regex("<h1(.*?)>(.*?)</h1>");
var m = regex.Match("Hello <h1 style='color: red;'>World</h1> !!");
Console.Write(m.Groups[2].Value); // will print -> World

Answer 3

First, keep in mind that an HTML file will contain newline characters ("\n"), which you did not include in the string you used to test your regex.
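
As a rough illustration of the newline point: in .NET, "." does not match "\n" unless RegexOptions.Singleline is set. The sketch below assumes a made-up input whose second div spans two lines; the class name NewlineDemo and the helper CountDivs are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class NewlineDemo
{
    // Hypothetical helper: counts div matches under the given options.
    public static int CountDivs(string html, RegexOptions options)
    {
        return Regex.Matches(html, "<div[^>]*>(.*?)</div>", options).Count;
    }

    static void Main()
    {
        // Assumed input: the text of the second div contains a line break.
        string html = "<div>first html tag</div>\n<div>another\ntag</div>";

        // Default options: "." stops at "\n", so the second div is missed.
        Console.WriteLine(CountDivs(html, RegexOptions.None));
        // Singleline: "." matches every character, including "\n".
        Console.WriteLine(CountDivs(html, RegexOptions.Singleline));
    }
}
```

With default options only the first div is found; with Singleline both are.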

Second, take your regex and wrap it in a quantified group:

((<div.*>)(.*)(<\\/div>))+ //This Regex will look for any amount of div tags, but it must see at least one div tag.

((<div.*>)(.*)(<\\/div>))* //This regex will look for any amount of div tags, and it will not complain if there are no results at all.

These are also good places to look for this kind of information:

http://www.regular-expressions.info/reference.html

http://www.regular-expressions.info/refadv.html

Mayman

Answer 4

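Have you looked at the Html Agility Pack (see https://stackoverflow.com/a/857926/618649)?

CsQuery also looks useful (it basically uses a CSS-selector-style syntax to get elements). See https://stackoverflow.com/a/11090816/618649.

CsQuery is basically "jQuery for C#", which is almost exactly the search term I used to find it.

If you can do this in a web browser, you can easily use jQuery with syntax like $("div").each(function(idx){ alert(idx + ": " + $(this).text()); }); (though you would obviously send the results to a log, to the screen, or to a web service call, or do whatever else you need with them).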

Answer 5

I think this code should work:

string htmlSource = "<div>first html tag</div><div>another tag</div>";
string pattern = @"<div[^>]*?>(.*?)</div>";
MatchCollection matches = Regex.Matches(htmlSource, pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
ArrayList l = new ArrayList();
foreach (Match match in matches)
{
    l.Add(match.Groups[1].Value);
}

Answer 6

I hope the regex below works:

<div.*?>(.*?)</div>

You will get the desired output:

This is a test
This is ANOTHER test

Answer 7

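The short version is that you cannot do this correctly in all cases. There will always be valid HTML from which a regular expression fails to extract the information you want.

The reason is that HTML is a context-free grammar, which is more complex than a regular expression can describe.

Here is an example: what if you have multiple nested divs?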

<div><div>stuff</div><div>stuff2</div></div>

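The regexes listed in the other answers can grab any of the following, depending on greediness and on where the match starts: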

<div><div>stuff</div>
<div>stuff</div>
<div>stuff</div><div>stuff2</div>
<div>stuff</div><div>stuff2</div></div>
<div>stuff2</div>
<div>stuff2</div></div>

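That is what regular expressions do when you try to parse HTML with them.

You cannot write a regex that knows how to interpret every case, because regular expressions are not capable of it. If you are dealing with a very specific, constrained set of HTML it may be possible, but you should keep this limitation in mind.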
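
The nested example can be checked directly. This is only a sketch: it uses one lazy pattern in the style of the other answers, and the class name NestedDivDemo and the helper Capture are made up for the illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class NestedDivDemo
{
    // Hypothetical helper: collects group 1 of every lazy div match.
    public static List<string> Capture(string html)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(html, "<div[^>]*>(.*?)</div>"))
            result.Add(m.Groups[1].Value);
        return result;
    }

    static void Main()
    {
        // The outer div is paired with the first closing tag it meets,
        // so the first capture is "<div>stuff" rather than the outer content.
        foreach (string s in Capture("<div><div>stuff</div><div>stuff2</div></div>"))
            Console.WriteLine(s);
    }
}
```

The pattern returns two captures, "<div>stuff" and "stuff2", and never sees the outer div as a unit, because it has no way to count how many divs are currently open.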

More information: https://stackoverflow.com/a/1732454/2022565

Using regular expressions to get text between multiple HTML tags
