Question

I'm working on a very simple task: parsing a website, finding

<tbody>this is what important for me</tbody>

and returning it, but I can't get it to work. When I do this:

Regex.Matches(webData, @"<tbody>(.*?)</tbody>")

it gives me no results. However, this gives me 2 results:

Regex.Matches(webData, @"tbody")

But then again, this

Regex.Matches(webData, @"tbody(.*?)tbody")

gives me nothing (so I don't think escaping is the problem). I found (.*?) on this page and thought it would be easy to use, but I can't solve this problem.

Answer 1

Using regex to parse HTML is not recommended.

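Regex is meant for regularly occurring patterns. HTML is not regular in its format (XHTML excepted). For example, an HTML file is still valid even if you leave out a closing tag! That could break your code.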
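As a side note on the symptom in the question: one likely cause, assuming the downloaded page has line breaks between the tags, is that in .NET the `.` in `(.*?)` does not match `\n` unless `RegexOptions.Singleline` is set. The sketch below only illustrates that behavior; the sample string and the `ExtractTbody` helper are made up for the demonstration, and it is not a recommendation to parse real pages this way.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TbodyRegexDemo
{
    // Hypothetical helper: returns the captured text of every <tbody>...</tbody> pair.
    public static string[] ExtractTbody(string html, RegexOptions options)
    {
        MatchCollection matches = Regex.Matches(html, @"<tbody>(.*?)</tbody>", options);
        var results = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            results[i] = matches[i].Groups[1].Value;
        }
        return results;
    }

    public static void Main()
    {
        // Assumed sample input; the line breaks mimic a real page.
        string webData = "<table>\n<tbody>\nthis is what important for me\n</tbody>\n</table>";

        // Default options: '.' stops at '\n', so the pattern finds nothing.
        Console.WriteLine(ExtractTbody(webData, RegexOptions.None).Length);       // prints 0

        // Singleline: '.' also matches '\n', so the block is found.
        Console.WriteLine(ExtractTbody(webData, RegexOptions.Singleline).Length); // prints 1
    }
}
```

This also explains why `tbody` alone matched twice while `tbody(.*?)tbody` matched nothing: the two occurrences sit on different lines. The pattern is still fragile against attributes such as `<tbody class="x">` or nested tables, which is where a parser helps.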

Use an HTML parser such as HtmlAgilityPack.

With HtmlAgilityPack, you can use this code to retrieve the contents of every tbody:

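// Needs the HtmlAgilityPack package, plus:
// using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);

// XPath is case-sensitive and HtmlAgilityPack exposes element names in lowercase, so query "//tbody".
// SelectNodes returns null when nothing matches, so guard before calling Select.
var nodes = doc.DocumentNode.SelectNodes("//tbody");
var tbodyList = nodes == null
    ? new List<string>()
    : nodes.Select(p => p.InnerText).ToList();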

tbodyList contains the text of every tbody in the whole document!

Answer 2

To parse a web page, use a real HTML parser such as HtmlAgilityPack:

string html = "<tbody>this is what important for me</tbody>";

var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);

var text = doc.DocumentNode.Descendants("tbody").First().InnerText;

Answer 3

I also recommend using HtmlAgilityPack.

You can also use XPath (http://www.w3schools.com/xpath/).

Applied to I4V's example:

var text = doc.DocumentNode.SelectSingleNode("//tbody").InnerText;
