Question

I am trying to parse an HTML page that contains table rows. I need to get all of the table cells within each table row.

Here is a sample of the HTML I am trying to parse:

<tr style="font-size:8pt;">
    <TD style="font-size:8pt;">1545644656</TD>
    <TD style="font-size:8pt;">Billy</TD>
    <TD style="font-size:8pt;">Johnson</TD>
    <TD style="font-size:8pt;">DEF</TD>

        <TD style="font-size:8pt;"></TD>
        <TD style="font-size:8pt;">1134 Main St</TD>
        <TD style="font-size:8pt;"></TD>
        <TD style="font-size:8pt;">AnyTown</TD>
        <TD style="font-size:8pt;">PA</TD>
        <TD style="font-size:8pt;">05405</TD>

</TR>

This is the regular expression I am using to capture everything between the opening and closing tr tags:

Regex exp = new Regex("<tr style=\"font-size:8pt;\">(.*?)</TR>", RegexOptions.IgnoreCase | RegexOptions.Multiline);

Then I use a foreach loop to iterate over all of my matches (there will be multiple rows):

foreach (Match mtch in exp.Matches(browser.Html))

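But it doesn't match anything. This exact code worked on the site before they added newlines (\n), back when the HTML was just one long string. Now it matches nothing with the multi-line layout they're using.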

Any ideas?

Answer 1

I would use a real HTML parser such as HtmlAgilityPack:

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);

var tds = doc.DocumentNode.Descendants("td")
                .Select(td=>td.InnerText)
                .ToList();

Answer 2

. is a wildcard that matches any character except \n.

http://msdn.microsoft.com/en-us/library/az24scfc.aspx#character_classes

http://msdn.microsoft.com/en-us/library/yd1hzczs.aspx

I believe you need RegexOptions.Singleline.
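
To illustrate the suggestion, here is a minimal sketch that assumes the only change needed is swapping RegexOptions.Multiline for RegexOptions.Singleline in the pattern from the question. Multiline only changes how ^ and $ behave, whereas Singleline lets . match \n as well. The class name RowMatcher, the method CountRows, and the shortened sample string are made up for this example and are not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RowMatcher
{
    // Same pattern as in the question; only the options differ.
    // Singleline makes '.' match '\n', so (.*?) can span several lines.
    private static readonly Regex RowPattern = new Regex(
        "<tr style=\"font-size:8pt;\">(.*?)</tr>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Hypothetical helper: number of rows the pattern finds in the given HTML.
    public static int CountRows(string html)
    {
        return RowPattern.Matches(html).Count;
    }

    public static void Main()
    {
        // Shortened stand-in for the multi-line HTML in the question.
        string html =
            "<tr style=\"font-size:8pt;\">\n" +
            "    <TD style=\"font-size:8pt;\">Billy</TD>\n" +
            "</TR>";

        Console.WriteLine(CountRows(html)); // prints 1
    }
}
```

With RegexOptions.Multiline in place of Singleline, the same input yields no matches, which is the behaviour described in the question.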

Need to use a regular expression across multiple lines in C#
