Question

I have an HTML table like the one below:

<table border='1' width='100%'>
<tr>
<td>
<table border='1' width='100%'>
<tr>
    <th>
        <p>Title2</p>
    </th>
</tr>
<tr>
    <th>
        <div>Content2</div>
    </th>
</tr>
</table>
</td>

<td>
<table border='1' width='100%'>
<tr>
    <th>
        <p>Hello Title1</p>
    </th>
</tr>
<tr>
    <th>
        <div>Hello content 1</div>
    </th>
</tr>
</table>
</td>
</tr>
</table>

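I am building a Windows application that reads all the titles and displays them in a list. When the user selects a title in the list, the application should display the content of the corresponding table.

Q: How can I read all the titles and display them, using HTMLAgilityPack or any other parser?

So far I have this: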

        WebClient wc = new WebClient();
        System.IO.Stream stream = wc.OpenRead(strFilePath);
        StreamReader sReader = new StreamReader(stream);
        string strTables = sReader.ReadToEnd();
        if (!string.IsNullOrEmpty(strTables))
        { 
            //code to parse html tables
        }

Note that each title is inside a <p> element and each content block is inside a <div> element. Any ideas?

Answer 1

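Well-formed HTML like yours is also XML, so why not use XmlReader?

After that, with the XmlDocument methods and LINQ you can find whatever you are looking for. That will give you code that is more flexible, more maintainable, and more efficient than anything you write by hand.

That is, of course, assuming you mean "without an external HTML parser".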
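
As a minimal sketch of that idea, assuming the markup is well-formed XML as in the question's sample, LINQ to XML (`XDocument`, swapped in here for `XmlReader`/`XmlDocument`) can pair each nested table's title with its content. The class name `TableXmlSketch` and the method `ReadTables` are hypothetical names chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class TableXmlSketch
{
    // Returns one (title, content) pair per nested table.
    // Assumption: the HTML is well-formed XML; otherwise XDocument.Parse throws XmlException.
    public static List<KeyValuePair<string, string>> ReadTables(string html)
    {
        XDocument doc = XDocument.Parse(html);

        // Descendants() excludes the root itself, so only the inner tables are visited.
        return doc.Root
            .Descendants("table")
            .Select(t => new KeyValuePair<string, string>(
                // Title is the first <p>, content the first <div>, inside each inner table.
                (string)t.Descendants("p").FirstOrDefault() ?? string.Empty,
                (string)t.Descendants("div").FirstOrDefault() ?? string.Empty))
            .ToList();
    }
}
```

The `Key` of each pair could populate the list of titles, and the `Value` could be shown when the user selects that title. Real-world HTML with unclosed tags or unquoted attributes would not parse this way.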

Answer 2

Although parsing HTML with Regex is not best practice, it is an option:

Patterns:

<p>.*</p>
<div>.*</div>

Example:

    // Requires: using System.IO; using System.Net; using System.Text.RegularExpressions;
    WebClient wc = new WebClient();
    System.IO.Stream stream = wc.OpenRead(strFilePath);
    StreamReader sReader = new StreamReader(stream);
    string strTables = sReader.ReadToEnd();
    if (!string.IsNullOrEmpty(strTables))
    {
        // I'm not a regex master, but I'm sure there is a way to get the title without the <p> elements.
        MatchCollection pMatches = Regex.Matches(strTables, "<p>.*</p>");
        foreach (Match pMatch in pMatches)
        {
            // pMatch.Value is the matched text, e.g. "<p>Title2</p>"; strip the tags to keep the title.
            string title = pMatch.Value.Replace("<p>", string.Empty).Replace("</p>", string.Empty);
        }
    }
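
One way to get the title without the <p> tags is a capture group. The sketch below is an assumption-laden illustration rather than a definitive implementation: it assumes every title <p> is followed by its content <div>, as in the question's sample, and the names `TableRegexSketch`, `PairPattern`, and `ReadTables` are hypothetical. It uses lazy quantifiers (`.*?`) so that a match stops at the first closing tag.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TableRegexSketch
{
    // Named groups capture the text between the tags.
    // Singleline lets '.' match newlines, so the pattern can span from a <p> to the next <div>.
    private static readonly Regex PairPattern = new Regex(
        @"<p>(?<title>.*?)</p>.*?<div>(?<content>.*?)</div>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Maps each title to its content; a repeated title overwrites the earlier entry.
    public static Dictionary<string, string> ReadTables(string html)
    {
        var result = new Dictionary<string, string>();
        foreach (Match m in PairPattern.Matches(html))
        {
            string title = m.Groups["title"].Value.Trim();
            string content = m.Groups["content"].Value.Trim();
            result[title] = content;
        }
        return result;
    }
}
```

The dictionary keys could fill the list of titles, and a lookup by the selected title would return the content to display. Tags carrying attributes (for example `<p class="x">`) would not match this pattern.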

Parsing an HTML string in C# without an HTML parser such as AgilityPack
