Closing 'option' tags with HTML Agility Pack while preserving the inner text

Asked: 2014-05-21 12:32:39

Tags: c# html html-agility-pack

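I want HTML Agility Pack to close all open 'option' tags while still preserving the inner text. My goal is to capture the following:

  • The option value (i.e. value = 1)
  • The option inner text (i.e. inner text = Alberta)

The C# code I have written depends on the option closing tag appearing after the inner text.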

Here is the original HTML:

<select id="Province" >
<option value=""> -- Select province --</option>
    <option value="1">Alberta
    <option value="2">British Columbia
    <option value="3">Manitoba
    <option value="4">New Brunswick
    <option value="5">Newfoundland
    <option value="6">Northwest Territories
    <option value="7">Nova Scotia
    <option value="8">Nunavut
    <option value="9">Ontario
    <option value="10">Prince Edward Island
    <option value="11">Quebec
    <option value="12">Saskatchewan
    <option value="13">Yukon
</select>

The HTML as formatted by HTML Agility Pack:

<select id="Province" >
<option value=""> -- Select province --</option>
    <option value="1"></option>Alberta
    <option value="2"></option>British Columbia
    <option value="3"></option>Manitoba
    <option value="4"></option>New Brunswick
    <option value="5"></option>Newfoundland
    <option value="6"></option>Northwest Territories
    <option value="7"></option>Nova Scotia
    <option value="8"></option>Nunavut
    <option value="9"></option>Ontario
    <option value="10"></option>Prince Edward Island
    <option value="11"></option>Quebec
    <option value="12"></option>Saskatchewan
    <option value="13"></option>Yukon
</select>

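As you can see, the closing tag is inserted before the inner text, so the text ends up outside the option element. Is it possible to add the closing tag after the inner text?

For example: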

<option value="1">Alberta</option>

Here is the C# code used to parse the HTML:

static void LoadProvinces()
{
    // Read the HTML file into the string 'rawProvinces'
    string rawProvinces = System.IO.File.ReadAllText("ProvincesCheckout.htm");

    // This is meant to tell HTML Agility Pack to close all open option tags
    HtmlNode.ElementsFlags["option"] = HtmlElementFlag.Closed;

    // Load the rawProvinces string into HTML Agility Pack
    HtmlDocument htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(rawProvinces);

    // Serialize the parsed HTML to 'parsedHtml' and save it to 'hap.htm'
    string parsedHtml = htmlDoc.DocumentNode.OuterHtml;
    using (System.IO.StreamWriter file = new System.IO.StreamWriter("hap.htm"))
    {
        file.WriteLine(parsedHtml);
    }
}

1 Answer:

Answer 0 (score: 0)

For some reason this does not work, although it should. However, you can also do it yourself using the String class and its methods:

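        // Get all option elements (SelectNodes returns null when nothing matches)
        HtmlNodeCollection nodes = htmlDoc.DocumentNode.SelectNodes("//option");
        if (nodes != null)
        {
            foreach (HtmlNode node in nodes)
            {
                // The text to wrap in </option> is the text node that follows the option
                HtmlNode next = node.NextSibling;
                if (next == null || next.NodeType != HtmlNodeType.Text) continue;
                string text = next.OuterHtml.Trim();
                // A whitespace-only sibling means there is no text to wrap here
                if (text.Length == 0) continue;

                // Position in the raw HTML just after that text
                // (assumes each option text occurs only once in the document)
                int start = rawProvinces.IndexOf(text, System.StringComparison.Ordinal);
                if (start < 0) continue;
                int nextPosition = start + text.Length;
                // Add </option> unless one is already there
                if (rawProvinces.IndexOf("</option", nextPosition, System.StringComparison.Ordinal) != nextPosition)
                {
                    rawProvinces = rawProvinces.Insert(nextPosition, "</option>");
                }
            }
        }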
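
If you would rather not depend on how the parser places the sibling text nodes, the same idea can be applied to the raw string before it is parsed at all. The following is only a minimal sketch using a regular expression instead of HTML Agility Pack. `OptionTagFixer`, `CloseOptionTags` and `ReadOptions` are hypothetical names made up for this example, and it assumes that option text contains no nested tags, that every unclosed option is followed by another tag, and that the value attribute is double-quoted.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of HTML Agility Pack.
static class OptionTagFixer
{
    // An opening <option> tag, its text, trailing whitespace, then any tag
    // other than </option>. Options that are already closed do not match.
    private static readonly Regex UnclosedOption = new Regex(
        @"(<option\b[^>]*>)([^<]*?)(\s*)(?=<)(?!</option)",
        RegexOptions.IgnoreCase);

    // A closed option with a double-quoted value attribute.
    private static readonly Regex ClosedOption = new Regex(
        @"<option\b[^>]*\bvalue=""([^""]*)""[^>]*>([^<]*)</option",
        RegexOptions.IgnoreCase);

    // Inserts </option> right after the text of every unclosed option,
    // keeping the original trailing whitespace after the new closing tag.
    public static string CloseOptionTags(string html)
    {
        return UnclosedOption.Replace(html, "$1$2</option>$3");
    }

    // Returns (value, inner text) pairs for every closed option.
    public static List<KeyValuePair<string, string>> ReadOptions(string html)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in ClosedOption.Matches(html))
        {
            result.Add(new KeyValuePair<string, string>(
                m.Groups[1].Value, m.Groups[2].Value.Trim()));
        }
        return result;
    }
}

static class Program
{
    static void Main()
    {
        // Shortened version of the HTML from the question.
        string raw = "<select id=\"Province\">\n" +
                     "<option value=\"\"> -- Select province --</option>\n" +
                     "    <option value=\"1\">Alberta\n" +
                     "    <option value=\"2\">British Columbia\n" +
                     "</select>";

        string closed = OptionTagFixer.CloseOptionTags(raw);
        Console.WriteLine(closed);

        foreach (var pair in OptionTagFixer.ReadOptions(closed))
        {
            Console.WriteLine("value=" + pair.Key + ", text=" + pair.Value);
        }
    }
}
```

The string returned by `CloseOptionTags` can then be passed to `htmlDoc.LoadHtml` in place of `rawProvinces`. A regular expression is not a general HTML parser, so this is only reasonable for simple, predictable markup like the select list in the question.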