How to avoid duplicated data when extracting from an HTML source with HtmlAgilityPack

Date: 2015-09-19 15:21:13

Tags: c# html vb.net extract html-agility-pack

I'm using HtmlAgilityPack to extract data from an HTML source. Here is an example of the HTML:

<div class="enum-container">
    <div class="enum">
        <span class="field-key">MD5</span> a4188cf2b9189f82b855350233a307eb
    </div>
    <div class="enum">
        <span class="field-key">SHA1</span> c3eedd67a14810b8c639eb77ed2731e574245b2a
    </div>
    <div class="enum">
        <span class="field-key">File size</span>
        3.8 KB ( 3854 bytes )
    </div>
</div>

I'm using this code:

    Dim Table2 As New DataTable()
    Table2.Columns.Add("Value1", GetType(String))
    Table2.Columns.Add("Value2", GetType(String))

    For Each row1 As HtmlNode In doc.DocumentNode.SelectNodes("//div[@id='file-details']//div[@class='enum-container']//div[@class='enum']")
        Dim MyValue1 As HtmlNode = row1.SelectSingleNode("//span[@class='field-key']")
        Dim MyValue2 As String = row1.InnerText
        Table2.Rows.Add(MyValue1.InnerText, MyValue2)
    Next

    DataGridView3.DataSource = Table2

The result looks like this:

http://i.stack.imgur.com/vPriY.png

As you can see, the first column gets the same value (MD5) in every row.

What I want is this:

http://i.stack.imgur.com/jlsk5.png

Thanks.

1 answer:

Answer 0 (score: 0)

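An XPath that starts with "//" is evaluated against the whole document, even when you call it on row1, so SelectSingleNode returns the first matching span in the document every time. Remove the "//" from that XPath so that it selects the span that is a direct child of the current node.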

C#

DataTable fileDetailsTable = new DataTable();
fileDetailsTable.Columns.Add("Key", typeof(string));
fileDetailsTable.Columns.Add("Value", typeof(string));

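// SelectNodes returns null (not an empty collection) when nothing matches.
HtmlNodeCollection enumNodes = document.DocumentNode.SelectNodes("//div[@id='file-details']//div[@class='enum-container']//div[@class='enum']");
if (enumNodes != null)
{
    foreach (HtmlNode enumNode in enumNodes)
    {
        // Select the child span of this enum node (no leading "//", so the search stays inside enumNode).
        HtmlNode fieldKeyNode = enumNode.SelectSingleNode("span[@class='field-key']");

        if (fieldKeyNode != null)
        {
            // Grab the key.
            string fieldKey = fieldKeyNode.InnerText.Trim();

            // Grab the value: the text node that follows the span. Guard against a missing sibling.
            HtmlNode valueNode = fieldKeyNode.NextSibling;
            string fieldValue = valueNode != null ? valueNode.InnerText.Trim() : string.Empty;

            fileDetailsTable.Rows.Add(fieldKey, fieldValue);
        }
    }
}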

VB.NET

Dim fileDetailsTable As New DataTable()
fileDetailsTable.Columns.Add("Key", GetType(String))
fileDetailsTable.Columns.Add("Value", GetType(String))

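'SelectNodes returns Nothing (not an empty collection) when nothing matches.
Dim enumNodes As HtmlNodeCollection = document.DocumentNode.SelectNodes("//div[@id='file-details']//div[@class='enum-container']//div[@class='enum']")
If enumNodes IsNot Nothing Then
    For Each enumNode As HtmlNode In enumNodes
        'Select the child span of this enum node (no leading "//", so the search stays inside enumNode).
        Dim fieldKeyNode As HtmlNode = enumNode.SelectSingleNode("span[@class='field-key']")

        If fieldKeyNode IsNot Nothing Then
            'Grab the key.
            Dim fieldKey As String = fieldKeyNode.InnerText.Trim()

            'Grab the value: the text node that follows the span. Guard against a missing sibling.
            Dim valueNode As HtmlNode = fieldKeyNode.NextSibling
            Dim fieldValue As String = If(valueNode IsNot Nothing, valueNode.InnerText.Trim(), String.Empty)

            fileDetailsTable.Rows.Add(fieldKey, fieldValue)
        End If
    Next
End If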
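
If you want to see the scoping difference in isolation, here is a minimal console sketch in C#. It is only an illustration: the class name XPathScopeDemo is made up, and the markup is the sample from the question wrapped in a div with id "file-details", which I am assuming from the XPath in the question. It runs the same span lookup in three forms on each enum node: absolute ("//"), relative descendant (".//"), and direct child (no prefix).

```csharp
using System;
using HtmlAgilityPack;

class XPathScopeDemo
{
    static void Main()
    {
        // Sample markup from the question; the outer "file-details" div is an assumption.
        const string html = @"<div id='file-details'><div class='enum-container'>
    <div class='enum'><span class='field-key'>MD5</span> a4188cf2b9189f82b855350233a307eb</div>
    <div class='enum'><span class='field-key'>SHA1</span> c3eedd67a14810b8c639eb77ed2731e574245b2a</div>
    <div class='enum'><span class='field-key'>File size</span> 3.8 KB ( 3854 bytes )</div>
</div></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNodeCollection enumNodes = doc.DocumentNode.SelectNodes("//div[@class='enum']");
        if (enumNodes == null) return; // nothing matched

        foreach (HtmlNode enumNode in enumNodes)
        {
            // "//..." is absolute: it searches the whole document, whatever node it is called on.
            HtmlNode absolute = enumNode.SelectSingleNode("//span[@class='field-key']");
            // ".//..." searches only the descendants of enumNode.
            HtmlNode descendant = enumNode.SelectSingleNode(".//span[@class='field-key']");
            // No prefix searches only the direct children of enumNode.
            HtmlNode child = enumNode.SelectSingleNode("span[@class='field-key']");

            Console.WriteLine("{0} | {1} | {2}", absolute.InnerText, descendant.InnerText, child.InnerText);
        }
    }
}
```

The first column should repeat the key of the first span on every line, which is the duplication shown in the question, while the other two columns follow the current node. Use ".//" instead of the bare child form if the span might be nested deeper inside the enum div.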