Correlating adjacent element values with HTML Agility Pack

Date: 2011-07-26 20:36:51

Tags: c# web-scraping html-agility-pack

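I'm trying to get the h2 element that follows an HTML comment containing the text "Results", together with the table element with the class name "stockfeed" that comes after it.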

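I've figured out how to extract the data I need (see below), but I don't know how to pull the two elements out together as a pair. I know I could iterate both collections with the same index to correlate the values, but that seems error-prone, because one of my h2 elements might not have an adjacent table element (rare, but possible).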

Sample HTML markup:

<h1>
    Results Page</h1>
<h2>
    Updated Daily @ 10:00 AM</h2>
<div class='someClass1'>
    <!-- Results -->
    <div class='something'>
    </div>
    <h2 style='display: inline;'>
        <a href='http://www.somesite.com'>Table 1</a>
    </h2>
    <div class='clr'>
    </div>
    <div class='resultBlock'>
        <table class='stockfeed'>
            <thead>
                <tr>
                    <th>
                        Part
                    </th>
                    <th>
                        Description
                    </th>
                    <th>
                        Stock
                    </th>
                    <th>
                        Price
                    </th>
                </tr>
            </thead>
            <tbody>
                <tr class='row1' valign='top'>
                    <td>
                        A 1234567890
                    </td>
                    <td class='description'>
                        Part Description
                    </td>
                    <td>
                        1,000,000
                    </td>
                    <td>
                        $1.99
                    </td>
                </tr>
                <tr class='row1' valign='top'>
                    <td>
                        B 1234567890
                    </td>
                    <td class='description'>
                        Part Description
                    </td>
                    <td>
                        1,000,000
                    </td>
                    <td>
                        $1.99
                    </td>
                </tr>
                <tr class='row1' valign='top'>
                    <td>
                        C 1234567890
                    </td>
                    <td class='description'>
                        Part Description
                    </td>
                    <td>
                        1,000,000
                    </td>
                    <td>
                        $1.99
                    </td>
                </tr>
            </tbody>
        </table>
    </div>
    <!-- Results -->
    <div class='something'>
    </div>
    <h2 style='display: inline;'>
        <a href='http://www.somesite.com'>Table 2</a>
    </h2>
    <div class='clr'>
    </div>
    <div class='resultBlock'>
        <table class='stockfeed'>
            <thead>
                <tr>
                    <th>
                        Part
                    </th>
                    <th>
                        Description
                    </th>
                    <th>
                        Stock
                    </th>
                    <th>
                        Price
                    </th>
                </tr>
            </thead>
            <tbody>
                <tr class='row1' valign='top'>
                    <td>
                        A 1234567890
                    </td>
                    <td class='description'>
                        Part Description
                    </td>
                    <td>
                        1,000,000
                    </td>
                    <td>
                        $1.99
                    </td>
                </tr>
                <tr class='row1' valign='top'>
                    <td>
                        B 1234567890
                    </td>
                    <td class='description'>
                        Part Description
                    </td>
                    <td>
                        1,000,000
                    </td>
                    <td>
                        $1.99
                    </td>
                </tr>
                <tr class='row1' valign='top'>
                    <td>
                        C 1234567890
                    </td>
                    <td class='description'>
                        Part Description
                    </td>
                    <td>
                        1,000,000
                    </td>
                    <td>
                        $1.99
                    </td>
                </tr>
            </tbody>
        </table>
    </div>
</div>

Current code, which parses the values separately:

    HtmlNodeCollection titles = doc.DocumentNode.SelectNodes("//comment()[contains(.,'Results')]/following-sibling::h2");
    for (int tit = 0; tit < titles.Count; ++tit)
    {
        // Do Something
    }

    HtmlNodeCollection tables = doc.DocumentNode.SelectNodes("//table[@class='stockfeed']");
    for (int tab = 0; tab < tables.Count; ++tab)
    {
        // Do Something
    }

1 Answer:

Answer 0 (score: 1)

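So, if I'm reading this correctly, you're trying to get the corresponding table for each result.

You can use an approach similar to the one you used to get the h2 elements: starting from each h2, select the table element that follows it with an XPath expression relative to that h2.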

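    var query = doc.DocumentNode
        .SelectNodes("//comment()[contains(.,'Results')]/following-sibling::h2");

    // SelectNodes returns null (not an empty collection) when nothing matches.
    if (query != null)
    {
        foreach (HtmlNode h2 in query)
        {
            // First stockfeed table that is a child of a later sibling of this h2;
            // null if there is none.
            var table = h2.SelectSingleNode("following-sibling::*/table[@class='stockfeed']");
            // do stuff with h2 and table
        }
    }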
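
One caveat about the relative XPath above: the `following-sibling` axis is not limited to the current section, so an h2 that has no table of its own would pick up the table from the next "Results" section. The following is a minimal sketch of one way to guard against that case, assuming every section starts with a "Results" comment and that the comment, the h2, and the `resultBlock` div are siblings, as in the sample markup. The `ResultPairing` class and its `Pair` method are hypothetical names made up for this illustration; they are not part of HTML Agility Pack.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ResultPairing
{
    // Hypothetical helper: returns one (heading, table) pair per "Results"
    // comment. Table is null when the section contains no stockfeed table
    // before the next "Results" comment.
    public static List<(HtmlNode Heading, HtmlNode Table)> Pair(HtmlDocument doc)
    {
        var pairs = new List<(HtmlNode Heading, HtmlNode Table)>();

        var markers = doc.DocumentNode
            .SelectNodes("//comment()[contains(.,'Results')]");
        if (markers == null)
            return pairs; // SelectNodes returns null when nothing matches

        foreach (HtmlNode marker in markers)
        {
            HtmlNode h2 = null, table = null;

            // Walk the siblings after the marker, stopping at the next marker
            // so that a table from a later section is never picked up.
            for (HtmlNode n = marker.NextSibling; n != null; n = n.NextSibling)
            {
                if (IsMarker(n))
                    break;
                if (n.NodeType != HtmlNodeType.Element)
                    continue;

                if (h2 == null && n.Name == "h2")
                    h2 = n;
                else if (h2 != null && table == null)
                    table = n.SelectSingleNode(
                        "descendant-or-self::table[@class='stockfeed']");
            }

            if (h2 != null)
                pairs.Add((h2, table));
        }

        return pairs;
    }

    // True for a comment node whose text contains "Results".
    private static bool IsMarker(HtmlNode n)
    {
        var comment = n as HtmlCommentNode;
        return comment != null
            && comment.Comment != null
            && comment.Comment.Contains("Results");
    }
}
```

It could be used roughly like this, where `html` is assumed to hold the page source:

```csharp
var doc = new HtmlDocument();
doc.LoadHtml(html);

foreach (var (heading, table) in ResultPairing.Pair(doc))
{
    string title = heading.InnerText.Trim();
    Console.WriteLine(table == null
        ? title + ": no table in this section"
        : title + ": table found");
}
```

Walking the siblings by hand is a little more verbose than a single XPath expression, but it makes the section boundary explicit, which is the case the question was worried about.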