HtmlAgilityPack - Unable to get the InnerText of nested elements in a table

Time: 2017-05-14 19:04:15

Tags: c# asp.net xpath html-agility-pack

I am using HtmlAgilityPack to get data from a table with this structure:

<table>
    <tbody class="border_tbody">
        <tr style="height:55px;">
            <th class="heading_one" colspan="2">Heading 1</th>
            <th class="heading_two">Heading 2</th>
            <th class="heading_three">heading 3</th>
        </tr>
        <tr>
            <td class="ro">
                <a href="go/a/a.com" target="_blank">
                    <img src="images/vendors_images/vendors_ficon/a.png" height="17px" width="17px" alt="a" title="a">
                </a>
            </td>
            <td td="" class="l no_border">
                <a href="go/a/a.com" target="_blank">
                    Vendor name
                </a>
            </td>
            <td class="l lo" style="text-align: center;"><a href="go/a/a.com" target="_blank">15%</a></td>
            <td class="l bonus_amount">
                <a href="go/a/a.com" class="apply_text" target="_blank">
                    <div class="coupon_div">
                        <span class="coupon_span">
                            <span class="card_secondary_text">$10</span>
                        </span>
                    </div>
                </a>
            </td>
        </tr>

        <tr>
            <td class="ro">
                <a href="go/a/a.com" target="_blank">
                    <img src="images/vendors_images/vendors_ficon/a.png" height="17px" width="17px" alt="a" title="a">
                </a>
            </td>
            <td td="" class="l no_border">
                <a href="go/a/a.com" target="_blank">
                    Vender name
                </a>
            </td>
            <td class="l lo" style="text-align: center;"><a href="go/a/a.com" target="_blank">6%</a></td>
            <td class="l" style="text-align: center;"></td>
        </tr>

        <tr>
            <td class="ro">
                <a href="go/a/a.com" target="_blank">
                    <img src="images/vendors_images/vendors_ficon/a.png" height="17px" width="17px" alt="a a" title="a a">
                </a>
            </td>
            <td td="" class="l no_border">
                <a href="go/a/a.com" target="_blank">
                    Vendor name
                </a>
            </td>
            <td class="l lo" style="text-align: center;"><a href="go/a/a.com" target="_blank">5%</a></td>
            <td class="l bonus_amount">
                <a href="apply/a" class="apply_text" target="_blank">
                    <div class="coupon_div">
                        <span class="coupon_span">
                            <span class="card_secondary_text">$50</span> - Apply
                        </span>
                    </div>
                </a>
            </td>
        </tr>

    </tbody>
</table>

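I can get the inner text from the second td [2] (the vendor name) and the third td [3] (the percentage). Where I run into trouble is getting the inner text of the fourth td [4], because its nested elements vary depending on whether they contain any text.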

The table above shows the three variations, and this is the code I have so far.

foreach (var table in webDoc.DocumentNode.SelectNodes("//table/tbody"))
{
    // skip the first tr since they are headings.
    foreach (var tr in table.SelectNodes("tr[position() > 1]"))
    {
        if (tr != null)
        {
            var vendorName = tr.SelectSingleNode("td[2]/a").InnerText.Trim();
            var rateOne = tr.SelectSingleNode("td[3]/a").InnerText.Trim();

            // Unable to get the inner text at this point
            // var rateTwo = tr.SelectSingleNode("td[4]/a/div/span/span").InnerText.Trim();

        }
    }
}
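
A likely reason the commented-out line fails: in the second data row the fourth td is empty, so `SelectSingleNode` returns null and reading `.InnerText` from it throws a NullReferenceException. Below is a minimal sketch of a null-safe lookup, assuming C# 6 or later for the `?.` operator. The class `CellText` and the helper `GetInnerTextOrNull` are hypothetical names, not part of HtmlAgilityPack.

```csharp
using HtmlAgilityPack;

public static class CellText
{
    // Hypothetical helper: returns the trimmed inner text of the first node
    // matching xpath (relative to context), or null when nothing matches.
    public static string GetInnerTextOrNull(HtmlNode context, string xpath)
    {
        // SelectSingleNode returns null when the XPath matches nothing,
        // so ?. short-circuits instead of throwing.
        return context.SelectSingleNode(xpath)?.InnerText.Trim();
    }
}
```

Inside the loop this could be used as `var rateTwo = CellText.GetInnerTextOrNull(tr, "td[4]/a/div/span/span") ?? "none";`.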

1 Answer:

Answer 0: (score: 0)

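Going by the sample HTML in the question, the span that holds the amount in the fourth cell always has the same class name (card_secondary_text). If that is not the case, you can iterate over all descendant nodes and look for text that starts with a dollar sign: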

// Requires: using System; using System.Linq; using HtmlAgilityPack;
// html is assumed to be a string holding the page markup.
HtmlDocument webDoc = new HtmlDocument();
webDoc.LoadHtml(html);
// Note: SelectNodes returns null (not an empty collection) when nothing matches.
foreach (var table in webDoc.DocumentNode.SelectNodes("//table/tbody"))
{
    // skip the first tr, which holds the headings
    foreach (var tr in table.SelectNodes("tr[position() > 1]"))
    {
        // [1] class name in the HTML sample is always the same
        var rateTwo = tr.SelectSingleNode("td[4]//span[@class='card_secondary_text']");
        Console.WriteLine("Method 1 Coupon: {0}",
            rateTwo != null ? rateTwo.InnerText : "none"
        );

        // [2] brute force - all descendants
        var rateTwo2 = tr.SelectSingleNode("td[4]").Descendants();
        if (rateTwo2.Any())
        {
            foreach (var child in rateTwo2)
            {
                // elements only, so the "$10" text node is not printed a second time
                if (child.NodeType == HtmlNodeType.Element && child.InnerText.StartsWith("$"))
                    Console.WriteLine("Method 2 Coupon: {0}", child.InnerText);
            }
        }
        else
        {
            Console.WriteLine("Method 2: No coupon");
        }
    }
}

Output:

Method 1 Coupon: $10
Method 2 Coupon: $10
Method 1 Coupon: none
Method 2: No coupon
Method 1 Coupon: $50
Method 2 Coupon: $50
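
If neither the class name nor the nesting can be relied on, another option (not part of the answer above) is to read the whole cell's InnerText and pull the amount out with a regular expression. This is only a sketch under stated assumptions: the class `CouponParser` and the helper `ExtractDollarAmount` are hypothetical names, and the pattern assumes amounts look like `$10` or `$10.50`, with no thousands separators.

```csharp
using System.Text.RegularExpressions;
using HtmlAgilityPack;

public static class CouponParser
{
    // Hypothetical helper: returns the first dollar amount found anywhere in
    // the cell's text (e.g. "$50" from "$50 - Apply"), or null if there is none.
    public static string ExtractDollarAmount(HtmlNode cell)
    {
        if (cell == null) return null;
        // InnerText concatenates the text of all descendants, whatever the nesting.
        var match = Regex.Match(cell.InnerText, @"\$\d+(?:\.\d+)?");
        return match.Success ? match.Value : null;
    }
}
```

Inside the loop this could be called as `var rateTwo = CouponParser.ExtractDollarAmount(tr.SelectSingleNode("td[4]")) ?? "none";`. The trade-off is that it ignores the markup entirely, so it would also pick up a dollar amount that appears in unrelated text within the same cell.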