C# HtmlAgilityPack: startIndex cannot be larger than length of string

Time: 2014-02-21 19:19:15

Tags: c# html-parsing html-agility-pack

I am trying to do something like this:

var document = htmlWeb.Load(searchUrl);
var hotels = document.DocumentNode.Descendants("div")
             .Where(x => x.Attributes.Contains("class") &&
             x.Attributes["class"].Value.Contains("listing-content"));

int count = 1;
foreach (var hotel in hotels)
{
    HtmlDocument htmlDoc = new HtmlDocument();
    htmlDoc.OptionFixNestedTags = true;
    htmlDoc.Load(hotel.InnerText);      
    if (htmlDoc.DocumentNode != null)
    {
        var anchors = htmlDoc.DocumentNode.Descendants("div")
                    .Where(x => x.Attributes.Contains("class") &&
                    x.Attributes["class"].Value.Contains("srp-business-name")); // Error Occurring in here //
        foreach (var anchor in anchors)
        {
            Console.WriteLine(anchor.InnerHtml);
        }
    }
}

The result I get looks like this:

<a href="http://ad.doubleclick.net/clk;234504055;58257942;j?http://www.marriott.com/NYCMQ" class="url  mip-link" data-analytics="{&quot;click_id&quot;:1601,&quot;rank&quot;:1,&quot;act&quot;:1,&quot;FL&quot;:&quot;list&quot;,&quot;target&quot;:&quot;name&quot;,&quot;supermedia&quot;:true}" rel="nofollow">New York Marriott Marquis</a>
<a href="http://www.yellowpages.com/new-york-ny/mip/new-york-marriott-marquis-468349733?lid=1000372156461" class="no-tracks hidden url" data-analytics="{&quot;click_id&quot;:1601,&quot;rank&quot;:1,&quot;act&quot;:1,&quot;FL&quot;:&quot;list&quot;,&quot;target&quot;:&quot;name&quot;,&quot;supermedia&quot;:true}" rel="nofollow"></a>
<span class="external-link">
<img height="15" src="/images/sprites/search/icon-link-external.png" width="16">
</span>

<a href="http://www.yellowpages.com/new-york-ny/mip/courtyard-by-marriott-new-york-manhattan-times-square-south-2198956?lid=178101818" class="url redbold mip-link" data-analytics="{&quot;click_id&quot;:1600,&quot;rank&quot;:2,&quot;act&quot;:1,&quot;FL&quot;:&quot;list&quot;,&quot;target&quot;:&quot;name&quot;,&quot;supermedia&quot;:&quot;&quot;}">Courtyard by Marriott New York Manhattan/Times Square South</a>

And so on.

Now I want the innerHtml of the class="url redbold mip-link" anchor tag. So I am doing this:

var document = htmlWeb.Load(searchUrl);
var hotels = document.DocumentNode.Descendants("div")
             .Where(x => x.Attributes.Contains("class") &&
             x.Attributes["class"].Value.Contains("listing-content"));

int count = 1;
foreach (var hotel in hotels)
{
    HtmlDocument htmlDoc = new HtmlDocument();
    htmlDoc.OptionFixNestedTags = true;
    htmlDoc.Load(hotel.InnerText);
    if (htmlDoc.DocumentNode != null)
    {
        var anchors = htmlDoc.DocumentNode.Descendants("div")
                    .Where(x => x.Attributes.Contains("class") &&
                    x.Attributes["class"].Value.Contains("srp-business-name"));
        foreach (var anchor in anchors)
        {
            htmlDoc.LoadHtml(anchor.InnerHtml);
            var hoteltags = htmlDoc.DocumentNode.SelectNodes("//a");
            foreach (var tag in hoteltags)
            {
                if (!string.IsNullOrEmpty(tag.InnerHtml) || !string.IsNullOrWhiteSpace(tag.InnerHtml))
                {
                    Console.WriteLine(tag.InnerHtml);
                }
            }
        }
    }
}

I get the first result correctly: New York Marriott Marquis. But an error occurs on the second result: startIndex cannot be larger than length of string. What am I doing wrong?

1 answer:

Answer 0: (score: 1)

You are using the same DOM object for all operations:

foreach (var hotel in hotels)
{
    HtmlDocument htmlDoc = new HtmlDocument();

After that, you use the same object to load the anchor tags, even though the nodes in anchors still belong to it:

foreach (var anchor in anchors)
{
    htmlDoc.LoadHtml(anchor.InnerHtml);

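Just use a different document in the second loop and it should work as expected.

foreach (var anchor in anchors)
{
    // Separate document, so htmlDoc (which the anchors come from) is not overwritten
    var htmlDocAnchor = new HtmlDocument();
    htmlDocAnchor.LoadHtml(anchor.InnerHtml);
    var hoteltags = htmlDocAnchor.DocumentNode.SelectNodes("//a");
    // ... the rest of the loop is unchanged
}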
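
As a possible alternative, the intermediate documents can be avoided altogether by querying each node's descendants directly. The following is only a minimal sketch: the inline HTML string is a hypothetical stand-in for the page returned by htmlWeb.Load(searchUrl), and the markup shape is assumed from the output shown in the question.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Hypothetical sample markup; the real page comes from htmlWeb.Load(searchUrl).
        const string html = @"
<div class='listing-content'>
  <div class='srp-business-name'>
    <a class='url mip-link' href='#'>New York Marriott Marquis</a>
    <a class='no-tracks hidden url' href='#'></a>
  </div>
</div>
<div class='listing-content'>
  <div class='srp-business-name'>
    <a class='url redbold mip-link' href='#'>Courtyard by Marriott</a>
  </div>
</div>";

        var document = new HtmlDocument();
        document.LoadHtml(html);

        // Walk listing -> business-name div -> anchors on the one parsed tree,
        // so no node is ever re-parsed into another HtmlDocument.
        var names = document.DocumentNode.Descendants("div")
            .Where(d => d.GetAttributeValue("class", "").Contains("listing-content"))
            .SelectMany(hotel => hotel.Descendants("div"))
            .Where(d => d.GetAttributeValue("class", "").Contains("srp-business-name"))
            .SelectMany(d => d.Descendants("a"))
            .Select(a => a.InnerHtml.Trim())
            .Where(text => text.Length > 0); // skip the empty hidden anchors

        foreach (var name in names)
        {
            Console.WriteLine(name);
        }
    }
}
```

Because every node stays attached to the single document it was parsed from, nothing is reloaded while the query is being enumerated. GetAttributeValue with a default value also replaces the separate Attributes.Contains("class") check.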