XPath to retrieve <a> href, text, and <span>

Time: 2017-04-20 15:31:02

Tags: c# html xpath html-agility-pack

I'm currently crawling some websites and retrieving information from them to store in a database for later use. I'm using HtmlAgilityPack and have done this successfully for a few sites, but for some reason this one is giving me issues. I'm still fairly new to XPath syntax, so I'm probably making a mistake there.

Here's what the markup I'm trying to retrieve from the site looks like:

<form ... id="_subcat_ids_">
  <input ....>
  <ul ...>
    <li ....>
      <input .....>
      <a class="facet-seleection multiselect-facets "
      .... href="INeedThisHref#1">
      Text I Need                          //need to retrieve this text between the <a></a> tags
      <span class="subtle-note">(2)</span> //I need that number from inside the span
      </a>
    </li>
    <li ....>
      <input .....>
      <a class="facet-seleection multiselect-facets "
      .... href="INeedThisHref#2">
      Text I Need #2                        //need to retrieve this text between the <a></a> tags
      <span class="subtle-note">(6)</span> //I need that number from inside the span
      </a>
    </li>

Each <li> represents an item on the page, but I'm only interested in each <a></a>. I want to retrieve the href value of the <a>, then the text between its opening and closing tags, then the text inside the <span>. I left out the attributes of the other tags because they don't help uniquely identify each item; the class on the <a> is the only thing the items share, and they are all inside the form with id="_subcat_ids_".

Here's my code:

try
{
  string fullUrl = "...";
  HtmlWeb web = new HtmlWeb();
  ServicePointManager.SecurityProtocol = SecurityProtocolType.Ssl3 | SecurityProtocolType.Tls | SecurityProtocolType.Tls11 | SecurityProtocolType.Tls12;
  HtmlDocument html = web.Load(fullUrl);

  foreach (HtmlNode node in html.DocumentNode.SelectNodes("//form[@id='_subcat_ids_']")) //this gets me into the form 
  {
    foreach (HtmlNode node2 in node.SelectNodes(".//a[@class='facet-selection  multiselect-facets ']")) //this should get me into the <a> tags, but it throws 'object reference not set to an instance of an object'
    {
      //get the href
      string tempHref = node2.GetAttributeValue("href", string.Empty);
      //get the text between <a>
      string tempCat = node2.InnerText.Trim();
      //get the text between <span>
      string tempNum = node2.SelectSingleNode(".//span[@class='subtle-note']").InnerText.Trim();
    }
  }
}
catch (Exception ex)
{
  Console.Write("\nError: " + ex.ToString());
}

The first foreach loop doesn't error, but the second one throws 'object reference not set to an instance of an object' on the line where the second foreach loop starts. As I mentioned, I'm still new to this syntax. I've used this approach on another website with great success, but I'm having trouble with this site. Any tips would be appreciated.

1 answer:

Answer 0 (score: 0)

Alright, I figured it out. Here's the code:

foreach (HtmlNode node in html.DocumentNode.SelectNodes("//form[@id='_subcat_ids_']"))
{
  //get the categories, store in list
  foreach (HtmlNode node2 in node.SelectNodes("..//a[@class='facet-selection  multiselect-facets ']//text()[normalize-space() and not(ancestor::span)]"))
  {
    string tempCat = node2.InnerText.Trim();
    categoryList.Add(tempCat);
    Console.Write("\nCategory: " + tempCat);           
  }
  foreach (HtmlNode node3 in node.SelectNodes("..//a[@class='facet-selection  multiselect-facets ']"))
  {
    //get href for each category, store in list
    string tempHref = node3.GetAttributeValue("href", string.Empty);
    LinkCatList.Add(tempHref);
    Console.Write("\nhref: " + tempHref);
    //get the number of items from categories, store in list
    string tempNum = node3.SelectSingleNode(".//span[@class='subtle-note']").InnerText.Trim();
    //strip the surrounding parentheses, e.g. "(2)" -> "2"
    tempNum = tempNum.Replace("(", "").Replace(")", "");
    Console.Write("\nNumber of items: " + tempNum + "\n\n");
  }
}

Works like a charm.
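
A note on why the original loop probably threw, with a hedged sketch. In HtmlAgilityPack, SelectNodes returns null rather than an empty collection when nothing matches, so running foreach over the result throws 'object reference not set to an instance of an object'. Older HtmlAgilityPack releases also treat <form> as an empty element by default, so the <a> tags end up outside the form node. That would explain why "..//" (up to the parent, then down) finds them while ".//" does not. The sketch below assumes that is the cause here. FacetScraper, Facet, and the sample markup are hypothetical names for illustration, and matching on the single class token multiselect-facets is a substitution for the exact class string used above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class FacetScraper
{
    // Hypothetical holder for one facet link; not part of the original code.
    public sealed class Facet
    {
        public string Href { get; set; }
        public string Name { get; set; }
        public string Count { get; set; }
    }

    public static List<Facet> Parse(string html)
    {
        // Assumption: the installed HtmlAgilityPack flags <form> as an empty
        // element, so its children become siblings. Removing the flag before
        // loading lets them nest. Remove is a no-op if the key is absent.
        HtmlNode.ElementsFlags.Remove("form");

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<Facet>();

        // Match one class token instead of the exact attribute string, so
        // extra or missing spaces inside class="..." no longer matter.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes(
            "//form[@id='_subcat_ids_']" +
            "//a[contains(concat(' ', normalize-space(@class), ' '), ' multiselect-facets ')]");

        // SelectNodes returns null (not an empty collection) on no match.
        if (anchors == null)
            return result;

        foreach (HtmlNode a in anchors)
        {
            // Direct text children only, so the <span> count is excluded.
            string name = string.Concat(
                a.ChildNodes
                 .Where(n => n.NodeType == HtmlNodeType.Text)
                 .Select(n => n.InnerText)).Trim();

            // The span may be missing on some items, so guard against null.
            HtmlNode span = a.SelectSingleNode(".//span[@class='subtle-note']");
            string count = span == null
                ? string.Empty
                : span.InnerText.Trim().Trim('(', ')');

            result.Add(new Facet
            {
                Href = a.GetAttributeValue("href", string.Empty),
                Name = HtmlEntity.DeEntitize(name),
                Count = count
            });
        }

        return result;
    }

    public static void Main()
    {
        // Hypothetical sample markup modeled on the snippet in the question.
        const string sample =
            "<form id=\"_subcat_ids_\"><ul>" +
            "<li><input type=\"checkbox\"><a class=\"facet-selection  multiselect-facets \" href=\"INeedThisHref#1\">" +
            "Text I Need <span class=\"subtle-note\">(2)</span></a></li>" +
            "</ul></form>";

        foreach (Facet f in Parse(sample))
            Console.WriteLine(f.Href + " | " + f.Name + " | " + f.Count);
    }
}
```

Collecting href, text, and count in one pass keeps the three values for each item together, instead of filling separate lists from two loops that have to stay in the same order.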