HtmlAgilityPack XPath scraping

Time: 2013-03-12 13:02:31

Tags: xpath web-scraping html-agility-pack

I am trying to scrape this website: http://www.gotickets.com/calendar.php?Display=Daily&Date=2013-03-12&EventTypeID=2&EventID=0&GenreID=159&VenueID=0&MarketAreaID=0

Here is the part I am interested in.

The data is organized as follows:

<div class="clr dayItem">
 <div class="clr genreHeader">Alternative Rock</div>
 <div class="clr genreEvents">
  <div class="clr dayEvent">
   <a href="/concert/muse/houston_1339329.php" title="7:00 PM Muse - Toyota Center - TX">Muse - Toyota Center - TX - 7:00 PM
   </a>
  </div>
  <div class="clr dayEvent">
   <a href="/concert/matchbox_20/pooler_1347335.php" title="7:30 PM Matchbox 20 - Johnny Mercer Theatre">Matchbox 20 - Johnny Mercer Theatre - 7:30 PM
   </a>
  </div>

  etc...
 </div>
</div>

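So basically, the page is split into two columns. Each column contains dayItems, and each dayItem holds the genre plus the dayEvents with their hrefs.

I have been trying to get at the data, but I am completely new to XPath and had been scraping with Regex until today.

The regular expressions became cumbersome and too complicated, so I switched to XPath.

To get the dayItems I use: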

 var cl = document.DocumentNode.SelectNodes("//*[contains(concat(' ', normalize-space(@class), ' '), ' dayItem ')]");

 foreach (var item in cl.Where(x => x.Attributes.Any(p => p.Value == "clr dayItem" && p.OriginalName == "class")))
 {
     /// THIS LINE FAILS
     var genre = item.SelectSingleNode("//.[contains(concat(' ', normalize-space(@class), ' '), ' genre ')]");

     Console.WriteLine(item.Name);

     foreach (var attr in item.Attributes.Select(x => x.OriginalName + ".." + x.Value))
     {
         Console.WriteLine(attr);
     }
 }

2 answers:

Answer 0 (score: 1)

Here is how to do this easily with XPath. It is simple because the document is well structured and has meaningful class attributes.

        HtmlWeb web = new HtmlWeb();
        HtmlDocument doc = web.Load("http://www.gotickets.com/calendar.php?Display=Daily&Date=2013-03-12&EventTypeID=2&EventID=0&GenreID=159&VenueID=0&MarketAreaID=0");

        foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@class='clr dayEvent']"))
        {
            Console.WriteLine("Event: " + node.InnerText);

            HtmlNode genre = node.SelectSingleNode("../../div[@class='clr genreHeader']");
            Console.WriteLine(" Genre:" + HtmlAgilityPack.HtmlEntity.DeEntitize(genre.InnerText));
        }

You can adapt this to your Event class. The event text itself is not HTML, so you will still have to parse it the way you do in your Event code.
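For reference, the same data can also be walked top-down, starting from each dayItem as the question attempted. What follows is only a minimal sketch: the `CalendarScraper` class and its `Extract` method are hypothetical names, and the markup is assumed to match the snippet in the question. Two details matter. In XPath 1.0 the abbreviated step `.` cannot take a predicate, so `//.[...]` is not a valid expression, and a path that starts with `//` is evaluated from the document root even when it is called on a child node. Also, the token test for `' genre '` cannot match, because the class token is `genreHeader`.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class CalendarScraper
{
    // Hypothetical helper: returns one (genre, link text, href) entry per event link.
    public static List<(string Genre, string Text, string Href)> Extract(string html)
    {
        var results = new List<(string Genre, string Text, string Href)>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        var dayItems = doc.DocumentNode.SelectNodes(
            "//div[contains(concat(' ', normalize-space(@class), ' '), ' dayItem ')]");
        if (dayItems == null) return results;

        foreach (HtmlNode item in dayItems)
        {
            // No leading slash: the path is relative to the current dayItem.
            HtmlNode header = item.SelectSingleNode(
                "div[contains(concat(' ', normalize-space(@class), ' '), ' genreHeader ')]");
            string genre = header == null
                ? ""
                : HtmlEntity.DeEntitize(header.InnerText).Trim();

            // ".//" limits the search to descendants of this dayItem.
            var links = item.SelectNodes(
                ".//div[contains(concat(' ', normalize-space(@class), ' '), ' dayEvent ')]/a");
            if (links == null) continue;

            foreach (HtmlNode a in links)
            {
                string text = HtmlEntity.DeEntitize(a.InnerText).Trim();
                string href = a.GetAttributeValue("href", "");
                results.Add((genre, text, href));
            }
        }
        return results;
    }
}
```

Calling `CalendarScraper.Extract(html)` on the downloaded page should yield one entry per event, each already paired with its genre, so no walking back up with `../..` is needed.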

You can learn XPath here: XPath Tutorial

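Answer 1 (score: 0)

Here is my working code. It is not as clean as I would like, but this was a one-off data-gathering expedition and I will never use this software again. I hope someone fixes my code to make it more efficient and better by relying on XPath.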

    // client (presumably a WebClient) and MakeDate are defined elsewhere in the original program.
    string html = client.DownloadString("http://www.gotickets.com/calendar.php?Display=Daily&EventTypeID=1&EventID=0&GenreID=159&VenueID=0&MarketAreaID=0" + "&Date=" + MakeDate);

    List<Event> events = new List<Event>();

    HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
    document.LoadHtml(html);

    var cl = document.DocumentNode.SelectNodes("//*[contains(concat(' ', normalize-space(@class), ' '), ' dayItem ')]");

    foreach (var item in cl)
    {
        // Strip every tab; the remaining newlines separate the genre line from the event lines.
        var genre_text = item.InnerText.Replace("\t", "");

        var lines = genre_text.Split(new string[] { "\n" }, StringSplitOptions.RemoveEmptyEntries).Select(x => WebUtility.HtmlDecode(x)).ToArray();

        // The first line is the genre header; every following line is one event.
        var genre = lines.First();

        events.AddRange(lines.Skip(1).Select(f =>
            new Event(f, f.Split(new string[] { "-" }, StringSplitOptions.RemoveEmptyEntries), genre, this.Date)));
    }

The Event class is just a container:

public class Event
{
    private string OriginalString;
    private string[] p;

    public Event(string originalString, string[] parts, string genre, DateTime date)
    {
        this.OriginalString = originalString;
        this.p = parts;
        this.Genre = genre;
        this.Date = date;
        analyze(parts);
    }
    public override string ToString()
    {
        string pattern = "{0},{1},{2},{3}";
        var s = string.Format(pattern, this.Date.ToString("MMM"), this.Genre, this.Location, this.Performer);
        return s;

    }

    private void analyze(string[] parts)
    {
        if (parts.Length < 3)
        {
            // Malformed input is a format problem, not an indexing error.
            throw new FormatException("Length < 3 ==> " + parts.Length);
        }

        if (parts.Length > 3)
        {
            this.Performer = parts[0].Trim();
            this.Location = parts[1].Trim() + "-" + parts[2].Trim();

        }
        else
        {
            this.Performer = parts[0].Trim();
            this.Location = parts[1].Trim();

        }

    }

    public string Genre { get; set; }
    public string Performer { get; set; }
    public string Location { get; set; }
    public DateTime Date { get; set; }
}
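One fragile spot above is splitting on a bare "-", which would also cut a hyphenated performer name in two. Below is a minimal sketch of an alternative that splits on " - " (with the surrounding spaces) and treats the last field as the time. `ParsedEvent` and `EventTextParser` are hypothetical names, not part of the original program, and the sketch assumes the link text always has the shape "performer - location - time" seen in the question. Unlike the class above, it rejoins a multi-part location with " - " instead of "-".

```csharp
using System;

// Hypothetical container for the three fields of one event line.
public sealed class ParsedEvent
{
    public string Performer { get; }
    public string Location { get; }
    public string Time { get; }

    public ParsedEvent(string performer, string location, string time)
    {
        Performer = performer;
        Location = location;
        Time = time;
    }
}

public static class EventTextParser
{
    // Parses text such as "Muse - Toyota Center - TX - 7:00 PM".
    public static ParsedEvent Parse(string text)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));

        // Split on " - " so that hyphens inside a name are left alone.
        string[] parts = text.Trim().Split(new[] { " - " }, StringSplitOptions.RemoveEmptyEntries);
        if (parts.Length < 3)
        {
            throw new FormatException("Expected at least 3 fields, got " + parts.Length);
        }

        string performer = parts[0].Trim();
        string time = parts[parts.Length - 1].Trim();

        // Everything between the performer and the time belongs to the location.
        string location = string.Join(" - ", parts, 1, parts.Length - 2).Trim();

        return new ParsedEvent(performer, location, time);
    }
}
```

A performer or venue whose name itself contains " - " would still be split in the wrong place, so this remains a heuristic.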

It works, but it is UGLYYY.