Here is what I'm interested in.
The data is organized as follows:
<div class="clr dayItem">
<div class="clr genreHeader">Alternative Rock</div>
<div class="clr genreEvents">
<div class="clr dayEvent">
<a href="/concert/muse/houston_1339329.php" title="7:00 PM Muse - Toyota Center - TX">Muse - Toyota Center - TX - 7:00 PM
</a>
</div>
<div class="clr dayEvent">
<a href="/concert/matchbox_20/pooler_1347335.php" title="7:30 PM Matchbox 20 - Johnny Mercer Theatre">Matchbox 20 - Johnny Mercer Theatre - 7:30 PM
</a>
</div>
etc...
</div>
</div>
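So basically, the page is split into two columns. Each column contains dayItems, and each dayItem holds the genre plus the dayEvents with their hrefs.
I've been trying to get at the data, but I'm completely new to XPath and have been scraping with regex until today.
The regex was getting tedious and far too complicated, so I switched to XPath.
To get the dayItems I use: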
var cl = document.DocumentNode.SelectNodes("//*[contains(concat(' ', normalize-space(@class), ' '), ' dayItem ')]");
foreach (var item in cl.Where(x=> x.Attributes.Any(p=>p.Value == "clr dayItem" && p.OriginalName=="class")))
{
/// THIS LINE FAILS
var genre = item.SelectSingleNode("//.[contains(concat(' ', normalize-space(@class), ' '), ' genre ')]");
Console.WriteLine(item.Name);
foreach (var attr in item.Attributes.Select(x => x.OriginalName + ".." + x.Value))
{
Console.WriteLine(attr);
}
}
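Answer 0 (score: 1)
Here is how you can do this easily with XPath. It is straightforward because the document is well structured and has meaningful class attributes.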
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://www.gotickets.com/calendar.php?Display=Daily&Date=2013-03-12&EventTypeID=2&EventID=0&GenreID=159&VenueID=0&MarketAreaID=0");
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@class='clr dayEvent']"))
{
Console.WriteLine("Event: " + node.InnerText);
HtmlNode genre = node.SelectSingleNode("../../div[@class='clr genreHeader']");
Console.WriteLine(" Genre:" + HtmlAgilityPack.HtmlEntity.DeEntitize(genre.InnerText));
}
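You can adapt this to your Event class. The event text itself is not HTML, so you will still have to parse it the way you do in your Event code.
You can learn XPath here: XPath Tutorial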
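As a rough illustration of that parsing step, here is a minimal sketch that splits the event text on " - " (hyphen with surrounding spaces) and treats the last part as the time. `EventTextParser` and `Parse` are hypothetical names, not part of the code in this thread, and the sketch assumes the text always has the form "Performer - Venue[ - State] - Time" seen in the sample, with no " - " inside the performer name.

```csharp
using System;

public static class EventTextParser
{
    // Hypothetical helper: splits "Performer - Venue[ - State] - Time".
    // Assumes the performer name itself never contains " - ".
    public static (string Performer, string Location, string Time) Parse(string text)
    {
        // Splitting on " - " rather than "-" leaves hyphenated names intact.
        string[] parts = text.Trim().Split(new[] { " - " }, StringSplitOptions.RemoveEmptyEntries);
        if (parts.Length < 3)
        {
            throw new FormatException("Expected at least 3 parts, got " + parts.Length);
        }

        string performer = parts[0].Trim();
        string time = parts[parts.Length - 1].Trim();
        // Everything between the performer and the time is the location.
        string location = string.Join(" - ", parts, 1, parts.Length - 2).Trim();
        return (performer, location, time);
    }
}
```

Calling `EventTextParser.Parse(node.InnerText)` inside the loop above would give the three fields for each event, provided the text really follows that pattern.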
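Answer 1 (score: 0)
Here is my working code. It isn't as clean as I'd like, but this is only a one-off data-gathering expedition and I will never use this software again. I'd still like someone to fix my code so that it is more efficient and cleaner by relying on XPath.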
string html = client.DownloadString("http://www.gotickets.com/calendar.php?Display=Daily&EventTypeID=1&EventID=0&GenreID=159&VenueID=0&MarketAreaID=0" + "&Date=" + MakeDate);
List<Event> events = new List<Event>();
HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
document.LoadHtml(html);
var cl = document.DocumentNode.SelectNodes("//*[contains(concat(' ', normalize-space(@class), ' '), ' dayItem ')]");
foreach (var item in cl)
{
// Strip every tab character; a single Replace covers runs of any length.
var genre_text = item.InnerText.Replace("\t", "");
var lines = genre_text.Split(new string[] {"\n"}, StringSplitOptions.RemoveEmptyEntries).Select(x=> WebUtility.HtmlDecode(x)).ToArray();
var genre = lines.Take(1).First();
events.AddRange(lines.Skip(1).Select(f =>
new Event(f, f.Split(new string[] { "-" }, StringSplitOptions.RemoveEmptyEntries), genre, this.Date)
));
}
The Event class is just a container:
public class Event
{
private string OriginalString;
private string[] p;
public Event(string originalString, string[] parts, string genre, DateTime date)
{
this.OriginalString = originalString;
this.p = parts;
this.Genre = genre;
this.Date = date;
analyze(parts);
}
public override string ToString()
{
string pattern = "{0},{1},{2},{3}";
var s = string.Format(pattern, this.Date.ToString("MMM"), this.Genre, this.Location, this.Performer);
return s;
}
private void analyze(string[] parts)
{
if (parts.Length < 3)
{
throw new IndexOutOfRangeException("Length < 3 ==> " + parts.Length);
}
if (parts.Length > 3)
{
this.Performer = parts[0].Trim();
this.Location = parts[1].Trim() + "-" + parts[2].Trim();
}
else
{
this.Performer = parts[0].Trim();
this.Location = parts[1].Trim();
}
}
public string Genre { get; set; }
public string Performer { get; set; }
public string Location { get; set; }
public DateTime Date { get; set; }
}
It works, but it's UGLYYY.
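For anyone who wants to lean on XPath instead of splitting InnerText, the following is one possible sketch, not a tested replacement for the code above. It walks each dayItem and queries the genre header and the event links relative to that node. `DayItemScraper` and `Extract` are hypothetical names, and it assumes the markup matches the sample in the question. Note the leading `./` and `.//`: a path that starts with `//` is evaluated from the document root even when it is called on a child node.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class DayItemScraper
{
    // Builds an XPath predicate that matches one token of a class attribute.
    private static string HasClass(string name)
    {
        return "contains(concat(' ', normalize-space(@class), ' '), ' " + name + " ')";
    }

    // Hypothetical helper: returns (genre, link text, href) for every event link.
    public static List<(string Genre, string Text, string Href)> Extract(string html)
    {
        var results = new List<(string Genre, string Text, string Href)>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        var dayItems = doc.DocumentNode.SelectNodes("//div[" + HasClass("dayItem") + "]");
        if (dayItems == null) return results;

        foreach (HtmlNode item in dayItems)
        {
            // "./" keeps the query relative to the current dayItem.
            HtmlNode header = item.SelectSingleNode("./div[" + HasClass("genreHeader") + "]");
            string genre = header == null ? "" : HtmlEntity.DeEntitize(header.InnerText).Trim();

            var links = item.SelectNodes(".//div[" + HasClass("dayEvent") + "]/a");
            if (links == null) continue;

            foreach (HtmlNode link in links)
            {
                string text = HtmlEntity.DeEntitize(link.InnerText).Trim();
                string href = link.GetAttributeValue("href", "");
                results.Add((genre, text, href));
            }
        }
        return results;
    }
}
```

Each returned tuple could then be passed to the Event constructor in place of the lines obtained by splitting on "\n".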