Html Agility Pack: getting content from a table

Date: 2014-06-26 00:54:46

Tags: c# html web-scraping html-agility-pack

I need to get the location, address, and phone number from "http://anytimefitness.com/find-gym/list/AL". So far, I have this:

    HtmlDocument htmlDoc = new HtmlDocument();

    htmlDoc.OptionFixNestedTags = true;
    htmlDoc.LoadHtml(stateURLs[0].ToString());

    var BlankNode = 
        htmlDoc.DocumentNode.SelectNodes("/div[@class='segmentwhite']/table[@style='width: 100%;']//tr[@class='']");

    var GrayNode = 
        htmlDoc.DocumentNode.SelectNodes("/div[@class='segmentwhite']/table[@style='width: 100%;']//tr[@class='gray_bk']");

I have been browsing Stack Overflow for a while, but none of the existing posts about HtmlAgilityPack have really helped. I have also been using http://www.w3schools.com/xpath/xpath_syntax.asp as a reference.

2 answers:

Answer 0 (score: 1)

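Because the <div> you are after is not a direct child of the root node, you need to use // instead of /. You can then combine the XPath expressions for BlankNode and GrayNode into one using the or operator, for example: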

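    var htmlweb = new HtmlWeb();
    // Parser options must be set before parsing, and Load() parses immediately,
    // so apply the option through the pre-handle hook.
    htmlweb.PreHandleDocument = doc => doc.OptionFixNestedTags = true;
    HtmlDocument htmlDoc = htmlweb.Load("http://anytimefitness.com/find-gym/list/AL");

    var AllNode =
        htmlDoc.DocumentNode.SelectNodes("//div[@class='segmentwhite']/table//tr[@class='' or @class='gray_bk']");

    // SelectNodes returns null (not an empty collection) when nothing matches.
    if (AllNode != null)
    {
        foreach (HtmlNode node in AllNode)
        {
            var location = node.SelectSingleNode("./td[2]").InnerText;
            var address = node.SelectSingleNode("./td[3]").InnerText;
            var phone = node.SelectSingleNode("./td[4]").InnerText;

            // do something with the information above
        }
    }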
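
To make the difference between / and // concrete, here is a minimal sketch that runs the two expressions against an inline snippet instead of the live page. The markup, the class name XPathAxisDemo, and the helper CountMatches are hypothetical and only imitate the structure this answer assumes; the real page may differ. It also shows why the question's LoadHtml call finds nothing: LoadHtml parses its argument as markup, so a URL string becomes a plain text node.

```csharp
using System;
using HtmlAgilityPack;

public static class XPathAxisDemo
{
    // Hypothetical markup imitating the assumed page structure (made-up data).
    public const string SampleHtml = @"
<html><body>
  <div class='segmentwhite'>
    <table>
      <tr class='header'><td>#</td><td>Location</td><td>Address</td><td>Phone</td></tr>
      <tr class=''><td>1</td><td>Alabaster</td><td>100 Example St</td><td>555-0100</td></tr>
      <tr class='gray_bk'><td>2</td><td>Auburn</td><td>200 Example Ave</td><td>555-0101</td></tr>
    </table>
  </div>
</body></html>";

    // Parses the given markup and counts the nodes matched by the XPath,
    // treating the null that SelectNodes returns for "no match" as zero.
    public static int CountMatches(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }

    public static void Main()
    {
        const string rows = "/table//tr[@class='' or @class='gray_bk']";

        // A leading "/" only looks at direct children of the document root.
        Console.WriteLine(CountMatches(SampleHtml, "/div[@class='segmentwhite']" + rows));

        // A leading "//" searches the whole tree.
        Console.WriteLine(CountMatches(SampleHtml, "//div[@class='segmentwhite']" + rows));

        // LoadHtml treats its argument as markup, so a URL contains no rows at all.
        Console.WriteLine(CountMatches("http://anytimefitness.com/find-gym/list/AL", "//tr"));
    }
}
```

Only the second expression should find the two data rows; the header row is skipped because its class is neither empty nor gray_bk.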

Answer 1 (score: 0)

Here is an example that I tested in LinqPad.

    string url = @"http://anytimefitness.com/find-gym/list/AL";
    string html;
    // Dispose the client once the page has been downloaded.
    using (var client = new System.Net.WebClient())
    {
        var data = client.DownloadData(url);
        html = Encoding.UTF8.GetString(data);
    }

    var htmlDoc = new HtmlAgilityPack.HtmlDocument();
    htmlDoc.OptionFixNestedTags = true;
    htmlDoc.LoadHtml(html);

    // SelectNodes returns null when nothing matches, so guard before iterating.
    var gyms = htmlDoc.DocumentNode.SelectNodes("//tbody/tr[@class='' or @class='gray_bk']");
    if (gyms != null) {
        foreach (var gym in gyms) {
            var city = gym.SelectSingleNode("./td[2]").InnerText;
            var address = gym.SelectSingleNode("./td[3]").InnerText;
            var phone = gym.SelectSingleNode("./td[4]").InnerText;
        }
    }

Since HtmlAgilityPack also supports LINQ, you can do this as well:

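    // Requires "using System.Linq;" outside LinqPad.
    string[] classes = {"", "gray_bk"};

    var gyms = htmlDoc
            .DocumentNode
            .Descendants("tr")
            // Attributes["class"] is null for rows without a class attribute,
            // so check it before reading Value.
            .Where(t => t.Attributes["class"] != null
                     && classes.Contains(t.Attributes["class"].Value))
            .ToList();

    gyms.ForEach(gym => {
        var city = gym.SelectSingleNode("./td[2]").InnerText;
        var address = gym.SelectSingleNode("./td[3]").InnerText;
        var phone = gym.SelectSingleNode("./td[4]").InnerText;
    });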
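
Both variants above assume that every matched row has at least four cells, and they leave the raw InnerText untouched. The following is a more defensive sketch of the LINQ approach, not the method used in this answer: it skips rows that lack a class attribute or have too few cells, then trims and entity-decodes the text. The class GymTableParser, the Gym type, and the sample markup are hypothetical names and made-up data for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class GymTableParser
{
    // Hypothetical container type; the property names are illustrative only.
    public sealed class Gym
    {
        public string City { get; set; }
        public string Address { get; set; }
        public string Phone { get; set; }
    }

    // Hypothetical markup with made-up data, including rows that should be skipped.
    public const string SampleHtml = @"
<table>
  <tr><td>#</td><td>City</td><td>Address</td><td>Phone</td></tr>
  <tr class=''><td>1</td><td> Alabaster </td><td>100 Main &amp; 1st</td><td>555-0100</td></tr>
  <tr class='gray_bk'><td>2</td><td>Auburn</td><td>200 Example Ave</td><td>555-0101</td></tr>
  <tr class='gray_bk'><td>too few cells</td></tr>
</table>";

    private static readonly string[] RowClasses = { "", "gray_bk" };

    // Decodes entities such as &amp; and trims surrounding whitespace.
    private static string Clean(HtmlNode cell)
    {
        return HtmlEntity.DeEntitize(cell.InnerText).Trim();
    }

    public static List<Gym> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
            .Descendants("tr")
            // Rows without a class attribute yield null here and are skipped.
            .Where(tr => RowClasses.Contains(tr.GetAttributeValue("class", (string)null)))
            // Use only direct <td> children so nested tables do not interfere.
            .Select(tr => tr.Elements("td").ToList())
            .Where(cells => cells.Count >= 4)
            .Select(cells => new Gym
            {
                City = Clean(cells[1]),
                Address = Clean(cells[2]),
                Phone = Clean(cells[3])
            })
            .ToList();
    }

    public static void Main()
    {
        foreach (var gym in Parse(SampleHtml))
        {
            Console.WriteLine(gym.City + " | " + gym.Address + " | " + gym.Phone);
        }
    }
}
```

Returning a list of small objects rather than reading the cells inside the loop keeps the parsing step separate from whatever is done with the results afterwards.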