How to scrape multiple selectors and group them

Time: 2018-11-19 00:22:32

Tags: c# .net web-scraping css-selectors html-agility-pack

I want to scrape this page: https://www.g2crowd.com/products/google-analytics/reviews (for my own educational purposes).

// @nuget: HtmlAgilityPack
using System;
using HtmlAgilityPack;

public class Program
{
    public static void Main()
    {
        HtmlWeb web = new HtmlWeb();
        HtmlDocument html = web.Load("https://www.g2crowd.com/products/google-analytics/reviews");
        var textNodes = html.DocumentNode.SelectNodes("//h3[contains(@class,'review-list-heading')]");
        if (textNodes != null)
            foreach (var t in textNodes)
                Console.WriteLine(t.InnerText);
    }
}

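This is what I have so far, and it retrieves the title of every review perfectly. But how would I also get the review body under each title, with a new heading per review, so that it is clear where one review ends and the next begins?

The review "body" (meaning the text) is at //*[@id="pjax-container"]/div[2]/div[2]/div[6]/div[3]/div/div/div[2]/div[2]/div/div in XPath.

Or, in plain HTML, <div itemprop="reviewBody">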

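Please ask if I have not been clear enough.

1 answer:

Answer 0: (score: 0)

Select the parent container <div class="mb-2 border-bottom"> of each review, then select the child elements you need inside it.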

// @nuget: HtmlAgilityPack
using System;
using HtmlAgilityPack;

public class Program
{
    public static void Main()
    {
        HtmlWeb web = new HtmlWeb();
        HtmlDocument html = web.Load("https://www.g2crowd.com/products/google-analytics/reviews");
        var divNodes = html.DocumentNode.SelectNodes("//div[@class='mb-2 border-bottom']");
        if (divNodes != null)
        {
            foreach (var child in divNodes)
            {
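                // SelectNodes returns null when nothing matches, so guard before iterating.
                var allowedTags = child.SelectNodes(".//h3 | .//h5 | .//p");
                if (allowedTags != null)
                    foreach (var tag in allowedTags)
                        Console.WriteLine(tag.InnerText);
                Console.WriteLine("======================================");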
            }
        }
    }
}
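
If you want each title paired with its own body rather than a flat list of tags, one option is to collect them into small objects. The following is only a sketch under some assumptions: the `Review` class, the `ReviewParser.Parse` helper and the sample markup are made up for illustration, and the real page may use different class names or load its reviews with JavaScript. It parses an inline string with `LoadHtml` so it can be run without hitting the site.

```csharp
// @nuget: HtmlAgilityPack
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical container type for one review (not part of HtmlAgilityPack).
public class Review
{
    public string Title { get; set; }
    public string Body { get; set; }
}

public static class ReviewParser
{
    // Hypothetical helper: finds each review container, then reads the
    // heading and the reviewBody element relative to that container.
    public static List<Review> Parse(string markup)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(markup);

        var reviews = new List<Review>();
        var containers = doc.DocumentNode.SelectNodes("//div[@class='mb-2 border-bottom']");
        if (containers == null)
            return reviews; // no review containers matched

        foreach (var container in containers)
        {
            // The leading "." keeps each query scoped to the current review.
            var title = container.SelectSingleNode(".//h3[contains(@class,'review-list-heading')]");
            var body = container.SelectSingleNode(".//div[@itemprop='reviewBody']");

            reviews.Add(new Review
            {
                Title = title == null ? string.Empty : HtmlEntity.DeEntitize(title.InnerText).Trim(),
                Body = body == null ? string.Empty : HtmlEntity.DeEntitize(body.InnerText).Trim()
            });
        }
        return reviews;
    }
}

public class Program
{
    public static void Main()
    {
        // Made-up markup that imitates the structure described in the question.
        string sample =
            "<div class='mb-2 border-bottom'>" +
            "<h3 class='review-list-heading'>Great tool</h3>" +
            "<div itemprop='reviewBody'><p>Easy to set up.</p></div>" +
            "</div>" +
            "<div class='mb-2 border-bottom'>" +
            "<h3 class='review-list-heading'>Steep learning curve</h3>" +
            "<div itemprop='reviewBody'><p>Powerful once configured.</p></div>" +
            "</div>";

        foreach (var review in ReviewParser.Parse(sample))
        {
            Console.WriteLine(review.Title);
            Console.WriteLine(review.Body);
            Console.WriteLine("======================================");
        }
    }
}
```

Because both XPath queries start with `.//`, a title can never be matched with the body of a different review. To run it against the live page, pass the document from `web.Load(...)` through the same queries instead of the sample string.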