Question

I want to use HTML Agility Pack to scrape the hrefs from a YouTube playlist link. This is what I have so far:

    var html = new HtmlDocument();
    html.LoadHtml(new WebClient().DownloadString("https://www.youtube.com/playlist?list=PLirAqAtl_h2r5g8xGajEwdXd3x1sZh8hC"));
    var root = html.DocumentNode;
    var p = root.Descendants()
        .Where(n => n.GetAttributeValue("class", "").Equals("pl-video-title"))
        .FirstOrDefault()
        .Descendants("a").Select(node => node.GetAttributeValue("href", ""))
        .FirstOrDefault();

    var points = ("https://youtube.com/embed/" + (Regex.Replace(p, "list=PLirAqAtl_h2r5g8xGajEwdXd3x1sZh8hC&index=1", "").Trim()));

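This code works, but the videos are laid out in a table, and with this code I only get the first href. I don't know how to scrape each href in the table individually (there are about 10 of them). This is the "selector/ID/class" I want to scrape: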

#pl-load-more-destination > tr:nth-child(1) > td.pl-video-title

When I put that in place of "pl-video-title", I get an error.

I have been looking at XPath, but I can't get it to work.

Answer 1

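Assuming you want the hrefs of the playlist's video links, you can get them with the following.

(Note that I'm using the ScrapySharp NuGet library together with HtmlAgilityPack to get CSS selector support through the CssSelect extension method; add using ScrapySharp.Extensions.)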

    HtmlWeb w = new HtmlWeb();
    var htmlDoc = w.Load("https://www.youtube.com/playlist?list=PLirAqAtl_h2r5g8xGajEwdXd3x1sZh8hC");

    var links = htmlDoc.DocumentNode.CssSelect(".pl-video-title-link");
    foreach (var link in links)
        Console.WriteLine(link.GetAttributeValue("href"));

The output looks like

    /watch?v=9bZkp7q19f0&list=PLirAqAtl_h2r5g8xGajEwdXd3x1sZh8hC&index=1

where the index parameter changes with the link's position in the list.

If you plan to use the hrefs in further scraping, don't forget to prepend www.youtube.com to each link: the href is relative, not absolute, so it is not a valid URI outside the site.
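Since the question mentions XPath, here is a minimal sketch of the same selection using only HtmlAgilityPack, which also resolves each relative href against the site root. The sample markup, the class name `PlaylistLinks`, and the helper `ExtractAbsoluteLinks` are made up for illustration; the real playlist page may be structured differently.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class PlaylistLinks
{
    // Made-up markup that mimics the playlist table; not the real page.
    public const string SampleHtml = @"
<table id='pl-load-more-destination'>
  <tr><td class='pl-video-title'><a class='pl-video-title-link' href='/watch?v=AAA'>One</a></td></tr>
  <tr><td class='pl-video-title'><a class='pl-video-title-link' href='/watch?v=BBB'>Two</a></td></tr>
</table>";

    // Hypothetical helper: returns every title link as an absolute URL.
    public static List<string> ExtractAbsoluteLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();
        // SelectNodes returns null (not an empty collection) when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[contains(@class,'pl-video-title-link')]");
        if (anchors == null)
            return result;

        var baseUri = new Uri("https://www.youtube.com");
        foreach (var a in anchors)
        {
            var href = a.GetAttributeValue("href", "");
            // Uri(baseUri, relative) turns "/watch?..." into an absolute URL.
            result.Add(new Uri(baseUri, href).AbsoluteUri);
        }
        return result;
    }

    public static void Main()
    {
        foreach (var link in ExtractAbsoluteLinks(SampleHtml))
            Console.WriteLine(link);
    }
}
```

To run this against the live page, pass the downloaded HTML string to `ExtractAbsoluteLinks` instead of `SampleHtml`. This assumes the links still carry the pl-video-title-link class.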

Update

To remove a given key from a URL's query string, here is a simple approach:

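    string url = "http://www.youtube.com/watch?v=bbEoRnaOIbs&list=PLirAqAtl_h2r5g8xGajEwdXd3x1sZh8hC&index=100";

    // Parse only the query part; passing the whole URL would fold the scheme, host and path into the first key.
    var uri = new Uri(url);
    var parsedQs = HttpUtility.ParseQueryString(uri.Query); // requires: using System.Web;
    parsedQs.Remove("index");

    // Rebuild the URL from the path and the remaining query parameters.
    Console.WriteLine(uri.GetLeftPart(UriPartial.Path) + "?" + parsedQs);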

The URL is then printed without the index parameter.
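The question ultimately builds an embed URL from each href. As a rough sketch, the same query-string parsing can pull out the v parameter directly instead of stripping the other keys with a regex. The class `YouTubeUrls` and the helper `ToEmbedUrl` are hypothetical names, and the embed prefix is taken from the question's code.

```csharp
using System;
using System.Web;

public static class YouTubeUrls
{
    // Hypothetical helper: builds an embed URL from a relative or absolute watch href.
    public static string ToEmbedUrl(string href)
    {
        // Resolve relative hrefs such as "/watch?v=..." against the site root;
        // an absolute href is left unchanged by this constructor.
        var uri = new Uri(new Uri("https://www.youtube.com"), href);

        // Read the video id from the "v" query parameter.
        var videoId = HttpUtility.ParseQueryString(uri.Query)["v"];
        if (string.IsNullOrEmpty(videoId))
            throw new ArgumentException("href has no 'v' parameter", nameof(href));

        return "https://youtube.com/embed/" + videoId;
    }

    public static void Main()
    {
        Console.WriteLine(ToEmbedUrl("/watch?v=9bZkp7q19f0&list=PLirAqAtl_h2r5g8xGajEwdXd3x1sZh8hC&index=1"));
    }
}
```

Reading the v key this way does not depend on the order or values of the list and index parameters, which the hard-coded regex pattern in the question does.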
