Question

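I want to use the HTML Agility Pack to parse the image and href links out of an HTML page, but I don't know much about XML or XPath. Although I have looked through help documents on many websites, I just can't solve this problem. I am using C# in Visual Studio 2005. I am also not fluent in English, so I would sincerely appreciate it if you could write some useful code.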

Answer 1

The first example on the home page does something similar, but consider:

 HtmlDocument doc = new HtmlDocument();
 doc.Load("file.htm"); // would need doc.LoadHtml(htmlSource) if it is not a file
 // Note: newer releases use doc.DocumentNode and link.Attributes["href"] (see Answer 2).
 foreach(HtmlNode link in doc.DocumentElement.SelectNodes("//a[@href]"))
 {
    string href = link["href"].Value;
    // store href somewhere
 }

So you can imagine that for img@src, you just replace each a with img and href with src. You might even be able to simplify to:

 foreach(HtmlNode node in doc.DocumentElement
               .SelectNodes("//a/@href | //img/@src"))
 {
    list.Add(node.Value);
 }

For relative URL handling, look at the Uri class.
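To make the Uri suggestion concrete, here is a minimal sketch of resolving an extracted href or src against the address the page was loaded from. The class name RelativeUrlSketch, the method name Resolve, and the example.com addresses are made up for illustration; only the System.Uri constructor and AbsoluteUri property are real API.

```csharp
using System;

public static class RelativeUrlSketch
{
    // pageUrl: absolute URL the HTML was downloaded from (assumed known to the caller).
    // href: the raw href/src value taken from the page; may be relative or absolute.
    public static string Resolve(string pageUrl, string href)
    {
        Uri baseUri = new Uri(pageUrl);
        // Uri(Uri, string) combines the two; an absolute href replaces the base entirely.
        Uri resolved = new Uri(baseUri, href);
        return resolved.AbsoluteUri;
    }
}
```

For example, Resolve("http://example.com/docs/index.html", "../about.html") should give "http://example.com/about.html", while an href that is already absolute comes back unchanged.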

Answer 2

The example and the accepted answer are wrong. They do not compile with the latest version. I tried something else:

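    // Requires: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
    private List<string> ParseLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        // SelectNodes returns null when nothing matches, hence the null check below.
        var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
        // Note: this flattens the values of every attribute on each <a>, not only href.
        return nodes == null ? new List<string>() : nodes.ToList().ConvertAll(
               r => r.Attributes.ToList().ConvertAll(
               i => i.Value)).SelectMany(j => j).ToList();
    }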

This works for me.

Answer 3

Maybe I am too late to post an answer. The following worked for me:

var MainImageString  = MainImageNode.Attributes.Where(i=> i.Name=="src").FirstOrDefault();

Answer 4

You also need to take into account the document base URL element (<base>) and protocol-relative URLs (for example //www.foo.com/bar/).

For more information, see:

<base>: The Document Base URL element page
The Protocol-relative URL article by Paul Irish
What are the recommendations for html <base> tag? discussion on StackOverflow
Uri Constructor (Uri, Uri) page
Uri class doesn't handle the protocol-relative URL discussion on StackOverflow
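As a rough sketch of the <base> part only, the idea is to work out an effective base first and then resolve each link against it. BaseHrefSketch, Resolve, and the parameter names are hypothetical; the sketch assumes you have already read the <base href> value out of the document yourself (passing null when there is none), and it does not attempt to cover the protocol-relative case discussed in the links above.

```csharp
using System;

public static class BaseHrefSketch
{
    // pageUrl:  absolute URL the HTML was downloaded from.
    // baseHref: value of <base href="...">, or null/empty when the page has none.
    // href:     raw href/src value taken from the page.
    public static string Resolve(string pageUrl, string baseHref, string href)
    {
        Uri effectiveBase = new Uri(pageUrl);
        if (!string.IsNullOrEmpty(baseHref))
        {
            // The base href may itself be relative to the page URL.
            effectiveBase = new Uri(effectiveBase, baseHref);
        }
        return new Uri(effectiveBase, href).AbsoluteUri;
    }
}
```

With no <base> element the page URL is used directly, so links resolve exactly as they would without this step.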

Answer 5

var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);

string name = htmlDoc.DocumentNode
    .SelectNodes("//td/input")
    .First()
    .Attributes["value"].Value;

Source: https://html-agility-pack.net/select-nodes

How can I use Html Agility Pack to get img/src or a/hrefs?
