Question

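I'm trying to select all elements that have a given class and remove them from an HTML string.

This is what I have so far. It doesn't seem to remove anything, even though the source clearly shows 4 elements with that class name.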

// Filter page HTML to display required content
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();

// pageHTML is a string containing the HTML
htmlDoc.LoadHtml(pageHTML);

// ParseErrors is a collection containing any errors from the Load statement
if (!htmlDoc.ParseErrors.Any())
{
    // Remove all elements marked with pdf-ignore class
    HtmlNodeCollection nodes = htmlDoc.DocumentNode.SelectNodes("//body[@class='pdf-ignore']");

    // Remove the collection from above
    foreach (var node in nodes)
    {
        node.Remove();
    }
}

EDIT: Just to clarify, the document is being parsed and the SelectNodes line is being hit; it just isn't returning anything.

Here is a snippet of the HTML:

<input type=\"submit\" name=\"ctl00$MainContent$PrintBtn\" value=\"Print Shotlist\" onclick=\"window.print();\" id=\"MainContent_PrintBtn\" class=\"pdf-ignore\">

Answer 1

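EDIT: In your updated question you posted a fragment of the HTML string that declares an <input> element, but your expression //body[@class='pdf-ignore'] tries to match a <body> element with the class pdf-ignore.

If you want to match every element in the document that has this class, you should use: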

var nodes = htmlDoc.DocumentNode.SelectNodes("//*[contains(@class,'pdf-ignore')]");

to get the nodes. This matches every element whose class attribute contains the specified class name.
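One caveat with contains(@class, ...) is that it is a substring test, so it would also match a class such as pdf-ignore-header. A common XPath 1.0 workaround is to pad the normalized class list with spaces and search for the padded token. The sketch below assumes the HtmlAgilityPack package is referenced; ClassXPath, ForClassToken and CountMatches are hypothetical names used only for illustration, and the class name is assumed to contain no quote characters.

```csharp
using HtmlAgilityPack;

public static class ClassXPath
{
    // Hypothetical helper: builds an XPath that matches className only as a
    // whole, whitespace-separated token of the class attribute.
    // Assumes className contains no quote characters.
    public static string ForClassToken(string className) =>
        "//*[contains(concat(' ', normalize-space(@class), ' '), ' " + className + " ')]";

    // Hypothetical helper: counts the elements in html selected by xpath.
    public static int CountMatches(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }
}
```

With this, ClassXPath.ForClassToken("pdf-ignore") should select class="note pdf-ignore" but not class="pdf-ignore-header", whereas the plain contains(@class,'pdf-ignore') test selects both.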

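Your code appears to be correct except for one detail: the condition htmlDoc.ParseErrors == null. With that condition, nodes are selected and removed only if the ParseErrors property (of type IEnumerable<HtmlParseError>) is null, but this property actually returns an empty list when no errors are found. So changing the code to: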

if (!htmlDoc.ParseErrors.Any())
{
    // some logic here
}

should solve the problem.
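Putting both fixes together, a minimal sketch might look like the following. It assumes the HtmlAgilityPack package is referenced; PdfFilter and RemoveIgnored are hypothetical names, and returning the input unchanged when parsing reports errors is just one possible choice. Note that SelectNodes returns null rather than an empty collection when nothing matches, so the foreach needs a guard.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class PdfFilter
{
    // Hypothetical helper: returns pageHtml with every element whose class
    // attribute contains "pdf-ignore" removed.
    public static string RemoveIgnored(string pageHtml)
    {
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(pageHtml);

        // ParseErrors is empty (not null) when parsing succeeded.
        // Assumption: on parse errors we leave the HTML untouched.
        if (htmlDoc.ParseErrors.Any())
        {
            return pageHtml;
        }

        var nodes = htmlDoc.DocumentNode.SelectNodes("//*[contains(@class,'pdf-ignore')]");

        // No match yields null, so there is nothing to remove.
        if (nodes == null)
        {
            return pageHtml;
        }

        // Copy to a list so removal does not disturb the enumeration.
        foreach (var node in nodes.ToList())
        {
            node.Remove();
        }

        return htmlDoc.DocumentNode.OuterHtml;
    }
}
```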

Answer 2

Your XPath probably doesn't match: have you tried "//div[class='pdf-ignore']" (without the "@")?

Remove all elements with a given class from HTML using Agility Pack
