Question

I am trying to remove any duplicate or repeated occurrences of <br> tags in my HTML document. This is what I have come up with so far (really stupid code):

HtmlNodeCollection elements = nodeCollection.ElementAt(0)
                             .SelectNodes("//br");

if (elements != null)
{
    foreach (HtmlNode element in elements)
    {
        if (element.Name == "br")
        {
            bool iterate = true;
            while (iterate == true)
            {
                iterate = removeChainElements(element);
            }
        }
    }
}

private bool removeChainElements(HtmlNode element)
{
    if (element.NextSibling != null && element.NextSibling.Name == "br")
    {
        element.NextSibling.Remove();
    }
    if (element.NextSibling != null && element.NextSibling.Name == "br")
        return true;
    else
        return false;
}

The code does find the br tags, but it does not remove any elements at all.

Answer 1

I think your solution is more complicated than it needs to be, although the idea seems correct as far as I understand it.

It would be easier to first find all the <br /> nodes and then remove only those whose previous sibling is also a <br /> node.

Let's start with the following example:

var html = @"<div>the first line<br /><br />the next one<br /></div>";
var doc = new HtmlDocument();
doc.LoadHtml(html);

Now find the <br /> nodes and remove the chains of duplicate elements:

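// SelectNodes can return null when nothing matches, so guard before enumerating
var nodes = doc.DocumentNode.SelectNodes("//br");
if (nodes != null)
{
    // ToArray (System.Linq) takes a snapshot so nodes can be removed while iterating
    foreach (var node in nodes.ToArray())
        if (node.PreviousSibling != null && node.PreviousSibling.Name == "br")
            node.Remove();
}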

Then get the result:

var output = doc.DocumentNode.OuterHtml;

which is:

<div>the first line<br>the next one<br></div>
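
One case the sibling check above may miss is markup where the <br /> tags are separated by whitespace or line breaks, because the parser then places a text node between them. The following is only a minimal sketch of one way to handle that, assuming that whitespace-only text nodes should be ignored when looking for the previous sibling. The class BrChains and the helper PreviousSignificantSibling are hypothetical names, not part of HTML Agility Pack, and an &nbsp; entity is not treated as whitespace here.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

static class BrChains
{
    // Hypothetical helper: walks back past whitespace-only text nodes and
    // returns the nearest previous sibling that carries real content.
    static HtmlNode PreviousSignificantSibling(HtmlNode node)
    {
        var prev = node.PreviousSibling;
        while (prev != null
               && prev.NodeType == HtmlNodeType.Text
               && string.IsNullOrWhiteSpace(prev.InnerText))
        {
            prev = prev.PreviousSibling;
        }
        return prev;
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div>the first line<br />\n  <br />the next one<br /></div>");

        // SelectNodes can return null when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//br");
        if (nodes == null) return;

        // Iterate over a snapshot because nodes are removed inside the loop.
        foreach (var node in nodes.ToArray())
        {
            var prev = PreviousSignificantSibling(node);
            if (prev != null && prev.Name == "br")
                node.Remove();
        }

        Console.WriteLine(doc.DocumentNode.OuterHtml);
    }
}
```

The whitespace text nodes themselves are left in place; only the extra br elements are removed.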

Answer 2

Maybe you could do something like htmlSource = htmlSource.Replace(" ", );

Or something like this:

    string html = "<br><br><br><br><br>";

    html = html.Replace("<br>", string.Empty);

    html = string.Format("{0}<br />", html);

    html = html.Replace(" ", string.Empty);
    html = html.Replace("\t", string.Empty);
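
Note that the snippet above strips every <br> and every space from the string, so it only suits input made up of nothing but <br> tags. If the string-based route is still preferred, a rough sketch with a regular expression could collapse each run of tags instead. This is an assumption about what is wanted rather than a tested solution: CollapseBrRuns is a hypothetical helper name, the pattern only covers plain <br>, <br/> and <br /> without attributes, and regular expressions are fragile on real-world HTML.

```csharp
using System;
using System.Text.RegularExpressions;

static class BrRegexSketch
{
    // Hypothetical helper: replaces each run of two or more <br> tags
    // (optionally separated by whitespace) with a single <br />.
    static string CollapseBrRuns(string html)
    {
        return Regex.Replace(
            html,
            @"<br\s*/?>(?:\s*<br\s*/?>)+",
            "<br />",
            RegexOptions.IgnoreCase);
    }

    static void Main()
    {
        // prints: <br />
        Console.WriteLine(CollapseBrRuns("<br><br><br><br><br>"));

        // prints: one<br />two<br>three
        Console.WriteLine(CollapseBrRuns("one<br /><br />\n<br/>two<br>three"));
    }
}
```

A single <br> is left untouched, so the surrounding text and its spacing are preserved.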

Remove chain of duplicate elements with HTML Agility Pack
