Html Agility Pack - Remove an element, but not its innerHtml

Asked: 2012-08-23 13:22:26

Tags: c# html html-agility-pack

I can easily remove an element with node.Remove():

HtmlDocument html = new HtmlDocument();

html.Load(Server.MapPath(@"~\Site\themes\default\index.cshtml"));

foreach (var item in html.DocumentNode.SelectNodes("//removeMe"))
{
    item.Remove();
}

But this removes the innerHtml as well. What should I do if I only want to remove the tag and keep the innerHtml?

Example:

<ul>
    <removeMe>
        <li>
            <a href="#">Keep me</a>
        </li>
    </removeMe>
</ul>

Any help would be much appreciated :)

10 answers:

Answer 0 (score: 20)

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);

var node = doc.DocumentNode.SelectSingleNode("//removeme");
node.ParentNode.RemoveChild(node, true);
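
The snippet above only touches the first match and would throw if nothing matches. Here is a minimal sketch of the same RemoveChild(node, true) call applied to every match. It assumes the classic Html Agility Pack behaviour where SelectNodes returns null rather than an empty collection when there are no matches. The class name UnwrapDemo and the sample markup are only for illustration.

```csharp
using System;
using HtmlAgilityPack;

class UnwrapDemo
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<ul><removeMe><li><a href=\"#\">Keep me</a></li></removeMe></ul>");

        // Assumption: SelectNodes returns null when nothing matches,
        // so guard before iterating.
        var nodes = doc.DocumentNode.SelectNodes("//removeme");
        if (nodes != null)
        {
            foreach (var node in nodes)
            {
                // keepGrandChildren = true asks the library to move the
                // node's children up into its parent before removing it.
                node.ParentNode.RemoveChild(node, true);
            }
        }

        Console.WriteLine(doc.DocumentNode.OuterHtml);
    }
}
```

Other answers in this thread report ordering and text-loss problems with keepGrandChildren in some library versions, so it is worth checking the output on your own markup.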

Answer 1 (score: 3)

This should work:

foreach (var item in doc.DocumentNode.SelectNodes("//removeMe"))
{
    if (item.PreviousSibling == null)
    {
        //First element -> so add it at beginning of the parent's innerhtml
        item.ParentNode.InnerHtml = item.InnerHtml + item.ParentNode.InnerHtml;
    }
    else
    {
        //There is an element before itemToRemove -> add the innerhtml after the previous item
        foreach(HtmlNode node in item.ChildNodes){
            item.PreviousSibling.ParentNode.InsertAfter(node, item.PreviousSibling);
        }
    }
    item.Remove();
}

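Answer 2 (score: 3)

The bool keepGrandChildren implementation has a problem for anyone who has text directly inside the element they are trying to remove. If the removeme tag contains text, that text is removed as well. For example, <removeme>text<p>more text</p></removeme> becomes <p>more text</p>.

Try this: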

private static void RemoveElementKeepText(HtmlNode node)
    {
        //node.ParentNode.RemoveChild(node, true);
        HtmlNode parent = node.ParentNode;
        HtmlNode prev = node.PreviousSibling;
        HtmlNode next = node.NextSibling;

        foreach (HtmlNode child in node.ChildNodes)
        {
            if (prev != null)
                parent.InsertAfter(child, prev);
            else if (next != null)
                parent.InsertBefore(child, next);
            else
                parent.AppendChild(child);

        }
        node.Remove();
    }
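
One thing to watch in the method above: prev is read once before the loop, so as far as I can tell, repeated InsertAfter(child, prev) calls can leave the children in reverse order when the node has a previous sibling. A sketch of an alternative that avoids the sibling bookkeeping is to insert every child before the node itself and then remove the node. The helper name Unwrap is hypothetical and not part of Html Agility Pack; only InsertBefore, ChildNodes, ParentNode and Remove come from the library.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

static class HtmlNodeUnwrap
{
    // Hypothetical helper (not a library method): replaces a node with its
    // own children, keeping text nodes and child elements in order.
    public static void Unwrap(HtmlNode node)
    {
        HtmlNode parent = node.ParentNode;
        if (parent == null)
        {
            return; // nothing to unwrap into
        }

        // Take a copy so the loop does not depend on how the library
        // treats node.ChildNodes while nodes are being re-parented.
        var children = new List<HtmlNode>(node.ChildNodes);

        foreach (HtmlNode child in children)
        {
            // Each child lands immediately before the node, so the
            // original order is preserved without tracking siblings.
            parent.InsertBefore(child, node);
        }

        node.Remove();
    }
}
```

Usage would be something like calling HtmlNodeUnwrap.Unwrap(node) for each node returned by SelectNodes("//removeme"), after checking the result for null.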

Answer 3 (score: 1)

There is a simple way:

 element.InnerHtml = element.InnerHtml.Replace("<br>", "{1}"); 
 var innerTextWithBR = element.InnerText.Replace("{1}", "<br>");

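Answer 4 (score: 1)

Adding my two cents, because none of these approaches handled what I wanted: removing a given set of tags (such as p and div) and handling nesting correctly while keeping the inner tags.

Here is what I came up with. It passes all of my unit tests, which cover most of the cases I could think of: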

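using System.Linq;
using HtmlAgilityPack;

public static class HtmlUtilities
{
    // Class name and signature are inferred from the calls in the tests below.
    public static string StripTags(string html, params string[] tagNames)
    {
        var htmlDoc = new HtmlDocument();

        // load html
        htmlDoc.LoadHtml(html);

        // Collect matching tags in reverse document order, so inner tags
        // are unwrapped before the tags that contain them.
        var tags = (from tag in htmlDoc.DocumentNode.Descendants()
                    where tagNames.Contains(tag.Name)
                    select tag).Reverse();

        // unwrap each matching tag
        foreach (var item in tags)
        {
            if (item.PreviousSibling == null)
            {
                // Prepend children to parent node in reverse order
                foreach (HtmlNode node in item.ChildNodes.Reverse())
                {
                    item.ParentNode.PrependChild(node);
                }
            }
            else
            {
                // Insert children after previous sibling
                foreach (HtmlNode node in item.ChildNodes)
                {
                    item.ParentNode.InsertAfter(node, item.PreviousSibling);
                }
            }

            // remove from tree
            item.Remove();
        }

        // return transformed doc
        return htmlDoc.DocumentNode.WriteContentTo().Trim();
    }
}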

Here are the cases I tested against:

[TestMethod]
public void StripTags_CanStripSingleTag()
{
    var input = "<p>tag</p>";
    var expected = "tag";
    var actual = HtmlUtilities.StripTags(input, "p");

    Assert.AreEqual(expected, actual);
}

[TestMethod]
public void StripTags_CanStripNestedTag()
{
    var input = "<p>tag <p>inner</p></p>";
    var expected = "tag inner";
    var actual = HtmlUtilities.StripTags(input, "p");

    Assert.AreEqual(expected, actual);
}

[TestMethod]
public void StripTags_CanStripTwoTopLevelTags()
{
    var input = "<p>tag</p> <div>block</div>";
    var expected = "tag block";
    var actual = HtmlUtilities.StripTags(input, "p", "div");

    Assert.AreEqual(expected, actual);
}

[TestMethod]
public void StripTags_CanStripMultipleNestedTags_2LevelsDeep()
{
    var input = "<p>tag <div>inner</div></p>";
    var expected = "tag inner";
    var actual = HtmlUtilities.StripTags(input, "p", "div");

    Assert.AreEqual(expected, actual);
}

[TestMethod]
public void StripTags_CanStripMultipleNestedTags_3LevelsDeep()
{
    var input = "<p>tag <div>inner <p>superinner</p></div></p>";
    var expected = "tag inner superinner";
    var actual = HtmlUtilities.StripTags(input, "p", "div");

    Assert.AreEqual(expected, actual);
}

[TestMethod]
public void StripTags_CanStripTwoTopLevelMultipleNestedTags_3LevelsDeep()
{
    var input = "<p>tag <div>inner <p>superinner</p></div></p> <div><p>inner</p> toplevel</div>";
    var expected = "tag inner superinner inner toplevel";
    var actual = HtmlUtilities.StripTags(input, "p", "div");

    Assert.AreEqual(expected, actual);
}

[TestMethod]
public void StripTags_IgnoresTagsThatArentSpecified()
{
    var input = "<p>tag <div>inner <a>superinner</a></div></p>";
    var expected = "tag inner <a>superinner</a>";
    var actual = HtmlUtilities.StripTags(input, "p", "div");

    Assert.AreEqual(expected, actual);

    input = "<wrapper><p>tag <div>inner</div></p></wrapper>";
    expected = "<wrapper>tag inner</wrapper>";
    actual = HtmlUtilities.StripTags(input, "p", "div");

    Assert.AreEqual(expected, actual);
}

[TestMethod]
public void StripTags_CanStripSelfClosingAndUnclosedTagsLikeBr()
{
    var input = "<p>tag</p><br><br/>";
    var expected = "tag";
    var actual = HtmlUtilities.StripTags(input, "p", "br");

    Assert.AreEqual(expected, actual);
}

It may not handle everything, but it works for my needs.

Answer 5 (score: 0)

Maybe this is what you are looking for?

foreach (HtmlNode node in html.DocumentNode.SelectNodes("//removeme"))
{
    HtmlNodeCollection children = node.ChildNodes; //get <removeme>'s children
    HtmlNode parent = node.ParentNode; //get <removeme>'s parent
    node.Remove(); //remove <removeme>
    parent.AppendChildren(children); //append the children to the parent
}

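Edit: L.B's answer is cleaner. Go with that one!

Answer 6 (score: 0)

How about this?

var removedNodes = document.DocumentNode.SelectNodes("//removeme");
if (removedNodes != null)
{
    foreach (var rn in removedNodes)
    {
        // Replace the element with a text node holding its inner markup.
        HtmlTextNode innerNodes = document.CreateTextNode(rn.InnerHtml);
        rn.ParentNode.ReplaceChild(innerNodes, rn);
    }
}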

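Answer 7 (score: 0)

Normally the correct expression would be node.ParentNode.RemoveChild(node, true).

Because of an ordering bug in HtmlNode.RemoveChild() (http://htmlagilitypack.codeplex.com/discussions/79587), I created a similar method. Sorry that it is in VB. If anyone wants a translation, I will write one.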

'The HTML Agility Pack (1.4.9) includes the HtmlNode.RemoveChild() method but it has an ordering bug with preserving child nodes.  
'The below implementation orders children correctly.
Private Shared Sub RemoveNode(node As HtmlAgilityPack.HtmlNode, keepChildren As Boolean)
    Dim parent = node.ParentNode
    If keepChildren Then
        For i = node.ChildNodes.Count - 1 To 0 Step -1
            parent.InsertAfter(node.ChildNodes(i), node)
        Next
    End If
    node.Remove()
End Sub

I tested this code with the following markup:

<removeme>
    outertextbegin
    <p>innertext1</p>
    <p>innertext2</p>
    outertextend
</removeme>

The output is:

outertextbegin
<p>innertext1</p>
<p>innertext2</p>
outertextend

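Answer 8 (score: 0)

This is a C# version of the answer posted on Dec 3 '14 at 17:57 by pseudocoder.

The site would not let me comment on or add to the original post. Maybe this will help someone.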

private void removeNode(HtmlAgilityPack.HtmlNode node, bool keepChildren)
{
    var parent = node.ParentNode;
    if (keepChildren)
    {
        for ( int i = node.ChildNodes.Count - 1; i >= 0; i--)
        {
            parent.InsertAfter(node.ChildNodes[i], node);
        }            
    }
    node.Remove(); 
}

Answer 9 (score: -3)

Could you do this with a regular expression, or do you need to use htmlagilitypack?

// Requires: using System.Text.RegularExpressions;
string html = "<ul><removeMe><li><a href=\"#\">Keep me</a></li></removeMe></ul>";

html = Regex.Replace(html, "<removeMe.*?>", "", RegexOptions.Compiled);
html = Regex.Replace(html, "</removeMe>", "", RegexOptions.Compiled);
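
For what it is worth, the two replacements can be folded into one pattern that matches both the opening and the closing tag. This is only a sketch using System.Text.RegularExpressions from the standard library. The class name RegexUnwrapDemo is made up for the example, and like any regex over HTML it will misbehave on input such as attribute values that contain ">" or tags inside comments.

```csharp
using System;
using System.Text.RegularExpressions;

class RegexUnwrapDemo
{
    static void Main()
    {
        string html = "<ul><removeMe><li><a href=\"#\">Keep me</a></li></removeMe></ul>";

        // "</?" matches an opening or closing tag, "\b" stops the name from
        // matching longer tag names, "[^>]*" skips any attributes.
        string result = Regex.Replace(
            html,
            @"</?removeMe\b[^>]*>",
            "",
            RegexOptions.IgnoreCase);

        Console.WriteLine(result); // <ul><li><a href="#">Keep me</a></li></ul>
    }
}
```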