HTML Agility Pack - Remove nested strong and em tags

Date: 2018-02-08 16:31:09

Tags: c# html html-agility-pack

I need to remove nested bold and italic tags from HTML, keeping their content and keeping the top-level bold or italic tag.

For example, the following:

<p><strong>Some text has<strong>been</strong>made bold</strong> and some text not bold</p>

would become:

<p><strong>Some text has been made bold</strong> and some text not bold</p>

This also needs to work for multiple levels of nesting, so the following:

<p><strong>Some text<strong> has<strong>been</strong>made bold</strong></strong> and some text not bold</p>

would also become:

<p><strong>Some text has been made bold</strong> and some text not bold</p>

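I started writing the following using HTML Agility Pack. It works when there is only one nested bold tag, but it does not seem to work correctly when there are multiple nested tags: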

// Loop through bold and italic tags
List<string> boldAndItalicTagNames = new List<string>() { "strong", "em" };
var boldAndItalicTags = htmlDoc.DocumentNode.SelectNodes(string.Join("|", 
boldAndItalicTagNames.Select(x => "//" + x)));
if (boldAndItalicTags != null)
{
    foreach (var tag in boldAndItalicTags)
    {
        // If tag doesn't have any child nodes (i.e. it is empty)
        if (!tag.HasChildNodes)
        {
            // Remove child and continue to next iteration
            tag.ParentNode.RemoveChild(tag);
            continue;
        }

        // If tag has children of same type (i.e. if strong tag has children strong tags)
        var childrenOfSameType = tag.ChildNodes.Where(x => x.Name == tag.Name).ToList();
        if (childrenOfSameType.Any())
        {
            // Loop through child nodes
            for (var i = childrenOfSameType.Count - 1; i >= 0; i--)
            {
                // Get child node and remove tags but keep content
                var child = childrenOfSameType[i];
                child.ParentNode.RemoveChild(child, true);
            }
        }
    }
}

1 answer:

Answer 0 (score: 0)

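I ended up with the following solution to this problem. It may not be the best solution, but it seems to be working, so unless someone can think of something better, it will have to do for now.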
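
The answer's code is not reproduced here. As a rough sketch of one possible approach (not necessarily what the answerer used), the XPath could select only tags that have an ancestor of the same name, and each match could then be unwrapped starting from the innermost one. The class `NestedTagCleaner` and method `RemoveNestedTags` are hypothetical names made up for this illustration; `LoadHtml`, `SelectNodes`, `InsertBefore` and `RemoveChild` are HTML Agility Pack calls. Unlike the code in the question, this sketch does not remove empty tags.

```csharp
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper for illustration; not part of HTML Agility Pack.
public static class NestedTagCleaner
{
    public static string RemoveNestedTags(string html, params string[] tagNames)
    {
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        // Builds e.g. "//strong[ancestor::strong]|//em[ancestor::em]",
        // which matches only tags nested (at any depth) inside a tag of the same name.
        var xpath = string.Join("|", tagNames.Select(x => $"//{x}[ancestor::{x}]"));
        var nestedTags = htmlDoc.DocumentNode.SelectNodes(xpath);

        // SelectNodes returns null when nothing matches.
        if (nestedTags == null)
        {
            return html;
        }

        // Matches come back in document order, so walking them in reverse
        // unwraps inner tags before the tags that contain them.
        foreach (var tag in nestedTags.Reverse())
        {
            var parent = tag.ParentNode;

            // Move each child in front of the tag (preserving order), then drop the tag itself.
            foreach (var child in tag.ChildNodes.ToList())
            {
                parent.InsertBefore(child, tag);
            }
            parent.RemoveChild(tag);
        }

        return htmlDoc.DocumentNode.OuterHtml;
    }
}
```

It could be called as `NestedTagCleaner.RemoveNestedTags(html, "strong", "em")`. Because the ancestor check is not limited to direct children, it should also catch a `strong` nested inside another element that is itself inside a `strong`. Note that unwrapping only removes the tags: it does not insert whitespace, so text such as `has<strong>been</strong>made` would be joined without spaces.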
