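I need to remove nested bold and italic tags from HTML, keeping their content and keeping the top-level bold or italic tag.
For example, the following: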
<p><strong>Some text has<strong>been</strong>made bold</strong> and some text not bold</p>
would become:
<p><strong>Some text has been made bold</strong> and some text not bold</p>
This also needs to work for multiple levels of nesting, so the following:
<p><strong>Some text<strong> has<strong>been</strong>made bold</strong></strong> and some text not bold</p>
would also become:
<p><strong>Some text has been made bold</strong> and some text not bold</p>
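I started writing the following using HTML Agility Pack. It works when there is only one nested bold tag, but it does not seem to work correctly when there are multiple nested tags: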
// Loop through bold and italic tags
List<string> boldAndItalicTagNames = new List<string>() { "strong", "em" };
var boldAndItalicTags = htmlDoc.DocumentNode.SelectNodes(string.Join("|",
    boldAndItalicTagNames.Select(x => "//" + x)));
if (boldAndItalicTags != null)
{
    foreach (var tag in boldAndItalicTags)
    {
        // If the tag doesn't have any child nodes (i.e. it is empty)
        if (!tag.HasChildNodes)
        {
            // Remove the empty tag and continue to the next iteration
            tag.ParentNode.RemoveChild(tag);
            continue;
        }
        // If the tag has direct children of the same type (i.e. a strong tag with strong children)
        var childrenOfSameType = tag.ChildNodes.Where(x => x.Name == tag.Name).ToList();
        if (childrenOfSameType.Any())
        {
            // Loop through the matching child nodes, last to first
            for (var i = childrenOfSameType.Count - 1; i >= 0; i--)
            {
                // Get the child node and remove its tags but keep its content
                var child = childrenOfSameType[i];
                child.ParentNode.RemoveChild(child, true);
            }
        }
    }
}
Answer 0 (score: 0)
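I ended up with the following solution to this problem. It may not be the best solution, but it seems to work, so unless someone can come up with something better, it will have to do for now.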
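A minimal sketch of one possible approach, not necessarily the code this answer refers to. It selects every `strong` or `em` element that has an ancestor of the same name and unwraps it with `RemoveChild(node, true)`, working from the innermost match outward. The class name `NestedTagFlattener` and the method `Flatten` are hypothetical names chosen for illustration. It also assumes the installed HtmlAgilityPack version keeps grandchildren in their original order when unwrapping, which is worth verifying.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class NestedTagFlattener
{
    // Unwraps every element named in tagNames that sits inside another
    // element of the same name, keeping its children in place.
    public static string Flatten(string html, params string[] tagNames)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        foreach (var name in tagNames)
        {
            // The ancestor axis matches nesting at any depth, not only
            // direct children (e.g. strong > em > strong).
            var nested = doc.DocumentNode.SelectNodes($"//{name}[ancestor::{name}]");
            if (nested == null)
            {
                continue; // SelectNodes returns null when nothing matches
            }

            // Process the last (innermost) matches first so each node is
            // still attached to its parent when it is unwrapped.
            foreach (var node in nested.Reverse())
            {
                // keepGrandChildren: true removes the tag but keeps its content.
                node.ParentNode.RemoveChild(node, true);
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}

public static class Program
{
    public static void Main()
    {
        var html = "<p><strong>Some text<strong> has<strong>been</strong>made bold</strong></strong> and some text not bold</p>";
        Console.WriteLine(NestedTagFlattener.Flatten(html, "strong", "em"));
    }
}
```

This sketch only removes the nested tags. It does not insert whitespace, so text that had no space around a nested tag (such as `has<strong>been</strong>made`) stays joined, and it does not remove empty tags the way the question's code does.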