Question

I have some HTML code that I want to convert to plain text while keeping only the colored-text tags. For example, given the HTML below:

<body>

This is a <b>sample</b> html text.
<p align="center" style="color:#ff9999">this is only a sample</p>
....
and some other tags...
</body>
</html>

I want the output to be:

this is a sample html text.
<#ff9999>this is only a sample<>
....
and some other tags...

Answer 1

You can use a regular expression, but... you should not parse (X)HTML with regex.

The first regex I came up with to solve the problem is:

<p(\w|\s|[="])+color:(#([0-9a-f]{6}|[0-9a-f]{3}))">(\w|\s)+</p>

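Group 2 will be the color (# followed by 3 or 6 hex digits). Group 4 is meant to be the text inside the tag, but because that group is repeated with + it retains only the last character matched; wrap the repetition in one outer group to capture the whole text.

Obviously this is not the best solution, since I am no regex master, and it clearly needs some testing and probably some generalizing... but it is still a good start.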
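As a rough illustration of the regex approach, the sketch below reuses the pattern above with two changes that are my own assumptions, not part of the answer: the groups are named (color, text), and the text repetition is wrapped in a single group so the whole text is captured. The class and method names are hypothetical. It only handles simple <p> tags whose text consists of word characters and whitespace.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ColorTagRegexSketch
{
    // Same shape as the pattern above, with named groups and the text
    // repetition wrapped in one group so the full text is captured.
    private static readonly Regex ColoredParagraph = new Regex(
        @"<p(?:\w|\s|[=""])+color:(?<color>#(?:[0-9a-f]{6}|[0-9a-f]{3}))"">(?<text>(?:\w|\s)+)</p>",
        RegexOptions.IgnoreCase);

    // Rewrites each matching <p> element as <#color>text<>,
    // the output format shown in the question.
    public static string Convert(string html)
    {
        return ColoredParagraph.Replace(
            html,
            m => "<" + m.Groups["color"].Value + ">" + m.Groups["text"].Value + "<>");
    }

    public static void Main()
    {
        var html = "<p align=\"center\" style=\"color:#ff9999\">this is only a sample</p>";
        // prints: <#ff9999>this is only a sample<>
        Console.WriteLine(Convert(html));
    }
}
```

Anything outside that narrow shape (punctuation in the text, other style properties after the color, nested tags) is left unmatched, which is one reason a parser is the safer route.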

Answer 2

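I would use a parser such as HtmlAgilityPack to parse the HTML, and a regex to find the color value in the attribute.

First, use XPath to find all nodes that have a style attribute containing a color definition: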

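// Requires: using System.Linq; using HtmlAgilityPack;
var doc = new HtmlDocument();
doc.LoadHtml(html);
// SelectNodes returns null when nothing matches, so fall back to an empty sequence.
var nodes = (doc.DocumentNode
    .SelectNodes("//*[contains(@style, 'color')]")
    ?? Enumerable.Empty<HtmlNode>())
    .ToArray();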

Then the simplest regex to match the color value is (?<=color:\s*)#?\w+.

var colorRegex = new Regex(@"(?<=color:\s*)#?\w+", RegexOptions.IgnoreCase);
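To get a feel for what this pattern picks up, here is a small sketch that runs it against a few style strings. The class name, the ExtractColor helper, and the sample strings are hypothetical and only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ColorValueSketch
{
    private static readonly Regex ColorRegex =
        new Regex(@"(?<=color:\s*)#?\w+", RegexOptions.IgnoreCase);

    // Returns the first color value found in a style string, or null if there is none.
    public static string ExtractColor(string style)
    {
        var match = ColorRegex.Match(style);
        return match.Success ? match.Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(ExtractColor("color:#ff9999"));              // prints: #ff9999
        Console.WriteLine(ExtractColor("font-size:12px; color: red")); // prints: red
        // Caveat: "background-color:" also ends in "color:", so it matches as well.
        Console.WriteLine(ExtractColor("background-color:blue"));      // prints: blue
    }
}
```

The lookbehind keeps "color:" out of the match, so Value is just the color itself. Functional notations such as rgb(...) would be cut off at the opening parenthesis, so the pattern is best suited to hex values and named colors.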

Then iterate over these nodes and, if the regex matches, wrap the node's inner HTML in HTML-encoded tags (you will see why in a moment):

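foreach (var node in nodes)
{
    var style = node.Attributes["style"].Value;
    // Match once and reuse the result instead of calling IsMatch and then Match.
    var match = colorRegex.Match(style);
    if (match.Success)
    {
        var color = match.Value;
        // Encode the markers so they survive as text when InnerText strips the real tags.
        node.InnerHtml =
            HttpUtility.HtmlEncode("<" + color + ">") +
            node.InnerHtml +
            HttpUtility.HtmlEncode("</" + color + ">");
    }
}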

Finally, get the document's inner text and HTML-decode it (this is needed because InnerText strips all the markup):

var txt = HttpUtility.HtmlDecode(doc.DocumentNode.InnerText);

This should return something like:

This is a sample html text.
<#ff9999>this is only a sample</#ff9999>
....
and some other tags...

Of course, you can refine it to suit your needs.
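For reference, the steps above can be put together roughly as follows. This is a sketch under a few assumptions of mine rather than the answer's exact code: it uses System.Net.WebUtility in place of HttpUtility so it does not depend on System.Web, it guards against SelectNodes returning null, and it walks the matches in reverse document order because assigning InnerHtml re-parses a node's children, which would otherwise discard edits made on nested colored nodes. The class and method names are hypothetical.

```csharp
using System;
using System.Linq;
using System.Net;                       // WebUtility
using System.Text.RegularExpressions;
using HtmlAgilityPack;                  // NuGet package: HtmlAgilityPack

public static class ColoredTextSketch
{
    private static readonly Regex ColorRegex =
        new Regex(@"(?<=color:\s*)#?\w+", RegexOptions.IgnoreCase);

    public static string ToColoredText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//*[contains(@style, 'color')]");
        if (nodes != null)
        {
            // Descendants come after ancestors in document order, so reversing
            // processes nested nodes before the nodes that contain them.
            foreach (var node in Enumerable.Reverse(nodes))
            {
                var match = ColorRegex.Match(node.GetAttributeValue("style", ""));
                if (!match.Success) continue;

                // Encoded markers are plain text to the parser, so InnerText keeps them.
                node.InnerHtml =
                    WebUtility.HtmlEncode("<" + match.Value + ">") +
                    node.InnerHtml +
                    WebUtility.HtmlEncode("</" + match.Value + ">");
            }
        }

        // InnerText drops the real tags; decoding turns the markers back into <...>.
        return WebUtility.HtmlDecode(doc.DocumentNode.InnerText);
    }

    public static void Main()
    {
        var html = "<body>This is a <b>sample</b> html text." +
                   "<p align=\"center\" style=\"color:#ff9999\">this is only a sample</p></body>";
        Console.WriteLine(ToColoredText(html));
    }
}
```

To get the closing marker the question asks for, the second HtmlEncode call can encode "<>" instead of "</" + match.Value + ">".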
