Question

I have the following text, which still contains some HTML code:

v\:* {behavior:url(#default#VML);}
o\:* {behavior:url(#default#VML);}
w\:* {behavior:url(#default#VML);}
.shape {behavior:url(#default#VML);}


Hi There,
 
For the product team to have any chance in analysing this issue we need clarification on how to reproduce the problem.

At the moment my code is:

string replacedEmailText = Regex.Replace(emailText, @"<(.|\n)*?>", string.Empty);
string finalText = WebUtility.HtmlDecode(replacedEmailText);

How can I remove the lines at the top, such as:

v\:* {behavior:url(#default#VML);}

?

Answer 1

For this particular example, you can use .*\(#default#VML\);}(\r\n|\r|\n)* as the pattern and string.Empty as the replacement.

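However, this will fail when the message text itself contains the sequence (#default#VML);} because that line would be removed as well. If possible, you may want to describe in more detail what the leftover HTML lines look like.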

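Explanation:

.*: matches any character except a newline, zero or more times
\(#default#VML\);}: matches the literal sequence (#default#VML);}
(\r\n|\r|\n)*: matches the line feeds and carriage returns that follow, zero or more times, so they are removed too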
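
To show how this pattern could slot into the code from the question, here is a minimal sketch. The class name EmailCleaner and the method name Clean are made up for illustration, and running the VML cleanup right after the tag stripping is an assumption, not something the question specifies.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative only.
public static class EmailCleaner
{
    // Tag-stripping pattern taken from the question.
    private static readonly Regex HtmlTag = new Regex(@"<(.|\n)*?>");

    // Pattern from this answer: everything on a line up to and including
    // "(#default#VML);}", plus any line breaks that directly follow it.
    private static readonly Regex VmlLine =
        new Regex(@".*\(#default#VML\);}(\r\n|\r|\n)*");

    public static string Clean(string emailText)
    {
        // 1. Strip the tags, as in the question.
        string withoutTags = HtmlTag.Replace(emailText, string.Empty);

        // 2. Remove the leftover VML style lines (assumed ordering).
        string withoutVml = VmlLine.Replace(withoutTags, string.Empty);

        // 3. Decode HTML entities, as in the question.
        return WebUtility.HtmlDecode(withoutVml);
    }
}
```

Calling EmailCleaner.Clean(emailText) on the sample from the question should leave the text starting at "Hi There,". Any genuine line that contains (#default#VML);} would be removed as well, which is the limitation mentioned above.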


Answer 2

Do not try to strip HTML from text with regular expressions; use a whitelist-based library such as https://github.com/mganss/HtmlSanitizer instead.
