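I've installed FCKeditor, and when I paste from MS Word it adds a lot of unnecessary formatting. I want to keep certain things like bold, italics, bullets and so on. I've searched the web and found solutions, but they strip everything, even the bold and italics I want to keep. Is there a way to strip out just the unnecessary Word formatting?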
// Requires: using System.Text.RegularExpressions;
public string CleanHtml(string html)
{
    // Cleans all manner of evils from the rich text editors in IE, Firefox, Word, and Excel
    // Only returns acceptable HTML, and converts line breaks to <br />
    // Acceptable HTML includes HTML-encoded entities.
    html = html.Replace("&" + "nbsp;", " ").Trim(); // concat here due to SO formatting
    // Does this have HTML tags?
    if (html.IndexOf("<") >= 0)
    {
        // Make all tags lowercase
        html = Regex.Replace(html, "<[^>]+>", delegate(Match m) {
            return m.ToString().ToLower();
        });
        // Filter out anything except allowed tags
        // Problem: this strips attributes, including href from a
        // http://stackoverflow.com/questions/307013/how-do-i-filter-all-html-tags-except-a-certain-whitelist
        string AcceptableTags = "i|b|u|sup|sub|ol|ul|li|br|h2|h3|h4|h5|span|div|p|a|img|blockquote";
        string WhiteListPattern = "</?(?(?=" + AcceptableTags + @")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?>";
        html = Regex.Replace(html, WhiteListPattern, "", RegexOptions.Compiled);
        // Make all BR/br tags look the same, and trim them of whitespace before/after
        html = Regex.Replace(html, @"\s*<br[^>]*>\s*", "<br />", RegexOptions.Compiled);
    }
    // No CRs
    html = html.Replace("\r", "");
    // Convert remaining LFs to line breaks
    html = html.Replace("\n", "<br />");
    // Trim BRs at the end of any string, and spaces on either side
    return Regex.Replace(html, "(<br />)+$", "", RegexOptions.Compiled).Trim();
}
Public Shared Function CleanHtml(ByVal html As String) As String
'' Cleans all manner of evils from the rich text editors in IE, Firefox, Word, and Excel
'' Only returns acceptable HTML, and converts line breaks to <br />
'' Acceptable HTML includes HTML-encoded entities.
html = html.Replace("&" & "nbsp;", " ").Trim() ' concat here due to SO formatting
'' Does this have HTML tags?
If html.IndexOf("<") >= 0 Then
'' Make all tags lowercase
html = RegEx.Replace(html, "<[^>]+>", AddressOf LowerTag)
'' Filter out anything except allowed tags
'' Problem: this strips attributes, including href from a
'' http://stackoverflow.com/questions/307013/how-do-i-filter-all-html-tags-except-a-certain-whitelist
Dim AcceptableTags As String = "i|b|u|sup|sub|ol|ul|li|br|h2|h3|h4|h5|span|div|p|a|img|blockquote"
Dim WhiteListPattern As String = "</?(?(?=" & AcceptableTags & ")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?>"
html = Regex.Replace(html, WhiteListPattern, "", RegExOptions.Compiled)
'' Make all BR/br tags look the same, and trim them of whitespace before/after
html = RegEx.Replace(html, "\s*<br[^>]*>\s*", "<br />", RegExOptions.Compiled)
End If
'' No CRs
html = html.Replace(controlChars.CR, "")
'' Convert remaining LFs to line breaks
html = html.Replace(controlChars.LF, "<br />")
'' Trim BRs at the end of any string, and spaces on either side
Return RegEx.Replace(html, "(<br />)+$", "", RegExOptions.Compiled).Trim()
End Function
Public Shared Function LowerTag(m As Match) As String
Return m.ToString().ToLower()
End Function
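In your case, you'll want to modify the list of "approved" HTML tags in AcceptableTags. The code will still strip out all the useless attributes (and, unfortunately, the useful ones like HREF and SRC; hopefully those aren't important to you).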
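To make the "modify AcceptableTags" advice concrete, here is a minimal sketch of a whitelist narrowed to bold, italics and bullets. It is not the code above: it swaps the conditional regex for a MatchEvaluator, which I find easier to check. The class name WhitelistSketch, the method KeepOnly and the exact tag list are my own assumptions for illustration. Like the answer describes, it drops every attribute, so href and src are lost.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitelistSketch
{
    // Hypothetical whitelist for the question's case: bold, italics, bullet lists.
    private const string AcceptableTags = "b|strong|i|em|ul|ol|li|p|br";

    // Any opening or closing tag: group 1 = optional slash, group 2 = tag name.
    // Comments and <!...> declarations are not matched and are left alone.
    private static readonly Regex AnyTag =
        new Regex(@"<(/?)([a-zA-Z][a-zA-Z0-9:]*)\b[^>]*>");

    // True when the whole tag name is on the whitelist.
    private static readonly Regex Allowed =
        new Regex("^(?:" + AcceptableTags + ")$", RegexOptions.IgnoreCase);

    public static string KeepOnly(string html)
    {
        // Allowed tags are re-emitted lowercase with no attributes
        // (a self-closing slash is dropped too); all other tags are removed.
        return AnyTag.Replace(html, m =>
            Allowed.IsMatch(m.Groups[2].Value)
                ? "<" + m.Groups[1].Value + m.Groups[2].Value.ToLowerInvariant() + ">"
                : "");
    }
}
```

For example, WhitelistSketch.KeepOnly("<p class=MsoNormal><b style='x'>Bold</b> and <span lang=EN>plain</span><o:p></o:p></p>") returns "<p><b>Bold</b> and plain</p>". Matching on the full tag name also means body is not mistaken for b.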
This code worked for me:
// Requires: using System.Collections.Specialized; using System.Text.RegularExpressions;
static string CleanWordHtml(string html)
{
    StringCollection sc = new StringCollection();
    // get rid of unnecessary tag spans (comments and title)
    sc.Add(@"<!--(\w|\W)+?-->");
    sc.Add(@"<title>(\w|\W)+?</title>");
    // Get rid of classes and styles
    sc.Add(@"\s?class=\w+");
    sc.Add(@"\s+style='[^']+'");
    // Get rid of unnecessary tags
    sc.Add(@"<(meta|link|/?o:|/?style|/?div|/?st\d|/?head|/?html|body|/?body|/?span|!\[)[^>]*?>");
    // Get rid of empty paragraph tags
    sc.Add(@"(<[^>]+>)+ (</\w+>)+");
    // remove bizarre v: element attached to <img> tag
    sc.Add(@"\s+v:\w+=""[^""]+""");
    // remove extra lines
    sc.Add(@"(\n\r){2,}");
    foreach (string s in sc)
    {
        html = Regex.Replace(html, s, "", RegexOptions.IgnoreCase);
    }
    return html;
}
Edit: Ah, I see. Looking more closely at the FCKeditor site, it's an HTML editor, not one of the simple text-only editors I'm used to. The site lists "Paste from Word cleanup with autodetection".
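For my solution, I ended up using a combination of the C# version of the CleanHtml function and the part that strips out the MS Office tags. It's essentially a code-based version of Glenn's process. I'll see what happens when I throw a huge Excel spreadsheet at it.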
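A rough sketch of how the two passes could be chained, in case it helps anyone doing the same. CleanupPipeline and Run are hypothetical names of mine, and the order (Office-specific cleanup first, whitelist second) is an assumption, since the post doesn't say which pass runs first.

```csharp
using System;
using System.Linq;

public static class CleanupPipeline
{
    // Applies each cleanup step in order, feeding the output of one step
    // into the next. With no steps, the input is returned unchanged.
    public static string Run(string html, params Func<string, string>[] steps)
    {
        return steps.Aggregate(html, (current, step) => step(current));
    }
}
```

Assuming static CleanWordHtml and CleanHtml methods with the signatures shown in this thread, a call would look like CleanupPipeline.Run(pasted, CleanWordHtml, CleanHtml). Keeping the steps as separate functions makes it easy to test each pass on its own before trying it on a large paste.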