Question

是否有免费的第三方或.NET类将HTML转换为RTF（用于启用富文本格式的Windows窗体控件）？

“免费”要求来自这样一个事实，即我只是在处理原型，只需加载BrowserControl并在需要时呈现HTML（即使它很慢）并且Developer Express将要发布他们很快就会控制自己。

我不想学习手工编写RTF，而且我已经知道HTML了，所以我认为这是快速获得一些可证明代码的最快方法。

Answer 1

实际上有一个简单的免费解决方案：使用您的浏览器，这就是我使用的技巧：

var webBrowser = new WebBrowser();
webBrowser.CreateControl(); // only if needed
webBrowser.DocumentText = *yourhtmlstring*;
while (_webBrowser.DocumentText != *yourhtmlstring*)
    Application.DoEvents();
webBrowser.Document.ExecCommand("SelectAll", false, null);
webBrowser.Document.ExecCommand("Copy", false, null);
*yourRichTextControl*.Paste();

这可能比其他方法慢，但至少它是免费的并且有效！

Answer 2

查看XHTML2RTF上的CodeProject文章。

Answer 3

扩展斯巴达克斯回答我实现了以下哪个有效！

    Using reportWebBrowser As New WebBrowser
        reportWebBrowser.CreateControl()
        reportWebBrowser.DocumentText = sbHTMLDoc.ToString
        While reportWebBrowser.DocumentText <> sbHTMLDoc.ToString
            Application.DoEvents()
        End While
        reportWebBrowser.Document.ExecCommand("SelectAll", False, Nothing)
        reportWebBrowser.Document.ExecCommand("Copy", False, Nothing)

        Using reportRichTextBox As New RichTextBox
            reportRichTextBox.Paste()
            reportRichTextBox.SaveFile(DocumentFileName)
        End Using
    End Using

Answer 4

当然不是很完美，但这里是我用来将HTML转换为纯文本的代码。

（我不是原作者，我是根据网络上的代码改编的）

public static string ConvertHtmlToText(string source) {

            string result;

            // Remove HTML Development formatting
            // Replace line breaks with space
            // because browsers inserts space
            result = source.Replace("\r", " ");
            // Replace line breaks with space
            // because browsers inserts space
            result = result.Replace("\n", " ");
            // Remove step-formatting
            result = result.Replace("\t", string.Empty);
            // Remove repeating speces becuase browsers ignore them
            result = System.Text.RegularExpressions.Regex.Replace(result,
                                                                  @"( )+", " ");

            // Remove the header (prepare first by clearing attributes)
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"<( )*head([^>])*>", "<head>",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"(<( )*(/)( )*head( )*>)", "</head>",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     "(<head>).*(</head>)", string.Empty,
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);

            // remove all scripts (prepare first by clearing attributes)
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"<( )*script([^>])*>", "<script>",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"(<( )*(/)( )*script( )*>)", "</script>",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            //result = System.Text.RegularExpressions.Regex.Replace(result, 
            //         @"(<script>)([^(<script>\.</script>)])*(</script>)",
            //         string.Empty, 
            //         System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"(<script>).*(</script>)", string.Empty,
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);

            // remove all styles (prepare first by clearing attributes)
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"<( )*style([^>])*>", "<style>",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"(<( )*(/)( )*style( )*>)", "</style>",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     "(<style>).*(</style>)", string.Empty,
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);

            // insert tabs in spaces of <td> tags
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"<( )*td([^>])*>", "\t",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);

            // insert line breaks in places of <BR> and <LI> tags
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"<( )*br( )*>", "\r",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"<( )*li( )*>", "\r",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);

            // insert line paragraphs (double line breaks) in place
            // if <P>, <DIV> and <TR> tags
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"<( )*div([^>])*>", "\r\r",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"<( )*tr([^>])*>", "\r\r",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"<( )*p([^>])*>", "\r\r",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);

            // Remove remaining tags like <a>, links, images,
            // comments etc - anything thats enclosed inside < >
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"<[^>]*>", string.Empty,
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);

            // replace special characters:
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"&nbsp;", " ",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);

            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"&bull;", " * ",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"&lsaquo;", "<",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"&rsaquo;", ">",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"&trade;", "(tm)",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"&frasl;", "/",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"<", "<",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @">", ">",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"&copy;", "(c)",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"&reg;", "(r)",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            // Remove all others. More can be added, see
            // http://hotwired.lycos.com/webmonkey/reference/special_characters/
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     @"&(.{2,6});", string.Empty,
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);


            // make line breaking consistent
            result = result.Replace("\n", "\r");

            // Remove extra line breaks and tabs:
            // replace over 2 breaks with 2 and over 4 tabs with 4. 
            // Prepare first to remove any whitespaces inbetween
            // the escaped characters and remove redundant tabs inbetween linebreaks
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     "(\r)( )+(\r)", "\r\r",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     "(\t)( )+(\t)", "\t\t",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     "(\t)( )+(\r)", "\t\r",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     "(\r)( )+(\t)", "\r\t",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            // Remove redundant tabs
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     "(\r)(\t)+(\r)", "\r\r",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            // Remove multible tabs followind a linebreak with just one tab
            result = System.Text.RegularExpressions.Regex.Replace(result,
                     "(\r)(\t)+", "\r\t",
                     System.Text.RegularExpressions.RegexOptions.IgnoreCase);
            // Initial replacement target string for linebreaks
            string breaks = "\r\r\r";
            // Initial replacement target string for tabs
            string tabs = "\t\t\t\t\t";
            for (int index = 0; index < result.Length; index++) {
                result = result.Replace(breaks, "\r\r");
                result = result.Replace(tabs, "\t\t\t\t");
                breaks = breaks + "\r";
                tabs = tabs + "\t";
            }

            // Thats it.
            return result;

    }

Answer 5

也许你需要的是a control to edit the HTML？

Answer 6

TL; DR：我建议尽可能使用pip install pyOpenSSL格式和OpenXml nuget程序包。

Microsoft Word COM

我并没有对此主题进行太多搜索，因为我的用例是使用服务器上的功能，这使得COM组件不是一个很好的选择。

XHTML2RTF

正如@JonathanParker提到的，您可以使用此代码项目库。

缺点是：

受支持的有限HTML和CSS
不是真正的.NET
...

Windows Forms Web浏览器

如@Spartaco所述，您可以使用Windows窗体HtmlToOpenXml控件。

缺点是：

对System.Windows.Forms的引用
使用复制和粘贴（多线程有问题）
仅在STA线程中工作

不支持的功能包括：

字体
颜色
编号列表
删除线（WebBrowser元素）
...

DevExpress

来自devexpress support center的“ Paul V”的代码示例。（03.02.2015）

del

或者您可以使用this example中所示的public String ConvertRTFToHTML(String RTF) { MemoryStream ms = new MemoryStream(); StreamWriter writer = new StreamWriter(ms); writer.Write(RTF); writer.Flush(); ms.Position = 0; String output = ""; HtmlEditorExtension.Import(HtmlEditorImportFormat.Rtf, ms, (s, enumerable) => output = s); return output; } public String ConvertHTMLToRTF(String Html) { MemoryStream ms = new MemoryStream(); var editor = new ASPxHtmlEditor { Html = html }; editor.Export(HtmlEditorExportFormat.Rtf, ms); ms.Position = 0; StreamReader reader = new StreamReader(ms); return reader.ReadToEnd(); }类型。

license for devexpress的价格从1500.-USD左右滑落至2200.-USD。

未知实际支持的内容。

缺点是：

价格
很多关于一件小事的参考书
更多？

不支持的功能包括：

打击区（RichEditDocumentServer元素）

Sautinsoft

del

更多示例和配置选项可以在here和here中找到。

licence for this component的价格从400.- USD到2000.- USD。

Supported is the following：

HTML 3.2
HTML 4.01
HTML 5
CSS
XHTML

缺点是：

我不确定开发的积极程度
价格

使用知识库：

从trix angular editor转换编号列表会破坏直觉

DIY

如果您只想支持有限的功能，则可以编写自己的转换器。如果支持的功能集太大，我不建议这样做。

我有一个小的sample project here，但目前仅出于教育目的。

OpenXml

如果您的用例还可以使用OpenXml format，则可以使用HtmlToOpenXml nuget package。它免费且确实支持我测试过其他解决方案的所有功能。

The project基于Microsoft的Open Xml SDK，并且似乎很活跃。

public string ConvertHTMLToRTF(string html)
{
    SautinSoft.HtmlToRtf h = new SautinSoft.HtmlToRtf();
    return h.ConvertString(htmlString);
}

public string ConvertRTFToHTML(string rtf)
{
    SautinSoft.RtfToHtml r = new SautinSoft.RtfToHtml();
    byte[] bytes = Encoding.ASCII.GetBytes(rtf);
    r.OpenDocx(bytes );
    return r.ToHtml();
}

Link to example gist

Answer 7

我建议使用一个名为https://issuetracker.google.com/issues/126946582的控制台工具。它不是组件，而是巨大的转换包。我正在用它在HTML和LaTeX之间转换。太棒了。

您可以找到Pandoc的支持格式的完整列表。

为了将HTML文档转换为RTF格式，请在控制台上编写：

pandoc filename.html -f html -t rtf -s -o filename.rtf

如何在不支付组件费用的情况下将HTML转换为.NET中的RTF（富文本）？

7 个答案: