Question

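I need to migrate the contents of a Lotus Notes database to SharePoint. The entire database is exported as XML files (this requirement cannot be changed), and I have to parse those XML files and insert the data into SharePoint.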

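What trips me up is the elements that contain rich text. These XML elements hold a DXL representation of the exact rich-text formatting used in the Lotus Notes field, as described at http://publib.boulder.ibm.com/infocenter/domhelp/v8r0/index.jsp?topic=%2Fcom.ibm.designer.domino.main.doc%2FH_PARAGRAPH_DEFINITIONS_ELEMENT_XML.html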

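I do not need to keep the actual formatting of the text (unless that is as simple as retrieving plain text). However, if I just read the value of the XML element that contains the rich text (using LINQ to XML), I get plain text with no line breaks, which is unacceptable. In addition, embedded images show up in the retrieved text as base64-encoded strings (they are embedded in the XML).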

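Can anyone give me guidance on how to extract the text from the XML elements, either as properly formatted RTF that can be inserted into an RTF file, or as plain text that contains the correct line breaks and does not contain the embedded images?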

Answer 1

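Evidently the XML you are dealing with is DXL. A more elegant approach is to use an XSL transformation to convert it to HTML. You may find the XSLT stylesheet you need supplied with the PD4ML tool. From HTML, the document can be converted to PDF, RTF, or an image with PD4ML (or possibly to other formats with other tools).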
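To illustrate the XSLT route in .NET, here is a minimal sketch that runs a tiny hand-written stylesheet through `XslCompiledTransform`. It is not the PD4ML stylesheet; the class name `DxlToHtml`, the element names it matches (`richtext`, `par`, `break`, `picture`, `attachmentref`, `formula`, `pardef`), and the namespace `http://www.lotus.com/dxl` are assumptions that should be checked against the actual export.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class DxlToHtml
{
    // Hypothetical minimal stylesheet: paragraphs become <p>, breaks become <br>,
    // and elements carrying binary data or definitions are dropped.
    // The DXL namespace URI below is an assumption; verify it in the exported files.
    private const string Stylesheet = @"
<xsl:stylesheet version='1.0'
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
    xmlns:dxl='http://www.lotus.com/dxl'
    exclude-result-prefixes='dxl'>
  <xsl:output method='html' indent='no'/>
  <xsl:template match='dxl:richtext'><div><xsl:apply-templates/></div></xsl:template>
  <xsl:template match='dxl:par'><p><xsl:apply-templates/></p></xsl:template>
  <xsl:template match='dxl:break'><br/></xsl:template>
  <xsl:template match='dxl:picture | dxl:attachmentref | dxl:formula | dxl:pardef'/>
</xsl:stylesheet>";

    // Transforms a DXL rich-text fragment (as a string) into an HTML string.
    public static string Convert(string dxlFragment)
    {
        var xslt = new XslCompiledTransform();
        using (var sheet = XmlReader.Create(new StringReader(Stylesheet)))
        {
            xslt.Load(sheet);
        }

        using (var input = XmlReader.Create(new StringReader(dxlFragment)))
        using (var output = new StringWriter())
        {
            xslt.Transform(input, null, output);
            return output.ToString();
        }
    }
}
```

Elements without an explicit template (for example `run`) fall through to the built-in XSLT rules, which simply emit their text. If the input is a whole exported document rather than a single rich-text fragment, the text of other fields would be emitted the same way, so a real stylesheet would need more templates.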

Answer 2

You could convert the rich-text item content to HTML/MIME, which is the other supported format for rich-text items.

Or you could create an XPage or a form that displays the rich-text content at an HTTP URL, and reference that URL in the exported XML.

PANU

Answer 3

For now, I have simply stripped all XML tags and unwanted embedded elements from the rich-text XML element, using Regex with the following expressions:

        //Removes all attachmentref elements (lazy match, so text between two separate elements is kept)
        newString = new Regex(@"<attachmentref[\s\S]*?</attachmentref>").Replace(newString, "");
        //Removes all formula elements (lazy match, as above)
        newString = new Regex(@"<formula[\s\S]*?</formula>").Replace(newString, "");
        //Removes all xml tags (<par>, <pardef>, <table> etc). Be aware that this also removes any content in the table
        newString = new Regex("<(.)*/>").Replace(newString, "");
        newString = new Regex("<(.)*>").Replace(newString, "");
        newString = new Regex("</(.)*>").Replace(newString, ""); 

        //Trims the text to tidy up the many \n, \r and white-spaces introduced by removing the xml tags. 
        newString = new Regex(@"\r").Replace(newString, "\n");
        newString = new Regex(@"[ \f\r\t\v]+\n").Replace(newString, "\n");
        newString = new Regex(@"\n{2,}").Replace(newString, "\n");

        //makes < and > appear correctly in the text.
        newString = newString.Replace("&lt;", "<").Replace("&gt;", ">");

It is not pretty, but at least the text is readable and some sense of the line breaks is preserved.
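As an alternative to regular expressions, the same cleanup can be sketched with LINQ to XML by walking the rich-text element and emitting a newline at the end of each paragraph. This is only a sketch under assumptions: the class name `DxlPlainText` is hypothetical, and the element names (`par`, `break`, `pardef`, `picture`, `attachmentref`, `formula`) should be checked against the actual export. Matching on local names sidesteps the question of which namespace the export uses.

```csharp
using System.Collections.Generic;
using System.Text;
using System.Xml.Linq;

public static class DxlPlainText
{
    // Elements whose content should never reach the output (assumed names).
    private static readonly HashSet<string> Skipped =
        new HashSet<string> { "pardef", "picture", "attachmentref", "formula" };

    // Returns the text of a rich-text element with one line per paragraph or break.
    public static string Extract(XElement richText)
    {
        var sb = new StringBuilder();
        Walk(richText, sb);
        return sb.ToString().TrimEnd('\n');
    }

    private static void Walk(XElement element, StringBuilder sb)
    {
        foreach (XNode node in element.Nodes())
        {
            if (node is XText text)
            {
                // Plain character data (including CDATA) is copied as-is.
                sb.Append(text.Value);
            }
            else if (node is XElement child)
            {
                string name = child.Name.LocalName;
                if (Skipped.Contains(name))
                {
                    continue; // drop base64 images, attachments, formulas, definitions
                }
                if (name == "break")
                {
                    sb.Append('\n');
                    continue;
                }
                Walk(child, sb);
                if (name == "par")
                {
                    sb.Append('\n'); // end of paragraph
                }
            }
        }
    }
}
```

A call such as `DxlPlainText.Extract(XElement.Parse(xml))` returns the paragraph text separated by `\n`. Because the XML parser has already decoded entities such as `&lt;` and `&amp;`, no manual replacement step is needed, and table cell text is kept rather than discarded.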
