Saving XML as UTF-8 with MSXML

Time: 2010-04-07 21:31:12

Tags: xml localization vbscript utf-8 msxml

I am trying to load a simple XML file (encoded in UTF-8):

<?xml version="1.0" encoding="UTF-8"?>
<Test/>

and save it with MSXML from VBScript:

Set xmlDoc = CreateObject("MSXML2.DOMDocument.6.0")
xmlDoc.async = False   ' load synchronously so Save only runs after parsing has finished

xmlDoc.Load "C:\test.xml"

xmlDoc.Save "C:\test.xml"

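The problem is that MSXML saves the file as ANSI rather than UTF-8, even though the original file is encoded in UTF-8.

The MSDN docs for MSXML say that save() writes the file in whatever encoding the XML declares:

"Character encoding is based on the encoding attribute in the XML declaration. When no encoding attribute is specified, the default setting is UTF-8."

But this clearly does not happen on my machine.

How can I get MSXML to save as UTF-8?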

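3 answers:

Answer 0 (score: 3)

The XML file contains no non-ASCII text, so its bytes are identical whether it is encoded as UTF-8 or as ASCII. Windows ANSI code pages such as windows-1252 also agree with ASCII on bytes 0x00-0x7F, which is why an editor may label such a file "ANSI". In my tests, after adding non-ASCII text to test.xml, MSXML always saved it in UTF-8, and it also wrote a BOM if the file had one to begin with.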

http://en.wikipedia.org/wiki/UTF-8
http://en.wikipedia.org/wiki/Byte_order_mark
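
One way to check this on your own machine is to look at the raw bytes rather than trust an editor's encoding label. The following is an untested sketch, assuming the file from the question lives at C:\test.xml and that ADODB.Stream is available (it ships with Windows). It reports whether the file starts with the UTF-8 BOM (EF BB BF) and whether it contains any byte above 0x7F. If both answers are False, the file is pure ASCII and therefore already valid UTF-8.

```vbscript
' Sketch: inspect the raw bytes of a file for a UTF-8 BOM and for non-ASCII bytes.
' Assumption: C:\test.xml is the (non-empty) file from the question; adjust as needed.
Option Explicit

Const adTypeBinary = 1

Dim stream, bytes, i, hasBom, hasNonAscii

Set stream = CreateObject("ADODB.Stream")
stream.Type = adTypeBinary
stream.Open
stream.LoadFromFile "C:\test.xml"
bytes = stream.Read()          ' Variant holding the whole file as a byte array
stream.Close

' A UTF-8 BOM is the three bytes EF BB BF at the very start of the file
hasBom = False
If LenB(bytes) >= 3 Then
    hasBom = (AscB(MidB(bytes, 1, 1)) = &HEF And _
              AscB(MidB(bytes, 2, 1)) = &HBB And _
              AscB(MidB(bytes, 3, 1)) = &HBF)
End If

' Any byte above 0x7F means the content is not plain ASCII
hasNonAscii = False
For i = 1 To LenB(bytes)
    If AscB(MidB(bytes, i, 1)) > &H7F Then
        hasNonAscii = True
        Exit For
    End If
Next

WScript.Echo "UTF-8 BOM present: " & hasBom
WScript.Echo "Bytes above 0x7F:  " & hasNonAscii
```

Run it with cscript.exe before and after xmlDoc.Save to compare the two results.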

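Answer 1 (score: 3)

You can use two other MSXML classes (MXXMLWriter and SAXXMLReader) to write properly encoded XML to an output stream.

Here is my helper method, which writes to a generic IStream: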

class procedure TXMLHelper.WriteDocumentToStream(const Document60: IXMLDOMDocument2; const stream: IStream; Encoding: string = 'UTF-8');
var
    writer: IMXWriter;
    reader: IVBSAXXMLReader;
begin
{
    From http://support.microsoft.com/kb/275883
    INFO: XML Encoding and DOM Interface Methods

    MSXML has native support for the following encodings:
        UTF-8
        UTF-16
        UCS-2
        UCS-4
        ISO-10646-UCS-2
        UNICODE-1-1-UTF-8
        UNICODE-2-0-UTF-16
        UNICODE-2-0-UTF-8

    It also recognizes (internally using the WideCharToMultibyte API function for mappings) the following encodings:
        US-ASCII
        ISO-8859-1
        ISO-8859-2
        ISO-8859-3
        ISO-8859-4
        ISO-8859-5
        ISO-8859-6
        ISO-8859-7
        ISO-8859-8
        ISO-8859-9
        WINDOWS-1250
        WINDOWS-1251
        WINDOWS-1252
        WINDOWS-1253
        WINDOWS-1254
        WINDOWS-1255
        WINDOWS-1256
        WINDOWS-1257
        WINDOWS-1258
}

    if Document60 = nil then
        raise Exception.Create('TXMLHelper.WriteDocumentToStream: Document60 cannot be nil');
    if stream = nil then
        raise Exception.Create('TXMLHelper.WriteDocumentToStream: stream cannot be nil');

    // Set properties on the XML writer - including BOM, XML declaration and encoding
    writer := CoMXXMLWriter60.Create;
    writer.byteOrderMark := True; //Determines whether to write the Byte Order Mark (BOM). The byteOrderMark property has no effect for BSTR or DOM output. (Default True)
    writer.omitXMLDeclaration := False; //Forces the IMXWriter to skip the XML declaration. Useful for creating document fragments. (Default False)
    writer.encoding := Encoding; //Sets and gets encoding for the output. (Default "UTF-16")
    writer.indent := True; //Sets whether to indent output. (Default False)
    writer.standalone := True;

    // Set the XML writer to the SAX content handler.
    reader := CoSAXXMLReader60.Create;
    reader.contentHandler := writer as IVBSAXContentHandler;
    reader.dtdHandler := writer as IVBSAXDTDHandler;
    reader.errorHandler := writer as IVBSAXErrorHandler;
    reader.putProperty('http://xml.org/sax/properties/lexical-handler', writer);
    reader.putProperty('http://xml.org/sax/properties/declaration-handler', writer);


    writer.output := stream; //The resulting document will be written into the provided IStream

    // Now pass the DOM through the SAX handler, and it will call the writer
    reader.parse(Document60);

    writer.flush;
end;

To save to a file, I call the stream version with a TFileStream:

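class procedure TXMLHelper.WriteDocumentToFile(const Document60: IXMLDOMDocument2; const filename: string; Encoding: string='UTF-8');
var
    fs: TFileStream;
    adapter: IStream;
begin
    fs := TFileStream.Create(filename, fmCreate or fmShareDenyWrite);
    try
        // No TStream overload is shown in this answer. As an assumption, wrap the
        // TFileStream in a TStreamAdapter (Classes unit) so it can be passed as an IStream.
        // soReference leaves ownership of fs with this procedure.
        adapter := TStreamAdapter.Create(fs, soReference);
        TXMLHelper.WriteDocumentToStream(Document60, adapter, Encoding);
        adapter := nil; // release the adapter before the underlying stream is freed
    finally
        fs.Free;
    end;
end;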

You can convert these functions to whatever language you like. These are Delphi.
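
Since the question uses VBScript, here is a rough port of the same idea. Treat it as an untested sketch rather than a drop-in replacement. The helper name WriteDocumentToFile and the output path C:\test-utf8.xml are hypothetical, and an in-memory ADODB.Stream stands in for the Delphi IStream (an assumption: the writer accepts it as its output). The writer and reader properties mirror the Delphi code above.

```vbscript
' Sketch: serialize an MSXML 6 DOM through SAXXMLReader + MXXMLWriter in a chosen encoding.
' WriteDocumentToFile is a hypothetical helper name, not part of MSXML.
Option Explicit

Const adTypeBinary = 1
Const adSaveCreateOverWrite = 2

Sub WriteDocumentToFile(xmlDoc, fileName, encodingName)
    Dim writer, reader, stream

    ' In-memory binary stream that receives the encoded bytes
    Set stream = CreateObject("ADODB.Stream")
    stream.Type = adTypeBinary
    stream.Open

    Set writer = CreateObject("MSXML2.MXXMLWriter.6.0")
    writer.byteOrderMark = True
    writer.omitXMLDeclaration = False
    writer.encoding = encodingName
    writer.indent = True
    writer.output = stream     ' plain assignment: output is a Variant property

    ' Route the reader's SAX events to the writer
    Set reader = CreateObject("MSXML2.SAXXMLReader.6.0")
    Set reader.contentHandler = writer
    Set reader.dtdHandler = writer
    Set reader.errorHandler = writer
    reader.putProperty "http://xml.org/sax/properties/lexical-handler", writer
    reader.putProperty "http://xml.org/sax/properties/declaration-handler", writer

    ' Replay the DOM as SAX events; the writer encodes them into the stream
    reader.parse xmlDoc
    writer.flush

    stream.SaveToFile fileName, adSaveCreateOverWrite
    stream.Close
End Sub

' Usage: load the file from the question and write a UTF-8 copy
Dim doc
Set doc = CreateObject("MSXML2.DOMDocument.6.0")
doc.async = False
If doc.Load("C:\test.xml") Then
    WriteDocumentToFile doc, "C:\test-utf8.xml", "UTF-8"
Else
    WScript.Echo "Load failed: " & doc.parseError.reason
End If
```

Unlike xmlDoc.Save, this route sets the encoding and the BOM explicitly on the writer instead of depending on what the loaded document declares.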

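Answer 2 (score: 1)

When you call load, MSXML does not copy the encoding from the processing instruction into the document it creates. The document therefore carries no encoding, and MSXML seems to pick whatever it likes. In my environment that is UTF-16, which I do not want.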
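
Whether the declaration survives loading can be checked directly, because MSXML exposes the XML declaration as a processing-instruction node named "xml" at the top of the document. The sketch below is untested and assumes the file from the question at C:\test.xml. It prints the declaration's content as held by the loaded DOM, so you can see whether an encoding is still present before you decide to replace it.

```vbscript
' Sketch: show the XML declaration (if any) that the loaded DOM carries.
' Assumption: C:\test.xml is the file from the question.
Option Explicit

Const NODE_PROCESSING_INSTRUCTION = 7

Dim xmlDoc, first

Set xmlDoc = CreateObject("MSXML2.DOMDocument.6.0")
xmlDoc.async = False

If Not xmlDoc.Load("C:\test.xml") Then
    WScript.Echo "Load failed: " & xmlDoc.parseError.reason
    WScript.Quit 1
End If

Set first = xmlDoc.firstChild
If first Is Nothing Then
    WScript.Echo "Document is empty"
ElseIf first.nodeType = NODE_PROCESSING_INSTRUCTION And first.nodeName = "xml" Then
    ' nodeValue holds the declaration's pseudo-attributes (version, encoding, ...)
    WScript.Echo "Declaration: " & first.nodeValue
Else
    WScript.Echo "No XML declaration node found"
End If
```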

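The solution is to supply a processing instruction and specify the encoding there. If you know the document has no processing instruction, the code is simple: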

Set pi = xmlDoc.createProcessingInstruction("xml", _
         "version=""1.0"" encoding=""windows-1250""")
If xmlDoc.childNodes.Length > 0 Then
  Call xmlDoc.insertBefore(pi, xmlDoc.childNodes.Item(0))
End If

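If the document may already contain a processing instruction, it must be removed first (so the code below has to run before the code above). I did not know how to do this with selectNodes, so I simply iterated over all top-level nodes: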

For ich = xmlDoc.childNodes.Length - 1 To 0 Step -1
  Set ch = xmlDoc.childNodes.Item(ich)
  ' Remove only the XML declaration, which appears as a processing instruction named "xml"
  If ch.nodeTypeString = "processinginstruction" And ch.nodeName = "xml" Then
    xmlDoc.removeChild ch
  End If
Next

Apologies if the code does not run as-is: I adapted it from a working version written in a custom language, not VBScript.