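Converting XML from Latin1 to UTF-8 and vice versa

Asked: 2016-10-05 08:47:59

Tags: xml powershell utf-8 converter iso-8859-1

I'm trying to convert an XML file from Latin1 to UTF-8 and vice versa. I've been running some tests, but I haven't managed to get it working. I'm using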

Get-Content C:\inputfile.xml | Set-Content -Encoding utf8 C:\outputfile.xml

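But this doesn't convert anything. So I tried specifying the encoding in Get-Content, but PowerShell doesn't recognize Latin1 (or at least that's what the error message tells me). What is the best way to do this?

2 answers:

Answer 0 (score: 2)

The fastest approach, especially for large XML files, is to use the .NET System.IO.File class.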

  • Use ReadAllText with an explicitly specified Latin-1 encoding:

    [IO.File]::ReadAllText('r:\1.txt', [Text.Encoding]::GetEncoding('iso-8859-1')) | 
        Set-Content r:\2.txt -Encoding UTF8
    
  • If your XML file contains <?xml version="1.0" encoding="iso-8859-1" ?>, the declaration needs to be changed as well:

    [IO.File]::ReadAllText('r:\1.txt', [Text.Encoding]::GetEncoding('iso-8859-1')).
        Replace('<?xml version="1.0" encoding="iso-8859-1"',
                '<?xml version="1.0" encoding="UTF-8"') | 
        Set-Content r:\2.txt -Encoding UTF8
    
  • To write Latin-1 output, use WriteAllText with an explicitly specified Latin-1 encoding:

    [IO.File]::WriteAllText(
        'r:\2.txt',
        [IO.File]::ReadAllText('r:\1.txt', [Text.Encoding]::UTF8).
            Replace('<?xml version="1.0" encoding="UTF-8"',
                    '<?xml version="1.0" encoding="iso-8859-1"'),
        [Text.Encoding]::GetEncoding('iso-8859-1')
    )
    
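  • Memory-efficient transcoding that can handle files of any size (1 TB? No problem!), because it streams the file through a fixed-size buffer instead of loading it whole: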

    function transcodeXML(
        [ValidateScript({Test-Path -Literal $_})]
        [string]$source,
        [ValidateSet('IBM037', 'IBM437', 'IBM500', 'ASMO-708', 'DOS-720', 'ibm737', 'ibm775', 'ibm850', 'ibm852', 'IBM855', 'ibm857', 'IBM00858', 'IBM860', 'ibm861', 'DOS-862', 'IBM863', 'IBM864', 'IBM865', 'cp866', 'ibm869', 'IBM870', 'windows-874', 'cp875', 'shift_jis', 'gb2312', 'ks_c_5601-1987', 'big5', 'IBM1026', 'IBM01047', 'IBM01140', 'IBM01141', 'IBM01142', 'IBM01143', 'IBM01144', 'IBM01145', 'IBM01146', 'IBM01147', 'IBM01148', 'IBM01149', 'utf-16', 'utf-16BE', 'windows-1250', 'windows-1251', 'Windows-1252', 'windows-1253', 'windows-1254', 'windows-1255', 'windows-1256', 'windows-1257', 'windows-1258', 'Johab', 'macintosh', 'x-mac-japanese', 'x-mac-chinesetrad', 'x-mac-korean', 'x-mac-arabic', 'x-mac-hebrew', 'x-mac-greek', 'x-mac-cyrillic', 'x-mac-chinesesimp', 'x-mac-romanian', 'x-mac-ukrainian', 'x-mac-thai', 'x-mac-ce', 'x-mac-icelandic', 'x-mac-turkish', 'x-mac-croatian', 'utf-32', 'utf-32BE', 'x-Chinese-CNS', 'x-cp20001', 'x-Chinese-Eten', 'x-cp20003', 'x-cp20004', 'x-cp20005', 'x-IA5', 'x-IA5-German', 'x-IA5-Swedish', 'x-IA5-Norwegian', 'us-ascii', 'x-cp20261', 'x-cp20269', 'IBM273', 'IBM277', 'IBM278', 'IBM280', 'IBM284', 'IBM285', 'IBM290', 'IBM297', 'IBM420', 'IBM423', 'IBM424', 'x-EBCDIC-KoreanExtended', 'IBM-Thai', 'koi8-r', 'IBM871', 'IBM880', 'IBM905', 'IBM00924', 'EUC-JP', 'x-cp20936', 'x-cp20949', 'cp1025', 'koi8-u', 'iso-8859-1', 'iso-8859-2', 'iso-8859-3', 'iso-8859-4', 'iso-8859-5', 'iso-8859-6', 'iso-8859-7', 'iso-8859-8', 'iso-8859-9', 'iso-8859-13', 'iso-8859-15', 'x-Europa', 'iso-8859-8-i', 'iso-2022-jp', 'csISO2022JP', 'iso-2022-jp', 'iso-2022-kr', 'x-cp50227', 'euc-jp', 'EUC-CN', 'euc-kr', 'hz-gb-2312', 'GB18030', 'x-iscii-de', 'x-iscii-be', 'x-iscii-ta', 'x-iscii-te', 'x-iscii-as', 'x-iscii-or', 'x-iscii-ka', 'x-iscii-ma', 'x-iscii-gu', 'x-iscii-pa', 'utf-7', 'utf-8')]
        [string]$sourceEncoding,
    
        [ValidateScript({Test-Path -Literal $_ -IsValid})]
        [string]$target,
        [ValidateSet('IBM037', 'IBM437', 'IBM500', 'ASMO-708', 'DOS-720', 'ibm737', 'ibm775', 'ibm850', 'ibm852', 'IBM855', 'ibm857', 'IBM00858', 'IBM860', 'ibm861', 'DOS-862', 'IBM863', 'IBM864', 'IBM865', 'cp866', 'ibm869', 'IBM870', 'windows-874', 'cp875', 'shift_jis', 'gb2312', 'ks_c_5601-1987', 'big5', 'IBM1026', 'IBM01047', 'IBM01140', 'IBM01141', 'IBM01142', 'IBM01143', 'IBM01144', 'IBM01145', 'IBM01146', 'IBM01147', 'IBM01148', 'IBM01149', 'utf-16', 'utf-16BE', 'windows-1250', 'windows-1251', 'Windows-1252', 'windows-1253', 'windows-1254', 'windows-1255', 'windows-1256', 'windows-1257', 'windows-1258', 'Johab', 'macintosh', 'x-mac-japanese', 'x-mac-chinesetrad', 'x-mac-korean', 'x-mac-arabic', 'x-mac-hebrew', 'x-mac-greek', 'x-mac-cyrillic', 'x-mac-chinesesimp', 'x-mac-romanian', 'x-mac-ukrainian', 'x-mac-thai', 'x-mac-ce', 'x-mac-icelandic', 'x-mac-turkish', 'x-mac-croatian', 'utf-32', 'utf-32BE', 'x-Chinese-CNS', 'x-cp20001', 'x-Chinese-Eten', 'x-cp20003', 'x-cp20004', 'x-cp20005', 'x-IA5', 'x-IA5-German', 'x-IA5-Swedish', 'x-IA5-Norwegian', 'us-ascii', 'x-cp20261', 'x-cp20269', 'IBM273', 'IBM277', 'IBM278', 'IBM280', 'IBM284', 'IBM285', 'IBM290', 'IBM297', 'IBM420', 'IBM423', 'IBM424', 'x-EBCDIC-KoreanExtended', 'IBM-Thai', 'koi8-r', 'IBM871', 'IBM880', 'IBM905', 'IBM00924', 'EUC-JP', 'x-cp20936', 'x-cp20949', 'cp1025', 'koi8-u', 'iso-8859-1', 'iso-8859-2', 'iso-8859-3', 'iso-8859-4', 'iso-8859-5', 'iso-8859-6', 'iso-8859-7', 'iso-8859-8', 'iso-8859-9', 'iso-8859-13', 'iso-8859-15', 'x-Europa', 'iso-8859-8-i', 'iso-2022-jp', 'csISO2022JP', 'iso-2022-jp', 'iso-2022-kr', 'x-cp50227', 'euc-jp', 'EUC-CN', 'euc-kr', 'hz-gb-2312', 'GB18030', 'x-iscii-de', 'x-iscii-be', 'x-iscii-ta', 'x-iscii-te', 'x-iscii-as', 'x-iscii-or', 'x-iscii-ka', 'x-iscii-ma', 'x-iscii-gu', 'x-iscii-pa', 'utf-7', 'utf-8')]
        [string]$targetEncoding
    ) {
        $reader = [IO.StreamReader]::new(
            $source,
            [Text.Encoding]::GetEncoding($sourceEncoding)
        )
        $writer = [IO.StreamWriter]::new(
            $target,
            $false, # don't append = overwrite
            [Text.Encoding]::GetEncoding($targetEncoding)
        )
        $buf = [char[]]::new(16MB)
    
        $nRead = $reader.Read($buf, 0, $buf.Length)
        $writer.Write(
            ([regex]"(<\?xml [^>]*?encoding="")(?i)$sourceEncoding(?="")").Replace(
                [string]::new($buf, 0, $nRead),
                '$1' + $targetEncoding,
                1 # speedup: one replacement only
            )
        )
        while (!$reader.EndOfStream) {
            $nRead = $reader.Read($buf, 0, $buf.Length)
            $writer.Write($buf, 0, $nRead)
        }
        $reader.Close()
        $writer.Close()
    }
    

    Usage:

    transcodeXML 'r:\1.xml' iso-8859-1 'r:\2.xml' utf-8
    

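Answer 1 (score: 0)

I suggest loading the XML with the Load method of System.Xml.Linq.XDocument, then changing the Encoding property of the document's Declaration property (https://msdn.microsoft.com/en-us/library/system.xml.linq.xdocument.declaration(v=vs.110).aspx). If the XDocument has no declaration (the property is null), add one. Finally, save the changed document with the Save method.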
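The steps above could look roughly like the following in PowerShell. This is a minimal sketch rather than a tested, production-ready script: the function name Convert-XmlEncoding and its parameter names are made up for illustration, and it assumes PowerShell 5 or later (for the ::new syntax). Unlike the streaming approach in the other answer, it loads the whole document into memory. As far as I know, Save with a file path writes the file in the encoding named in the declaration, so the declaration and the actual bytes should stay consistent.

```powershell
# Make the LINQ to XML types available (needed in Windows PowerShell).
Add-Type -AssemblyName System.Xml.Linq

# Hypothetical helper; the name and parameters are illustrative only.
function Convert-XmlEncoding {
    param(
        [string]$Source,        # input file; prefer an absolute path
        [string]$Target,        # output file; prefer an absolute path
        [string]$EncodingName   # e.g. 'utf-8' or 'iso-8859-1'
    )

    # Load detects the input encoding from the BOM or the XML declaration.
    $doc = [System.Xml.Linq.XDocument]::Load($Source)

    if ($null -eq $doc.Declaration) {
        # No declaration present: add one with the desired encoding.
        # [NullString]::Value passes a real null for the "standalone" argument.
        $doc.Declaration = [System.Xml.Linq.XDeclaration]::new(
            '1.0', $EncodingName, [NullString]::Value)
    }
    else {
        # Declaration present: only change its encoding.
        $doc.Declaration.Encoding = $EncodingName
    }

    # Save(path) is expected to encode the output as named in the declaration.
    $doc.Save($Target)
}
```

For example, `Convert-XmlEncoding -Source 'r:\1.xml' -Target 'r:\2.xml' -EncodingName 'utf-8'` would convert to UTF-8, and passing 'iso-8859-1' would go the other way. Absolute paths are the safer choice here, since .NET methods resolve relative paths against the process working directory, which can differ from PowerShell's current location.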