I'm trying to convert XML files from Latin1 to UTF-8 and vice versa. I've been running some tests, but I haven't been able to get it to work. I'm using
Get-Content C:\inputfile.xml | Set-Content -Encoding utf8 C:\outputfile.xml
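but this doesn't convert anything. So I tried supplying the encoding to Get-Content, but PowerShell doesn't recognize Latin1 (or at least that's what the error message tells me).
What's the best way to do this?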
Answer 0 (score: 2)
The fastest approach, especially for large XML files, is to use the .NET System.IO.File class.
Use ReadAllText with the Latin-1 encoding supplied explicitly:
[IO.File]::ReadAllText('r:\1.txt', [Text.Encoding]::GetEncoding('iso-8859-1')) |
Set-Content r:\2.txt -Encoding UTF8
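To see why the source encoding has to be named explicitly, it can help to look at the bytes. The following is only an illustrative sketch: the helper name ConvertTo-HexString is made up for this example and is not part of the answer. It encodes the single character é (U+00E9) in both encodings.

```powershell
# Illustrative sketch; ConvertTo-HexString is a hypothetical helper, not a built-in cmdlet.
function ConvertTo-HexString([byte[]]$bytes) {
    # Render each byte as two uppercase hex digits, separated by spaces.
    ($bytes | ForEach-Object { $_.ToString('X2') }) -join ' '
}

$latin1 = [Text.Encoding]::GetEncoding('iso-8859-1')
$utf8   = [Text.Encoding]::UTF8

# Build the character from its code point so the encoding of this script file doesn't matter.
$text = [string][char]0xE9

$latin1Hex = ConvertTo-HexString ($latin1.GetBytes($text))   # E9
$utf8Hex   = ConvertTo-HexString ($utf8.GetBytes($text))     # C3 A9
```

The same character is one byte in Latin-1 and two bytes in UTF-8, so a reader that guesses the wrong source encoding will decode different text from the same file.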
If your XML file has <?xml version="1.0" encoding="iso-8859-1" ?>, you need to change that as well:
[IO.File]::ReadAllText('r:\1.txt', [Text.Encoding]::GetEncoding('iso-8859-1')).
Replace('<?xml version="1.0" encoding="iso-8859-1"',
'<?xml version="1.0" encoding="UTF-8"') |
Set-Content r:\2.txt -Encoding UTF8
To write Latin-1, use WriteAllText with the Latin-1 encoding supplied explicitly:
[IO.File]::WriteAllText(
'r:\2.txt',
[IO.File]::ReadAllText('r:\1.txt', [Text.Encoding]::UTF8).
Replace('<?xml version="1.0" encoding="UTF-8"',
'<?xml version="1.0" encoding="iso-8859-1"'),
[Text.Encoding]::GetEncoding('iso-8859-1')
)
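One thing to keep in mind when going from UTF-8 to Latin-1 is that Latin-1 only covers U+0000 to U+00FF, so any other character cannot be written faithfully. A rough way to check beforehand, assuming the text fits in memory, is to encode with an exception fallback. The function name Test-Latin1Representable is hypothetical and only used for this sketch.

```powershell
# Hypothetical helper for illustration: returns $true when every character
# in $text can be encoded as ISO-8859-1 without loss.
function Test-Latin1Representable([string]$text) {
    # Ask for an encoder that throws instead of silently substituting characters.
    $strict = [Text.Encoding]::GetEncoding(
        'iso-8859-1',
        [Text.EncoderFallback]::ExceptionFallback,
        [Text.DecoderFallback]::ExceptionFallback
    )
    try {
        [void]$strict.GetBytes($text)
        return $true
    } catch {
        # GetBytes threw, so at least one character has no Latin-1 equivalent.
        return $false
    }
}

# Possible usage with the file from the example above:
# Test-Latin1Representable ([IO.File]::ReadAllText('r:\1.txt', [Text.Encoding]::UTF8))
```

For instance, é (U+00E9) passes the check, while the euro sign (U+20AC) does not.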
Memory-efficient transcoding that can handle files of any size (1TB? No problem!):
function transcodeXML(
[ValidateScript({Test-Path -Literal $_})]
[string]$source,
[ValidateSet('IBM037', 'IBM437', 'IBM500', 'ASMO-708', 'DOS-720', 'ibm737', 'ibm775', 'ibm850', 'ibm852', 'IBM855', 'ibm857', 'IBM00858', 'IBM860', 'ibm861', 'DOS-862', 'IBM863', 'IBM864', 'IBM865', 'cp866', 'ibm869', 'IBM870', 'windows-874', 'cp875', 'shift_jis', 'gb2312', 'ks_c_5601-1987', 'big5', 'IBM1026', 'IBM01047', 'IBM01140', 'IBM01141', 'IBM01142', 'IBM01143', 'IBM01144', 'IBM01145', 'IBM01146', 'IBM01147', 'IBM01148', 'IBM01149', 'utf-16', 'utf-16BE', 'windows-1250', 'windows-1251', 'Windows-1252', 'windows-1253', 'windows-1254', 'windows-1255', 'windows-1256', 'windows-1257', 'windows-1258', 'Johab', 'macintosh', 'x-mac-japanese', 'x-mac-chinesetrad', 'x-mac-korean', 'x-mac-arabic', 'x-mac-hebrew', 'x-mac-greek', 'x-mac-cyrillic', 'x-mac-chinesesimp', 'x-mac-romanian', 'x-mac-ukrainian', 'x-mac-thai', 'x-mac-ce', 'x-mac-icelandic', 'x-mac-turkish', 'x-mac-croatian', 'utf-32', 'utf-32BE', 'x-Chinese-CNS', 'x-cp20001', 'x-Chinese-Eten', 'x-cp20003', 'x-cp20004', 'x-cp20005', 'x-IA5', 'x-IA5-German', 'x-IA5-Swedish', 'x-IA5-Norwegian', 'us-ascii', 'x-cp20261', 'x-cp20269', 'IBM273', 'IBM277', 'IBM278', 'IBM280', 'IBM284', 'IBM285', 'IBM290', 'IBM297', 'IBM420', 'IBM423', 'IBM424', 'x-EBCDIC-KoreanExtended', 'IBM-Thai', 'koi8-r', 'IBM871', 'IBM880', 'IBM905', 'IBM00924', 'EUC-JP', 'x-cp20936', 'x-cp20949', 'cp1025', 'koi8-u', 'iso-8859-1', 'iso-8859-2', 'iso-8859-3', 'iso-8859-4', 'iso-8859-5', 'iso-8859-6', 'iso-8859-7', 'iso-8859-8', 'iso-8859-9', 'iso-8859-13', 'iso-8859-15', 'x-Europa', 'iso-8859-8-i', 'iso-2022-jp', 'csISO2022JP', 'iso-2022-jp', 'iso-2022-kr', 'x-cp50227', 'euc-jp', 'EUC-CN', 'euc-kr', 'hz-gb-2312', 'GB18030', 'x-iscii-de', 'x-iscii-be', 'x-iscii-ta', 'x-iscii-te', 'x-iscii-as', 'x-iscii-or', 'x-iscii-ka', 'x-iscii-ma', 'x-iscii-gu', 'x-iscii-pa', 'utf-7', 'utf-8')]
[string]$sourceEncoding,
[ValidateScript({Test-Path -Literal $_ -IsValid})]
[string]$target,
[ValidateSet('IBM037', 'IBM437', 'IBM500', 'ASMO-708', 'DOS-720', 'ibm737', 'ibm775', 'ibm850', 'ibm852', 'IBM855', 'ibm857', 'IBM00858', 'IBM860', 'ibm861', 'DOS-862', 'IBM863', 'IBM864', 'IBM865', 'cp866', 'ibm869', 'IBM870', 'windows-874', 'cp875', 'shift_jis', 'gb2312', 'ks_c_5601-1987', 'big5', 'IBM1026', 'IBM01047', 'IBM01140', 'IBM01141', 'IBM01142', 'IBM01143', 'IBM01144', 'IBM01145', 'IBM01146', 'IBM01147', 'IBM01148', 'IBM01149', 'utf-16', 'utf-16BE', 'windows-1250', 'windows-1251', 'Windows-1252', 'windows-1253', 'windows-1254', 'windows-1255', 'windows-1256', 'windows-1257', 'windows-1258', 'Johab', 'macintosh', 'x-mac-japanese', 'x-mac-chinesetrad', 'x-mac-korean', 'x-mac-arabic', 'x-mac-hebrew', 'x-mac-greek', 'x-mac-cyrillic', 'x-mac-chinesesimp', 'x-mac-romanian', 'x-mac-ukrainian', 'x-mac-thai', 'x-mac-ce', 'x-mac-icelandic', 'x-mac-turkish', 'x-mac-croatian', 'utf-32', 'utf-32BE', 'x-Chinese-CNS', 'x-cp20001', 'x-Chinese-Eten', 'x-cp20003', 'x-cp20004', 'x-cp20005', 'x-IA5', 'x-IA5-German', 'x-IA5-Swedish', 'x-IA5-Norwegian', 'us-ascii', 'x-cp20261', 'x-cp20269', 'IBM273', 'IBM277', 'IBM278', 'IBM280', 'IBM284', 'IBM285', 'IBM290', 'IBM297', 'IBM420', 'IBM423', 'IBM424', 'x-EBCDIC-KoreanExtended', 'IBM-Thai', 'koi8-r', 'IBM871', 'IBM880', 'IBM905', 'IBM00924', 'EUC-JP', 'x-cp20936', 'x-cp20949', 'cp1025', 'koi8-u', 'iso-8859-1', 'iso-8859-2', 'iso-8859-3', 'iso-8859-4', 'iso-8859-5', 'iso-8859-6', 'iso-8859-7', 'iso-8859-8', 'iso-8859-9', 'iso-8859-13', 'iso-8859-15', 'x-Europa', 'iso-8859-8-i', 'iso-2022-jp', 'csISO2022JP', 'iso-2022-jp', 'iso-2022-kr', 'x-cp50227', 'euc-jp', 'EUC-CN', 'euc-kr', 'hz-gb-2312', 'GB18030', 'x-iscii-de', 'x-iscii-be', 'x-iscii-ta', 'x-iscii-te', 'x-iscii-as', 'x-iscii-or', 'x-iscii-ka', 'x-iscii-ma', 'x-iscii-gu', 'x-iscii-pa', 'utf-7', 'utf-8')]
[string]$targetEncoding
) {
    $reader = [IO.StreamReader]::new(
        $source,
        [Text.Encoding]::GetEncoding($sourceEncoding)
    )
    $writer = [IO.StreamWriter]::new(
        $target,
        $false, # don't append = overwrite
        [Text.Encoding]::GetEncoding($targetEncoding)
    )
    $buf = [char[]]::new(16MB)
    $nRead = $reader.Read($buf, 0, $buf.Length)
    $writer.Write(
        ([regex]"(<\?xml [^>]*?encoding="")(?i)$sourceEncoding(?="")").Replace(
            [string]::new($buf, 0, $nRead),
            '$1' + $targetEncoding,
            1 # speedup: one replacement only
        )
    )
    while (!$reader.EndOfStream) {
        $nRead = $reader.Read($buf, 0, $buf.Length)
        $writer.Write($buf, 0, $nRead)
    }
    $reader.Close()
    $writer.Close()
}
Usage:
transcodeXML 'r:\1.xml' iso-8859-1 'r:\2.xml' utf-8
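Answer 1 (score: 0)
I'd suggest pulling the XML into a System.Xml.Linq.XDocument using the Load method, then changing the Encoding property of the Declaration property (https://msdn.microsoft.com/en-us/library/system.xml.linq.xdocument.declaration(v=vs.110).aspx), adding a Declaration if it is null on the XDocument. Finally, you can save the altered document using the Save method.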
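A minimal sketch of that approach in PowerShell might look like the following. The function name Convert-XmlEncoding and its parameter names are made up for this example. It relies on XDocument.Save picking the output encoding from the declaration, and it loads the whole document into memory, so it suits small and medium files rather than very large ones.

```powershell
# Needed in Windows PowerShell so the XDocument types can be resolved.
Add-Type -AssemblyName System.Xml.Linq

# Hypothetical helper for illustration only.
function Convert-XmlEncoding([string]$source, [string]$target, [string]$encodingName) {
    # Load detects the source encoding from the file's own XML declaration.
    $doc = [System.Xml.Linq.XDocument]::Load($source)

    if ($null -eq $doc.Declaration) {
        # No declaration in the source: add one that names the target encoding.
        # [NullString]::Value passes a real null for the "standalone" argument.
        $doc.Declaration = [System.Xml.Linq.XDeclaration]::new('1.0', $encodingName, [NullString]::Value)
    } else {
        $doc.Declaration.Encoding = $encodingName
    }

    # Save writes the file using the encoding named in the declaration.
    $doc.Save($target)
}

# Possible usage (use full paths; .NET resolves relative paths against the
# process working directory, which may differ from the PowerShell location):
# Convert-XmlEncoding 'r:\1.xml' 'r:\2.xml' 'utf-8'
```

Note that Load discards insignificant whitespace by default and Save re-indents the output, so the result may be formatted differently from the input even though the content is the same.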