I'm trying to convert all the source files in a target folder to UTF-8 (without BOM) encoding. I use the following PowerShell script:
$MyPath = "D:\my projects\etc\"
Get-ChildItem $MyPath\* -Include *.h, *.cpp, *.c | Foreach-Object {
    $content = Get-Content $_.FullName
    $Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding $False
    [System.IO.File]::WriteAllLines($_.FullName, $content, $Utf8NoBomEncoding)
}
cmd /c pause | out-null
It works fine as long as a file is not already UTF-8. However, if some file is already UTF-8 without a BOM, all national characters are converted to unknown symbols (for example, if I run the script a second time). How can I change the script to fix this?
Answer 0 (score: 1)
Ansgar Wiechers points out in the comments that the problem is that Windows PowerShell, in the absence of a BOM, defaults to interpreting files as "ANSI"-encoded, i.e., as the encoding implied by the legacy system locale (ANSI code page), which is what [System.Text.Encoding]::Default reflects in the .NET Framework (but not in .NET Core).
Given your follow-up comments, the non-UTF-8 files among your input files are Windows-1251-encoded, so each file's content must be examined to determine its encoding, as follows:
Read each file with -Encoding Utf8 and test whether the resulting string contains the Unicode REPLACEMENT CHARACTER (U+FFFD). If it does, the file is not UTF-8, because this special character is used to signal that byte sequences that are invalid in UTF-8 were encountered.
If the file is not valid UTF-8, simply read it again without -Encoding, which causes Windows PowerShell to interpret it as Windows-1251-encoded, since that is the encoding (code page) implied by your system locale.
$MyPath = "D:\my projects\etc"
Get-ChildItem $MyPath\* -Include *.h, *.cpp, *.c | Foreach-Object {
    # Note:
    #  * the use of -Encoding Utf8 to first try to read the file as UTF-8.
    #  * the use of -Raw to read the entire file as a *single string*.
    $content = Get-Content -Raw -Encoding Utf8 $_.FullName

    # If the replacement char. is found in the content, the implication
    # is that the file is NOT UTF-8, so read it again *without -Encoding*,
    # which interprets the file as "ANSI"-encoded (Windows-1251, in your case).
    if ($content.Contains([char] 0xfffd)) {
        $content = Get-Content -Raw $_.FullName
    }

    # Note the use of WriteAllText() in lieu of WriteAllLines()
    # and that no explicit encoding object is passed, given that
    # .NET *defaults* to BOM-less UTF-8.
    # CAVEAT: There's a slight risk of data loss if writing back to the input
    #         file is interrupted.
    [System.IO.File]::WriteAllText($_.FullName, $content)
}
A faster alternative is to use [IO.File]::ReadAllText() with a UTF-8 encoding object that throws an exception when invalid-as-UTF-8 bytes are encountered (PSv5+ syntax):
$utf8EncodingThatThrows = [Text.UTF8Encoding]::new($false, $true)
# ...
try {
    $content = [IO.File]::ReadAllText($_.FullName, $utf8EncodingThatThrows)
} catch [Text.DecoderFallbackException] {
    $content = [IO.File]::ReadAllText($_.FullName, [Text.Encoding]::Default)
}
# ...
Adapting the above solutions to PowerShell Core / .NET Core:
PowerShell Core defaults to (BOM-less) UTF-8, so simply omitting -Encoding does not work for reading ANSI-encoded files. Similarly, [System.Text.Encoding]::Default always reports UTF-8 in .NET Core.
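As a quick sanity check (illustrative only), you can inspect the code page that this default encoding reports:
# In PowerShell Core / .NET Core this prints 65001 (UTF-8); in Windows
# PowerShell / .NET Framework it prints the ANSI code page of the system
# locale instead (e.g., 1251 or 1252).
[System.Text.Encoding]::Default.CodePage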
Therefore, you must determine the active system locale's ANSI code page manually and obtain the corresponding encoding object:
$ansiEncoding = [Text.Encoding]::GetEncoding(
    [int] (Get-ItemPropertyValue HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage ACP)
)
Then you must pass this encoding explicitly to Get-Content -Encoding (Get-Content -Raw -Encoding $ansiEncoding $_.FullName) or to the .NET methods ([IO.File]::ReadAllText($_.FullName, $ansiEncoding)).
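Putting the pieces together, a PowerShell Core adaptation of the try/catch approach above might look like this (a sketch that merely combines the snippets shown so far; it assumes $ansiEncoding as defined above):
$MyPath = "D:\my projects\etc"
$utf8EncodingThatThrows = [Text.UTF8Encoding]::new($false, $true)
Get-ChildItem $MyPath\* -Include *.h, *.cpp, *.c | Foreach-Object {
    try {
        # Try strict UTF-8 first; invalid-as-UTF-8 byte sequences throw.
        $content = [IO.File]::ReadAllText($_.FullName, $utf8EncodingThatThrows)
    } catch [Text.DecoderFallbackException] {
        # Not UTF-8: fall back to the system locale's ANSI code page.
        $content = [IO.File]::ReadAllText($_.FullName, $ansiEncoding)
    }
    # Rewrite as BOM-less UTF-8, the .NET default.
    [IO.File]::WriteAllText($_.FullName, $content)
}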
Therefore, if some of the UTF-8-encoded files (already) lack a BOM, you must explicitly instruct Get-Content to treat them as UTF-8 by using -Encoding Utf8; otherwise, if they contain characters outside the 7-bit ASCII range, they will be misinterpreted:
$MyPath = "D:\my projects\etc"
Get-ChildItem $MyPath\* -Include *.h, *.cpp, *.c | Foreach-Object {
    # Note:
    #  * the use of -Encoding Utf8 to ensure the correct interpretation of the input file
    #  * the use of -Raw to read the entire file as a *single string*.
    $content = Get-Content -Raw -Encoding Utf8 $_.FullName

    # Note the use of WriteAllText() in lieu of WriteAllLines()
    # and that no explicit encoding object is passed, given that
    # .NET *defaults* to BOM-less UTF-8.
    # CAVEAT: There's a slight risk of data loss if writing back to the input
    #         file is interrupted.
    [System.IO.File]::WriteAllText($_.FullName, $content)
}
Note: In your scenario there is no need to rewrite the UTF-8 files that already lack a BOM, but doing so is benign and simplifies the code. The alternative would be to test whether a given file's first 3 bytes are the UTF-8 BOM and to skip the file if they are not (see the sketch after the two snippets below):
$hasUtf8Bom = "$(Get-Content -Encoding Byte -First 3 $_.FullName)" -eq '239 187 191'
(Windows PowerShell) or
$hasUtf8Bom = "$(Get-Content -AsByteStream -First 3 $_.FullName)" -eq '239 187 191'
(PowerShell Core).
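For illustration, here is how that test could be wired into the Windows PowerShell loop so that files that already lack a BOM are left untouched (a sketch along the lines of the code above):
Get-ChildItem $MyPath\* -Include *.h, *.cpp, *.c | Foreach-Object {
    $hasUtf8Bom = "$(Get-Content -Encoding Byte -First 3 $_.FullName)" -eq '239 187 191'
    # No BOM to strip: move on to the next file.
    if (-not $hasUtf8Bom) { return }
    $content = Get-Content -Raw -Encoding Utf8 $_.FullName
    [System.IO.File]::WriteAllText($_.FullName, $content)
}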
As an aside: If there are input files with non-UTF-8 encodings (e.g., UTF-16), the solution still works as long as those files have a BOM, because PowerShell (quietly) gives a BOM precedence over the encoding specified via -Encoding.
Note that using -Raw / WriteAllText() to read / write each file as a whole (a single string) not only speeds up processing, it also ensures that the following characteristics of each input file are preserved: the specific newline style (CRLF vs. LF-only), and whether or not the content ends in a trailing newline.
By contrast, not using -Raw (reading line by line) and using .WriteAllLines() does not preserve these characteristics: you invariably get platform-appropriate newlines (in Windows PowerShell, always CRLF), and you always get a trailing newline.
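To see the difference concretely, consider this throwaway demo (hypothetical temp-file path; Windows PowerShell):
# Create a 2-line, LF-only file *without* a trailing newline.
[IO.File]::WriteAllText("$env:TEMP\demo.txt", "line1`nline2")

# Round trip with -Raw / WriteAllText(): the result is byte-identical.
[IO.File]::WriteAllText("$env:TEMP\demo.txt", (Get-Content -Raw "$env:TEMP\demo.txt"))

# Round trip line by line with WriteAllLines(): LF becomes CRLF, and a
# trailing newline is appended.
[IO.File]::WriteAllLines("$env:TEMP\demo.txt", [string[]] (Get-Content "$env:TEMP\demo.txt"))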
Note that when reading files the multi-platform PowerShell Core edition sensibly defaults to BOM-less UTF-8, and by default it also creates UTF-8 files without a BOM; creating a UTF-8 file with a BOM requires explicit opt-in with -Encoding utf8BOM.
The PowerShell Core solution is therefore much simpler:
# PowerShell Core only.
$MyPath = "D:\my projects\etc"
Get-ChildItem $MyPath\* -Include *.h, *.cpp, *.c | Foreach-Object {
    # * Read the file at hand (UTF8 files both with and without BOM are
    #   read correctly).
    # * Simply rewrite it with the *default* encoding, which in
    #   PowerShell Core is BOM-less UTF-8.
    # Note the (...) around the Get-Content call, which is necessary in order
    # to write back to the *same* file in the same pipeline.
    # CAVEAT: There's a slight risk of data loss if writing back to the input
    #         file is interrupted.
    (Get-Content -Raw $_.FullName) | Set-Content -NoNewline $_.FullName
}
The above solution works, but Get-Content and Set-Content are relatively slow, so using the .NET types to read and rewrite the files performs better.
As above, no encoding needs to be specified explicitly in the following solution (not even in Windows PowerShell), because .NET itself has commendably defaulted to BOM-less UTF-8 since its inception (while still recognizing a UTF-8 BOM if present):
$MyPath = "D:\my projects\etc"
Get-ChildItem $MyPath\* -Include *.h, *.cpp, *.c | Foreach-Object {
    # CAVEAT: There's a slight risk of data loss if writing back to the input
    #         file is interrupted.
    [System.IO.File]::WriteAllText(
        $_.FullName,
        [System.IO.File]::ReadAllText($_.FullName)
    )
}
Answer 1 (score: 0)
Check correctly for the presence of a BOM, e.g., using the following template (apply your actions in place of the comments about the BOM):
$ps1scripts = Get-ChildItem .\*.ps1 -Recurse # change to match your circumstances
foreach ( $ps1script in $ps1scripts ) {
    $first3 = $ps1script | Get-Content -Encoding byte -TotalCount 3
    $first3Hex = '{0:X2}{1:X2}{2:X2}' -f $first3[0],$first3[1],$first3[2]
    $first2Hex = '{0:x2}{1:x2}' -f $first3[0],$first3[1]
    if ( $first3Hex -eq 'EFBBBF' ) {
        # UTF-8 BOM
    } elseif ( $first2Hex -eq 'fffe' ) {
        # UCS-2LE BOM
    } elseif ( $first2Hex -eq 'feff' ) {
        # UCS-2BE BOM
    } else {
        # unknown (no BOM)
    }
}
Note that the above template was derived from my old scripts; you can change the first line as follows:
$MyPath = "D:\my projects\etc\"
$ps1scripts = Get-ChildItem $MyPath\* -Include *.h, *.cpp, *.c
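For the question's task, the comment placeholders could then be replaced with actions, for example (a sketch; only the UTF-8-with-BOM branch is shown):
if ( $first3Hex -eq 'EFBBBF' ) {
    # UTF-8 with BOM: rewrite the content as BOM-less UTF-8
    # (WriteAllText() defaults to BOM-less UTF-8).
    $content = Get-Content -Raw -Encoding UTF8 $ps1script.FullName
    [IO.File]::WriteAllText($ps1script.FullName, $content)
}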