I'm trying to convert all the source files in a target folder to UTF-8 (without BOM) encoding. I use the following PowerShell script:
$MyPath = "D:\my projects\etc\"
Get-ChildItem $MyPath\* -Include *.h, *.cpp, *.c | Foreach-Object {
    $content = Get-Content $_.FullName
    $Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding $False
    [System.IO.File]::WriteAllLines($_.FullName, $content, $Utf8NoBomEncoding)
}
cmd /c pause | out-null
It works fine as long as a file is not already UTF-8. However, if some file is already UTF-8 without a BOM, all national characters are converted to unknown symbols (for example, if I run the script a second time). How can I change the script to fix this?
Answer 0 (score: 1)
Ansgar Wiechers points out in the comments that the problem is that Windows PowerShell, in the absence of a BOM, defaults to interpreting files as "ANSI"-encoded, i.e., as the encoding implied by the legacy system locale (ANSI code page), which is what [System.Text.Encoding]::Default reflects in the .NET Framework (but not in .NET Core).
Given your follow-up comments, the non-UTF-8 files among your input files are Windows-1251-encoded, so each file's content must be examined to determine its encoding, as follows:
Read each file with -Encoding Utf8 and test whether the resulting string contains the Unicode REPLACEMENT CHARACTER (U+FFFD). If it does, the file is not UTF-8, because this special character is used to signal that byte sequences that are invalid in UTF-8 were encountered.
If the file is not valid UTF-8, simply read it again without -Encoding, which causes Windows PowerShell to interpret it as Windows-1251-encoded, since that is the encoding (code page) implied by your system locale.
$MyPath = "D:\my projects\etc"
Get-ChildItem $MyPath\* -Include *.h, *.cpp, *.c | Foreach-Object {
    # Note:
    #  * the use of -Encoding Utf8 to first try to read the file as UTF-8.
    #  * the use of -Raw to read the entire file as a *single string*.
    $content = Get-Content -Raw -Encoding Utf8 $_.FullName

    # If the replacement char. is found in the content, the implication
    # is that the file is NOT UTF-8, so read it again *without -Encoding*,
    # which interprets the file as "ANSI"-encoded (Windows-1251, in your case).
    if ($content.Contains([char] 0xfffd)) {
        $content = Get-Content -Raw $_.FullName
    }

    # Note the use of WriteAllText() in lieu of WriteAllLines()
    # and that no explicit encoding object is passed, given that
    # .NET *defaults* to BOM-less UTF-8.
    # CAVEAT: There's a slight risk of data loss if writing back to the input
    #         file is interrupted.
    [System.IO.File]::WriteAllText($_.FullName, $content)
}
A faster alternative is to use [IO.File]::ReadAllText() with a UTF-8 encoding object that throws an exception when invalid-as-UTF-8 bytes are encountered (PSv5+ syntax):
$utf8EncodingThatThrows = [Text.UTF8Encoding]::new($false, $true)
# ...
try {
    $content = [IO.File]::ReadAllText($_.FullName, $utf8EncodingThatThrows)
} catch [Text.DecoderFallbackException] {
    $content = [IO.File]::ReadAllText($_.FullName, [Text.Encoding]::Default)
}
# ...
Adapting the above solutions to PowerShell Core / .NET Core:
PowerShell Core defaults to (BOM-less) UTF-8, so simply omitting -Encoding does not work for reading ANSI-encoded files. Similarly, [System.Text.Encoding]::Default always reports UTF-8 in .NET Core.
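As a quick sanity check (illustrative only), you can inspect the code page that this default encoding reports:
# In PowerShell Core / .NET Core this prints 65001 (UTF-8); in Windows
# PowerShell / .NET Framework it prints the ANSI code page of the system
# locale instead (e.g., 1251 or 1252).
[System.Text.Encoding]::Default.CodePage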
Therefore, you must determine the active system locale's ANSI code page manually and obtain the corresponding encoding object:
$ansiEncoding = [Text.Encoding]::GetEncoding(
    [int] (Get-ItemPropertyValue HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage ACP)
)
Then you must pass this encoding explicitly to Get-Content -Encoding (Get-Content -Raw -Encoding $ansiEncoding $_.FullName) or to the .NET methods ([IO.File]::ReadAllText($_.FullName, $ansiEncoding)).
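Putting the pieces together, a PowerShell Core adaptation of the try/catch approach above might look like this (a sketch that merely combines the snippets shown so far; it assumes $ansiEncoding as defined above):
$MyPath = "D:\my projects\etc"
$utf8EncodingThatThrows = [Text.UTF8Encoding]::new($false, $true)
Get-ChildItem $MyPath\* -Include *.h, *.cpp, *.c | Foreach-Object {
    try {
        # Try strict UTF-8 first; invalid-as-UTF-8 byte sequences throw.
        $content = [IO.File]::ReadAllText($_.FullName, $utf8EncodingThatThrows)
    } catch [Text.DecoderFallbackException] {
        # Not UTF-8: fall back to the system locale's ANSI code page.
        $content = [IO.File]::ReadAllText($_.FullName, $ansiEncoding)
    }
    # Rewrite as BOM-less UTF-8, the .NET default.
    [IO.File]::WriteAllText($_.FullName, $content)
}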
Therefore, if some of the UTF-8-encoded files (already) lack a BOM, you must explicitly instruct Get-Content to treat them as UTF-8 by using -Encoding Utf8; otherwise, if they contain characters outside the 7-bit ASCII range, they will be misinterpreted:
$MyPath = "D:\my projects\etc"
Get-ChildItem $MyPath\* -Include *.h, *.cpp, *.c | Foreach-Object {
    # Note:
    #  * the use of -Encoding Utf8 to ensure the correct interpretation of the input file
    #  * the use of -Raw to read the entire file as a *single string*.
    $content = Get-Content -Raw -Encoding Utf8 $_.FullName

    # Note the use of WriteAllText() in lieu of WriteAllLines()
    # and that no explicit encoding object is passed, given that
    # .NET *defaults* to BOM-less UTF-8.
    # CAVEAT: There's a slight risk of data loss if writing back to the input
    #         file is interrupted.
    [System.IO.File]::WriteAllText($_.FullName, $content)
}
Note: In your scenario there is no need to rewrite the UTF-8 files that already lack a BOM, but doing so is benign and simplifies the code. The alternative would be to test whether a given file's first 3 bytes are the UTF-8 BOM and to skip the file if they are not (see the sketch after the two snippets below):
$hasUtf8Bom = "$(Get-Content -Encoding Byte -First 3 $_.FullName)" -eq '239 187 191'
(Windows PowerShell) or
$hasUtf8Bom = "$(Get-Content -AsByteStream -First 3 $_.FullName)" -eq '239 187 191'
(PowerShell Core).
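For illustration, here is how that test could be wired into the Windows PowerShell loop so that files that already lack a BOM are left untouched (a sketch along the lines of the code above):
Get-ChildItem $MyPath\* -Include *.h, *.cpp, *.c | Foreach-Object {
    $hasUtf8Bom = "$(Get-Content -Encoding Byte -First 3 $_.FullName)" -eq '239 187 191'
    # No BOM to strip: move on to the next file.
    if (-not $hasUtf8Bom) { return }
    $content = Get-Content -Raw -Encoding Utf8 $_.FullName
    [System.IO.File]::WriteAllText($_.FullName, $content)
}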
As an aside: If there are input files with non-UTF-8 encodings (e.g., UTF-16), the solution still works as long as those files have a BOM, because PowerShell (quietly) gives a BOM precedence over the encoding specified via -Encoding.
Note that using -Raw / WriteAllText() to read / write each file as a whole (a single string) not only speeds up processing, it also ensures that the following characteristics of each input file are preserved: the specific newline style (CRLF vs. LF-only), and whether or not the content ends in a trailing newline.
By contrast, not using -Raw (reading line by line) and using .WriteAllLines() does not preserve these characteristics: you invariably get platform-appropriate newlines (in Windows PowerShell, always CRLF), and you always get a trailing newline.
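To see the difference concretely, consider this throwaway demo (hypothetical temp-file path; Windows PowerShell):
# Create a 2-line, LF-only file *without* a trailing newline.
[IO.File]::WriteAllText("$env:TEMP\demo.txt", "line1`nline2")

# Round trip with -Raw / WriteAllText(): the result is byte-identical.
[IO.File]::WriteAllText("$env:TEMP\demo.txt", (Get-Content -Raw "$env:TEMP\demo.txt"))

# Round trip line by line with WriteAllLines(): LF becomes CRLF, and a
# trailing newline is appended.
[IO.File]::WriteAllLines("$env:TEMP\demo.txt", [string[]] (Get-Content "$env:TEMP\demo.txt"))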
Note that when reading files the multi-platform PowerShell Core edition sensibly defaults to BOM-less UTF-8, and by default it also creates UTF-8 files without a BOM; creating a UTF-8 file with a BOM requires explicit opt-in with -Encoding utf8BOM.
The PowerShell Core solution is therefore much simpler:
# PowerShell Core only.
$MyPath = "D:\my projects\etc"
Get-ChildItem $MyPath\* -Include *.h, *.cpp, *.c | Foreach-Object {
    # * Read the file at hand (UTF8 files both with and without BOM are
    #   read correctly).
    # * Simply rewrite it with the *default* encoding, which in
    #   PowerShell Core is BOM-less UTF-8.
    # Note the (...) around the Get-Content call, which is necessary in order
    # to write back to the *same* file in the same pipeline.
    # CAVEAT: There's a slight risk of data loss if writing back to the input
    #         file is interrupted.
    (Get-Content -Raw $_.FullName) | Set-Content -NoNewline $_.FullName
}
The above solution works, but Get-Content and Set-Content are relatively slow, so using the .NET types to read and rewrite the files performs better.
As above, no encoding needs to be specified explicitly in the following solution (not even in Windows PowerShell), because .NET itself has commendably defaulted to BOM-less UTF-8 since its inception (while still recognizing a UTF-8 BOM if present):
$MyPath = "D:\my projects\etc"
Get-ChildItem $MyPath\* -Include *.h, *.cpp, *.c | Foreach-Object {
    # CAVEAT: There's a slight risk of data loss if writing back to the input
    #         file is interrupted.
    [System.IO.File]::WriteAllText(
        $_.FullName,
        [System.IO.File]::ReadAllText($_.FullName)
    )
}
Answer 1 (score: 0)
Check correctly for the presence of a BOM, e.g., using the following template (apply your actions in place of the comments about the BOM):
$ps1scripts = Get-ChildItem .\*.ps1 -Recurse # change to match your circumstances
foreach ( $ps1script in $ps1scripts ) {
    $first3 = $ps1script | Get-Content -Encoding byte -TotalCount 3
    $first3Hex = '{0:X2}{1:X2}{2:X2}' -f $first3[0],$first3[1],$first3[2]
    $first2Hex = '{0:x2}{1:x2}' -f $first3[0],$first3[1]
    if ( $first3Hex -eq 'EFBBBF' ) {
        # UTF-8 BOM
    } elseif ( $first2Hex -eq 'fffe' ) {
        # UCS-2LE BOM
    } elseif ( $first2Hex -eq 'feff' ) {
        # UCS-2BE BOM
    } else {
        # unknown (no BOM)
    }
}
Note that the above template was derived from my old scripts; you can change the first line as follows:
$MyPath = "D:\my projects\etc\"
$ps1scripts = Get-ChildItem $MyPath\* -Include *.h, *.cpp, *.c
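For the question's task, the comment placeholders could then be replaced with actions, for example (a sketch; only the UTF-8-with-BOM branch is shown):
if ( $first3Hex -eq 'EFBBBF' ) {
    # UTF-8 with BOM: rewrite the content as BOM-less UTF-8
    # (WriteAllText() defaults to BOM-less UTF-8).
    $content = Get-Content -Raw -Encoding UTF8 $ps1script.FullName
    [IO.File]::WriteAllText($ps1script.FullName, $content)
}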