Question

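I have an HTML file, test.html, created with Atom, that contains:

Testé encoding utf-8

When I read it in the PowerShell console (I am using a French Windows):

Get-Content -Raw test.html

I get back:

TestÃ© encoding utf-8

Why is the accented character not printed correctly?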

Answer 1

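By default, the Atom editor creates UTF-8 files without a pseudo-BOM (which is the right thing to do from a cross-platform perspective).
- Other popular cross-platform editors, such as Visual Studio Code and Sublime Text, behave the same way.
Windows PowerShell ^[1] only recognizes UTF-8 files that have a pseudo-BOM.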

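Without the pseudo-BOM, PowerShell interprets the file as encoded in the system's legacy code page, e.g. Windows-1252 on US-English systems (and on French ones, as in the question). Each of the two UTF-8 bytes that encode é is then decoded as a separate character, which yields Ã©.
(This is also the default encoding used by Notepad, which calls it "ANSI", not only when reading files but also when creating them. By contrast, Windows PowerShell's > and Out-File create UTF-16LE-encoded files by default.)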
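
To see where Ã© comes from, the byte-level mix-up can be reproduced directly. The following is a minimal sketch, assuming Windows-1252 is the legacy code page involved; the variable names are illustrative and not part of the original answer. The sample text is built with [char]0xE9 rather than a literal é so that the script file's own encoding cannot interfere.

```powershell
# Build the sample text without a literal accented character.
$text = 'Test' + [char]0xE9

# Encode it as UTF-8 without a BOM, the way Atom saves files.
$utf8NoBom = New-Object System.Text.UTF8Encoding $false
$bytes = $utf8NoBom.GetBytes($text)

# Hex dump of the bytes: the single character U+00E9 becomes the two bytes C3 A9.
$hex = ($bytes | ForEach-Object { '{0:X2}' -f $_ }) -join ' '
$hex

# Decode the same bytes as Windows-1252: every byte becomes its own character,
# so C3 A9 turns into the two characters U+00C3 and U+00A9.
$ansi = [System.Text.Encoding]::GetEncoding(1252)
$misread = $ansi.GetString($bytes)
$misread

# Decoding as UTF-8 recovers the original text.
$roundTrip = $utf8NoBom.GetString($bytes)
$roundTrip
```

The file on disk is fine in this scenario; only the decoding step picks the wrong encoding.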

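Therefore, for Get-Content to read a BOM-less UTF-8 file correctly in Windows PowerShell, you must state the encoding explicitly with the -Encoding parameter (value utf8).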
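
As a sketch of that fix, the snippet below writes a BOM-less UTF-8 file and reads it back with an explicit encoding. The temporary file path and variable names are assumptions made for illustration; with the file from the question, only the Get-Content line is needed.

```powershell
# Write a BOM-less UTF-8 sample file to a hypothetical temp location, as Atom would.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'test.html'
$text = 'Test' + [char]0xE9 + ' encoding utf-8'
[System.IO.File]::WriteAllText($path, $text, (New-Object System.Text.UTF8Encoding $false))

# Name the encoding explicitly so Windows PowerShell does not fall back
# to the legacy code page.
$fixed = Get-Content -Raw -Encoding UTF8 -LiteralPath $path
$fixed

# Alternative via .NET: File.ReadAllText assumes UTF-8 when it finds no BOM.
# A full path is used because .NET's working directory can differ from PowerShell's.
$viaDotNet = [System.IO.File]::ReadAllText($path)
$viaDotNet
```

Passing -Encoding UTF8 is harmless when the file does have a BOM, so it is a reasonable habit whenever the files are known to be UTF-8.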

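^[1] By contrast, the cross-platform PowerShell Core edition commendably defaults to UTF-8, both when reading and when writing. It therefore interprets UTF-8-encoded files correctly even without a BOM, and it also creates files without a BOM by default.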

Answer 2

# Created a UTF-8 Sig (UTF-8 with BOM) file in Notepad
notepad .\test.html

# Get File contents with/without -raw
cat .\test.html;Get-Content -Raw .\test.html
Testé encoding utf-8
Testé encoding utf-8

# Check Encoding to make sure
Get-FileEncoding .\test.html
utf8

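As you can see, it definitely works with PowerShell v5 on Windows 10. I would double-check the file format and the contents of the file you created, since characters may have been introduced that your editor does not show.

If Get-FileEncoding is not available as a cmdlet in your PowerShell, you can run the following implementation: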

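function Get-FileEncoding([Parameter(Mandatory=$True)]$Path) {
    # Read the first four bytes. -Encoding byte is Windows PowerShell syntax;
    # PowerShell Core 6+ uses -AsByteStream instead.
    $bytes = [byte[]](Get-Content $Path -Encoding byte -ReadCount 4 -TotalCount 4)

    # An empty file has no BOM; this function reports it as 'utf8'.
    if(!$bytes) { return 'utf8' }

    # Match the hex form of the leading bytes against the known BOMs.
    # Longer patterns come first: the UTF-32 LE BOM (fffe0000) starts with
    # the UTF-16 LE BOM (fffe).
    switch -regex ('{0:x2}{1:x2}{2:x2}{3:x2}' -f $bytes[0],$bytes[1],$bytes[2],$bytes[3]) {
        '^efbbbf'   {return 'utf8'}
        '^2b2f76'   {return 'utf7'}
        '^fffe0000' {return 'utf32'}
        '^0000feff' {return 'bigendianutf32'}
        '^fffe'     {return 'unicode'}
        '^feff'     {return 'bigendianunicode'}
        # No BOM found: the file may still be BOM-less UTF-8 or ANSI, not only ASCII.
        default     {return 'ascii'}
    }
}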

PowerShell and UTF-8
