How can I find out the encoding of a file from the command line?

Time: 2017-03-31 08:44:25

Tags: cmd

Is there any command in Windows that reports the encoding of a file?
For example, one that tells me that the file A.txt is encoded as UTF-16.

1 answer:

Answer 0 (score: 2)

As far as I know, the Windows command prompt (cmd) has no command that can determine the encoding of a text file.

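Nevertheless, I wrote a small batch file that checks a few conditions and thereby determines whether a given text file is ASCII-/ANSI-encoded or Unicode-encoded (UTF-8 or UTF-16, Little Endian or Big Endian). First, it checks whether the first non-empty line contains zero-bytes, which indicates that the file is not ASCII-/ANSI-encoded. Next, it checks whether the first few bytes constitute the Byte Order Mark (BOM) for UTF-8 or UTF-16. Since the BOM is optional for Unicode-encoded files, its absence is not a clear sign that a file is ASCII-/ANSI-encoded.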

So here is the code, with plenty of explanatory remarks (rem); I hope it helps:

@echo off
setlocal EnableExtensions DisableDelayedExpansion

rem // Define constants here:
set "_FILE=%~1" & rem // (provide file via the first command line argument)

rem // Store current code page to be able to restore it finally:
for /F "tokens=2 delims=:" %%C in ('chcp') do set /A "$CP=%%C"
rem /* Change to code page 437 (original IBM PC or DOS code page) temporarily;
rem    this is necessary for extended characters not to be converted: */
> nul chcp 437

rem // Attempt to read first line from file; this fails if zero-bytes occur:
(
    rem // Reset line string variable:
    set "LINE="
    rem /* The loop does not iterate over an empty file or one with empty lines only;
    rem    therefore, the behaviour is the same as when zero-bytes occur: */
    for /F usebackq^ delims^=^ eol^= %%L in ("%_FILE%") do (
        rem // Store first line string:
        set "LINE=%%L"
        rem // Abort reading file after first non-empty line:
        goto :NEXT
    )
) || (
    rem /* The `for /F` loop returns a non-zero exit code in case the file is empty,
    rem    contains empty lines only or the first non-empty line contains zero-bytes;
    rem    to determine whether there are zero-bytes, let `find` process the file,
    rem    which converts any zero-bytes to spaces, so `for /F` can read the file: */
    (
        rem // In case the file is empty, the loop does not iterate:
        for /F delims^=^ eol^= %%L in ('^< "%_FILE%" find /V ""') do (
            rem // Abort reading file after first non-empty line:
            goto :ZERO
        )
    ) || (
        rem /* The loop did not iterate, so the file is empty or holds empty lines only;
        rem    restore the initial code page prior to termination: */
        > nul chcp %$CP%
        >&2 echo The file is empty, hence encoding cannot be determined!
        exit /B 1
    )
)

rem // This point is reached in case the file contains zero-bytes:
:ZERO
rem // Restore the initial code page prior to termination:
> nul chcp %$CP%
>&2 echo NULL-bytes detected in first line, so file is non-ASCII/ANSI!
exit /B 1

rem // This point is reached in case the file does not contain any zero-bytes:
:NEXT
rem /* Build Byte Order Marks (BOMs) for UTF-16-encoded text (Little Endian and Big Endian)
rem    and for UTF-8-encoded text: */
for /F "tokens=1-3" %%A in ('
    forfiles /P "%~dp0." /M "%~nx0" /C "cmd /C echo 0xFF0xFE 0xFE0xFF 0xEF0xBB0xBF"
') do set "$LE=%%A" & set "$BE=%%B" & set "$U8=%%C"

rem // Check whether the first line of the file begins with any of the BOMs:
if not "%LINE:~,2%"=="%$LE%" if not "%LINE:~,2%"=="%$BE%" if not "%LINE:~,3%"=="%$U8%" goto :CONT
rem /* One of the BOMs has been encountered, hence the file is Unicode-encoded;
rem    restore the initial code page prior to termination: */
> nul chcp %$CP%
>&2 echo BOM encountered in first line, so file is non-ASCII/ANSI!
exit /B 1

rem // This point is reached in case the file does not appear as Unicode-encoded:
:CONT
rem // Restore the initial code page prior to termination:
> nul chcp %$CP%
echo The file appears to be an ASCII-/ANSI-encoded text.

endlocal
exit /B
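
As a cross-check outside cmd, the same heuristics can be sketched at the byte level in Python. This is only a rough sketch under a few assumptions, not part of the batch file: the names `guess_encoding`, `guess_file_encoding` and `sample_size` are made up for illustration, the BOM is tested before the zero-bytes so that the encoding can be named, and zero-bytes are searched in the whole sample rather than in the first non-empty line only. Like the batch file, it does not consider UTF-32.

```python
import codecs

# Byte Order Marks the batch file looks for, with the encoding each one implies.
BOMS = (
    (codecs.BOM_UTF8, "UTF-8 (BOM)"),                     # EF BB BF
    (codecs.BOM_UTF16_LE, "UTF-16 Little Endian (BOM)"),  # FF FE
    (codecs.BOM_UTF16_BE, "UTF-16 Big Endian (BOM)"),     # FE FF
)


def guess_encoding(data: bytes) -> str:
    """Classify raw bytes by BOM and zero-byte heuristics (a guess, not proof)."""
    if not data:
        return "empty"
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    # No BOM: a zero-byte still rules out ASCII/ANSI (typical for UTF-16 text).
    if b"\x00" in data:
        return "non-ASCII/ANSI (zero-bytes, no BOM)"
    # A missing BOM proves nothing: UTF-8 without BOM also ends up here.
    return "ASCII/ANSI or UTF-8 without BOM"


def guess_file_encoding(path: str, sample_size: int = 4096) -> str:
    """Hypothetical helper: classify a file by its first sample_size bytes."""
    with open(path, "rb") as handle:
        return guess_encoding(handle.read(sample_size))
```

With the file from the question, `print(guess_file_encoding("A.txt"))` would print one of the labels above. The last label is deliberately vague, because without a BOM the bytes alone cannot tell ANSI text from UTF-8 text.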