Question

I have some text files that are encoded in different character encodings, such as ASCII, UTF-8, Big5, and GB2312.

I need to know their exact character encodings so that I can view them in a text editor; otherwise they show up as garbled text.

I searched online and found that the file command can display the character encoding of a file, for example:

$ file -bi *
text/plain; charset=iso-8859-1
text/plain; charset=us-ascii
text/plain; charset=iso-8859-1
text/plain; charset=utf-8

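Unfortunately, files encoded in Big5 and GB2312 both show charset=iso-8859-1, so I still cannot tell them apart. Is there a better way to check the character encoding of a text file?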

Answer 1

@ewcz's suggestion works to some extent.

$ uchardet *
big5.txt: BIG5
conf: ASCII
gb2312-windows.txt: GB18030
gb.txt: GB18030
test.java: UTF-8

and

$ enca -L chinese *
big5.txt: Traditional Chinese Industrial Standard; Big5
conf: 7bit ASCII characters
gb2312-windows.txt: Simplified Chinese National Standard; GB2312
  CRLF line terminators
gb.txt: Simplified Chinese National Standard; GB2312
test.java: Universal transformation format 8 bits; UTF-8
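Once a detector has reported a name, one way to sanity-check the guess is to convert the file to UTF-8 with iconv and look at the result. The following is only a minimal sketch: to_utf8 is a hypothetical helper name (it is not part of uchardet or enca), and it assumes an iconv that accepts the encoding name the detector printed.

```shell
#!/bin/sh
# to_utf8 FILE ENCODING
# Hypothetical helper for illustration, not part of uchardet or enca.
# Writes FILE, interpreted as ENCODING, to stdout as UTF-8.
# iconv exits non-zero if it hits bytes that are not valid in ENCODING,
# which hints that the guessed encoding is wrong.
to_utf8() {
    iconv -f "$2" -t UTF-8 < "$1"
}

# Example usage (file names taken from the listings above):
#   to_utf8 big5.txt BIG5 | head -n 5
#   to_utf8 gb.txt GB18030 | head -n 5
```

A successful conversion does not prove the guess is right, since the same bytes can be valid in more than one encoding (GB2312 bytes often also form valid Big5 sequences), so it is still worth reading the first few converted lines.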

Answer 2

You can use a command-line tool such as detect-file-encoding-and-language:

$ npm install -g detect-file-encoding-and-language

Then you can detect the encoding like this:

$ dfeal "/home/user name/Documents/subtitle file.srt"
# Possible result: { language: french, encoding: CP1252, confidence: { language: 0.99, encoding: 1 } }

Make sure you have Node.js and NPM installed! If you have not installed them yet:

$ sudo apt install nodejs npm

How to check the character encoding of a file in Linux
