Question

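I wrote an application (a psychological testing exam) in Delphi 7 that creates a standard text file, i.e. the file is of type ANSI.

Someone has ported the program to run on the Internet, probably using Java, and the resulting text file is of type UTF-8.

The program that reads these result files has to read both the files created by Delphi and the files created via the Internet.

I can convert UTF-8 text to ANSI (using the cunningly named function UTF8ToANSI), but how can I tell in advance which kind of file I have?

Since I 'own' the file format, I suppose the easiest way to deal with this would be to place a marker at a known position in the file that tells me which program produced it (Delphi / Internet), but that feels like cheating.

Thanks in advance.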

Answer 1

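There is no 100% reliable way to tell an ANSI encoding (e.g. Windows-1250) from UTF-8. Some ANSI files cannot be valid UTF-8, but every valid UTF-8 file could just as well be a different ANSI file. (Not to mention ASCII-only data, which is both ANSI and UTF-8 by definition, but that is a purely theoretical point.)

For example, the sequence C4 8D could be the character "č" in UTF-8, or it could be "ÄŤ" in Windows-1250. Both are possible and correct. However, 8D 9A, for example, can be "Ťš" in Windows-1250, but it is not a valid UTF-8 string.

You have to resort to some kind of heuristic, e.g.

1. If the file contains a sequence that cannot be valid UTF-8, assume it is ANSI.
2. Otherwise, if the file begins with the UTF-8 BOM (EF BB BF), assume it is UTF-8 (it might not be, but a plain-text ANSI file beginning with those characters is very improbable).
3. Otherwise, assume it is UTF-8. (Or try more heuristics, perhaps using knowledge of the text's language, etc.)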
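
The heuristic above could be sketched roughly as follows. This is only an illustration: the names TTextKind, IsValidUTF8 and GuessTextKind are made up for this example, and the whole file is assumed to have been read into an AnsiString already. Note that steps 2 and 3 collapse into one branch, because a BOM is itself a well-formed UTF-8 sequence and both steps end in "assume UTF-8".

```pascal
type
  TTextKind = (tkAnsi, tkUTF8);  // hypothetical result type for this sketch

// Returns True when S is well-formed UTF-8 (pure ASCII counts as well-formed).
function IsValidUTF8(const S: AnsiString): Boolean;
var
  i, Trail: Integer;
  B: Byte;
begin
  Result := False;
  i := 1;
  while i <= Length(S) do
  begin
    B := Ord(S[i]);
    if B < $80 then
      Trail := 0                 // 0xxxxxxx: ASCII
    else if (B >= $C2) and (B <= $DF) then
      Trail := 1                 // lead byte of a 2-byte sequence
    else if (B and $F0) = $E0 then
      Trail := 2                 // lead byte of a 3-byte sequence
    else if (B >= $F0) and (B <= $F4) then
      Trail := 3                 // lead byte of a 4-byte sequence
    else
      Exit;                      // stray continuation byte or invalid lead byte
    if i + Trail > Length(S) then
      Exit;                      // sequence is cut off at the end of the data
    if Trail >= 2 then
    begin
      // Reject overlong forms, surrogates and values above U+10FFFF.
      if (B = $E0) and (Ord(S[i + 1]) < $A0) then Exit;
      if (B = $ED) and (Ord(S[i + 1]) > $9F) then Exit;
      if (B = $F0) and (Ord(S[i + 1]) < $90) then Exit;
      if (B = $F4) and (Ord(S[i + 1]) > $8F) then Exit;
    end;
    while Trail > 0 do
    begin
      Inc(i);
      if (Ord(S[i]) and $C0) <> $80 then
        Exit;                    // expected a continuation byte (10xxxxxx)
      Dec(Trail);
    end;
    Inc(i);
  end;
  Result := True;
end;

function GuessTextKind(const Data: AnsiString): TTextKind;
begin
  if not IsValidUTF8(Data) then
    Result := tkAnsi             // step 1: cannot be UTF-8
  else
    Result := tkUTF8;            // steps 2 and 3: with or without BOM, assume UTF-8
end;
```

With the byte sequences from the example, IsValidUTF8(#$C4#$8D) accepts the data and IsValidUTF8(#$8D#$9A) rejects it, because 8D is a continuation byte with no lead byte in front of it.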

See also the method used by Notepad.

Answer 2

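If the UTF file begins with the UTF-8 byte order mark (BOM), this is easy:

function UTF8FileBOM(const FileName: string): boolean;
var
  txt: file;
  bytes: array[0..2] of byte;
  amt: integer;
begin
  FileMode := fmOpenRead;    // open read-only (fmOpenRead is declared in SysUtils)
  AssignFile(txt, FileName);
  Reset(txt, 1);             // untyped file, record size of 1 byte

  try
    // Read the first three bytes; amt receives the number actually read.
    BlockRead(txt, bytes, 3, amt);
    // The UTF-8 BOM is the byte sequence EF BB BF.
    result := (amt=3) and (bytes[0] = $EF) and (bytes[1] = $BB) and (bytes[2] = $BF);
  finally
    CloseFile(txt);
  end;
end;

Otherwise, it is much more difficult.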

Answer 3

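To summarize:

The best solution for basic use is the outdated one (that is, IsTextUnicode()).
The best solution for advanced use is to call the function above, then check for a BOM (~1 KB), then check the locale information of the particular OS, and only then expect about 98% accuracy?

Other information people may find interesting:

https://groups.google.com/forum/?lnk=st&q=delphi+WIN32+functions+to+detect+which+encoding++is+in+use&rnum=1&hl=pt-BR&pli=1#!topic/borland.public.delphi.internationalization.win32/_LgLolX25OA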

function FileMayBeUTF8(FileName: WideString): Boolean;
var
  Stream: TMemoryStream;
  BytesRead: Integer;
  ArrayBuff: array[0..127] of Byte;
  PreviousByte: Byte;
  i: Integer;
  YesSequences, NoSequences: Integer;
begin
  Result := False; // defined result when the file does not exist
  if not WideFileExists(FileName) then
    Exit;
  YesSequences := 0;
  NoSequences := 0;
  Stream := TMemoryStream.Create;
  try
    Stream.LoadFromFile(FileName);
    repeat
      // Read the next chunk from the TMemoryStream.
      BytesRead := Stream.Read(ArrayBuff, SizeOf(ArrayBuff));
      // Byte pairs that straddle two chunks are not examined.
      if BytesRead > 1 then
        for i := 1 to BytesRead - 1 do
        begin
          PreviousByte := ArrayBuff[i - 1];
          // Is the current byte a continuation byte (10xxxxxx)?
          if (ArrayBuff[i] and $C0) = $80 then
          begin
            if (PreviousByte and $C0) = $C0 then
              Inc(YesSequences)   // preceded by a lead byte (11xxxxxx)
            else if (PreviousByte and $80) = $00 then
              Inc(NoSequences);   // preceded by plain ASCII: not valid UTF-8
          end;
        end;
    until BytesRead < SizeOf(ArrayBuff);
    // Below, >= makes ASCII files count as UTF-8, which is no problem.
    // A simple > would catch only UTF-8.
    Result := (YesSequences >= NoSequences);
  finally
    Stream.Free;
  end;
end;

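Now testing this function...

In my humble opinion, the only way to get this check right is to look at the operating system's character set first, because in the end almost every case refers back to the OS in some way. There is no getting around it...

Remarks:

The WideFileExists() function is taken from TntClasses.pas (Koders.net source).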

Answer 4

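When reading, first try to parse the file as UTF-8. If it is not valid UTF-8, interpret the file as a legacy encoding (ANSI). This works for most files, because a file in a legacy encoding is very unlikely to be valid UTF-8.

What Windows calls ANSI is a character set that depends on the system locale, so the text will not work correctly on a Russian, Asian, ... Windows.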
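
To see which code page "ANSI" actually means on a given machine, the Windows API function GetACP can be queried. A minimal sketch follows; the function name DescribeAnsiCodePage is made up for this example.

```pascal
uses Windows, SysUtils;

// Hypothetical helper: reports the active ANSI code page of this system,
// e.g. 1252 on a Western European setup or 1251 on a Russian one.
function DescribeAnsiCodePage: string;
begin
  // GetACP returns the code page number as an unsigned integer.
  Result := Format('ANSI code page: %d', [Integer(GetACP)]);
end;
```

Two machines that report different numbers here will show the same ANSI file differently, which is why the file's encoding cannot be treated as a fixed property of "ANSI".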

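Although the VCL in Delphi 7 does not support Unicode, you should still work with Unicode internally and convert to ANSI only for display. I localized one of my programs to Korean and Russian, and this was the only way I got it working without major problems. You can still display the Korean localization only on a system set to Korean, but at least the text files can be edited on any system.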
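
One way to combine "try UTF-8 first, fall back to ANSI" with keeping the text as Unicode internally is sketched below. It is an assumption-laden illustration, not the answer's own code: the name DecodeUtf8OrAnsi is made up, the file content is assumed to be in an AnsiString already, and the validity check relies on MultiByteToWideChar honouring MB_ERR_INVALID_CHARS for UTF-8, which is assumed here to require Windows Vista or later.

```pascal
uses Windows;

// Hypothetical helper: decodes Raw as UTF-8 when it is well-formed,
// otherwise as the system ANSI code page, and returns a WideString.
function DecodeUtf8OrAnsi(const Raw: AnsiString): WideString;
const
  Utf8CodePage  = 65001;      // numeric value of CP_UTF8
  FailOnInvalid = $00000008;  // numeric value of MB_ERR_INVALID_CHARS
var
  CodePage: UINT;
  Flags: DWORD;
  Len: Integer;
begin
  Result := '';
  if Raw = '' then
    Exit;
  CodePage := Utf8CodePage;
  Flags := FailOnInvalid;
  // First call only asks for the required length; 0 means "not valid UTF-8".
  Len := MultiByteToWideChar(CodePage, Flags, PAnsiChar(Raw), Length(Raw), nil, 0);
  if Len = 0 then
  begin
    CodePage := CP_ACP;       // fall back to the legacy (ANSI) encoding
    Flags := 0;
    Len := MultiByteToWideChar(CodePage, Flags, PAnsiChar(Raw), Length(Raw), nil, 0);
  end;
  SetLength(Result, Len);
  if Len > 0 then
    MultiByteToWideChar(CodePage, Flags, PAnsiChar(Raw), Length(Raw),
      PWideChar(Result), Len);
end;
```

A leading UTF-8 BOM is decoded like any other character and ends up as U+FEFF at the start of the result, so the caller may want to remove it before further processing.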

Answer 5

// If the text can be decoded, treat it as UTF-8
// (relies on UTF8Decode returning an empty string when decoding fails).

function isFileUTF8(const Tex : AnsiString): boolean;
begin
  result := (Tex <> '') and (UTF8Decode(Tex) <> '');
end;

Detecting 'text' file type (ANSI vs UTF-8)
