Question

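I'm trying to write a C program that takes a file from the command line as input and determines what type of file it is. My options are:

empty
ASCII text
ISO-8859 text
UTF-8 Unicode

For ASCII, I wrote this if statement: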

if(c != EOF && c <= 127)

For ISO-8859, I wrote:

if((c != EOF && c <= 127) || (c >= 160 && c<= 255))

Both work fine when I feed them files of the type they are supposed to identify. But for UTF-8 Unicode, my if statement looks like this:

if(c != EOF && c <= 255)

This doesn't work. I keep getting the wrong result.

Can someone help me pin down UTF-8 Unicode text more precisely?

Thanks

Answer 1

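UTF-8 never uses the byte values 192-193 and 245-255. However, those bytes do not appear often in ISO-8859-1 text either, so their absence proves little, and I personally wouldn't rely on the "128-159 gap", since Windows-1252, which is often used interchangeably with ISO-8859-1 ¹, doesn't have it.

A more reliable way to detect whether a file is UTF-8 is to check that its multibyte sequences conform to the UTF-8 "syntax", rather than only checking byte ranges.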

#include <stdbool.h>
#include <stdio.h>

FILE *fp = ...;
int ch;
bool good_utf8 = true;
bool good_ascii = true;
bool empty = true;
bool good_iso8859_1 = true;
while((ch = fgetc(fp)) != EOF) {
    empty = false;
    int extra = 0;
    if((ch >> 7) == 0) {
        // ok, if the high bit is not set it's a "regular" character
    } else {
        // ASCII never has the high bit set
        good_ascii = false;
        // ISO-8859-1 gap: 128-159 holds no printable characters
        // (160, the no-break space, is a valid ISO-8859-1 character)
        if(ch >= 128 && ch <= 159) good_iso8859_1 = false;
        // check if it's a valid UTF-8 multibyte sequence
        if((ch >> 5) == 6) {
            // 110xxxxx => one continuation byte
            extra = 1;
        } else if((ch >> 4) == 14) {
            // 1110xxxx => two continuation bytes
            extra = 2;
        } else if((ch >> 3) == 30) {
            // 11110xxx => three continuation bytes
            extra = 3;
        } else {
            // there's no other valid UTF-8 sequence prefix
            good_utf8 = false;
        }
    }
    for(; good_utf8 && extra > 0; --extra) {
        ch = fgetc(fp);
        // continuation bytes count against ISO-8859-1 too if they fall in the gap
        if(ch >= 128 && ch <= 159) good_iso8859_1 = false;
        // all the stated continuation bytes must be present,
        // and they have to follow the 10xxxxxx pattern
        if(ch == EOF || ((ch >> 6) != 2)) {
            good_utf8 = false;
        }
    }
}
fclose(fp);
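
The loop above accepts any lead byte with a plausible bit pattern, including the byte values listed at the start of this answer as never occurring in UTF-8. As a rough sketch, the same check can be written over an in-memory buffer with those bytes rejected up front. The function name `looks_like_utf8` and the buffer interface are assumptions made for illustration, not part of the original answer, and the sketch still does not reject overlong three- or four-byte forms or surrogate code points.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if buf[0..len) follows the UTF-8 lead/continuation
   byte patterns and contains none of the bytes that UTF-8 never uses. */
static bool looks_like_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = buf[i++];
        int extra;
        if (b < 0x80) {
            extra = 0;                      /* plain ASCII byte */
        } else if (b == 0xC0 || b == 0xC1 || b >= 0xF5) {
            return false;                   /* 192-193 and 245-255: never valid */
        } else if ((b >> 5) == 0x06) {
            extra = 1;                      /* 110xxxxx */
        } else if ((b >> 4) == 0x0E) {
            extra = 2;                      /* 1110xxxx */
        } else if ((b >> 3) == 0x1E) {
            extra = 3;                      /* 11110xxx */
        } else {
            return false;                   /* stray continuation byte 10xxxxxx */
        }
        for (; extra > 0; --extra) {
            /* every announced continuation byte must exist and be 10xxxxxx */
            if (i >= len || (buf[i] >> 6) != 0x02) return false;
            ++i;
        }
    }
    return true;
}
```

Note that an empty buffer and a pure ASCII buffer both pass, so the "empty" and "ASCII" cases still have to be tested separately and first.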

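ISO-8859 is not a single character set but a family of related ones. I assume you mean ISO-8859-1 (AKA "Latin1"), since you are talking about the 128-159 gap; if you have to detect which variant of ISO-8859 a file uses, you have to check for different gaps.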
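
Once the four flags from the loop are known, the categories from the question can be reported by testing the most restrictive one first. The sketch below assumes a hypothetical helper named `describe`; the label strings and the "data" fallback are illustrative choices, not something the original answer specifies.

```c
#include <stdbool.h>

/* Hypothetical helper: maps the detection flags to a label.
   Order matters: ASCII text is also valid UTF-8 and valid ISO-8859-1,
   and some UTF-8 sequences also pass the ISO-8859-1 range check,
   so the stricter encodings are tested before the looser ones. */
static const char *describe(bool empty, bool good_ascii,
                            bool good_utf8, bool good_iso8859_1)
{
    if (empty)          return "empty";
    if (good_ascii)     return "ASCII text";
    if (good_utf8)      return "UTF-8 Unicode text";
    if (good_iso8859_1) return "ISO-8859 text";
    return "data";      /* none of the four categories matched */
}
```

For example, a file containing only the UTF-8 encoding of "é" (bytes 195 and 169) leaves both `good_utf8` and `good_iso8859_1` true, which is why UTF-8 is checked before ISO-8859.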
