Question

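I have a database with a large number of strings. Some of them are correctly UTF-8 encoded and some are not. So I set up a script that selects 100 strings from the db. The following function decides whether a string contains UTF-8 (whether or not it is correct):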

function detectUTF8($text) {
    return preg_match('%(?:
        [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        |\xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
        |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
        |\xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
        |\xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
        |[\xF1-\xF3][\x80-\xBF]{3}         # planes 4-15
        |\xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
        )+%xs',
    $text);
}

The output of the script is the strings that contain UTF-8 and, after a line break, the utf8_decode() version of each string. Since some strings are double encoded, I decode all of them, as you can see there.

The result is a list in which some entries have 2 strings each: one is correct, the other one is wrong. You can see it here. But how can I determine which one is correct?

I hope you can help me. Thanks in advance!

Answer 1

mb_detect_encoding($text, "UTF-8");

You may have to build PHP with --enable-mbstring or install the php-mbstring package with yum / apt, but PHP can help you detect multibyte string encodings.
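As a rough sketch of how this could be used, the snippet below checks two sample byte strings. It assumes the mbstring extension is loaded. The helper name isValidUtf8 and the sample strings are made up for illustration. Passing true as the third argument to mb_detect_encoding() turns on strict mode, which is probably what you want when the input may contain invalid byte sequences.

```php
<?php
// Hypothetical helper (not from the question): true if the whole
// string is well-formed UTF-8. Requires the mbstring extension.
function isValidUtf8(string $text): bool
{
    return mb_check_encoding($text, 'UTF-8');
}

// Sample data, assumed for illustration only.
$utf8   = "caf\xC3\xA9"; // "café" encoded as UTF-8
$latin1 = "caf\xE9";     // "café" encoded as ISO-8859-1

var_dump(isValidUtf8($utf8));
var_dump(isValidUtf8($latin1));

// Strict detection with a single candidate encoding:
// the result is either "UTF-8" or false.
var_dump(mb_detect_encoding($utf8, 'UTF-8', true));
```

Note that plain ASCII strings also count as valid UTF-8 here, unlike the detectUTF8 function in the question, which only matches multibyte sequences.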

Answer 2

You can apply utf8_decode and check whether the detectUTF8 function still finds valid UTF-8 in the result.
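One way to read this suggestion: if a string is still UTF-8 after being decoded once, it was probably double encoded and the decoded version is the correct one. Otherwise the original is the correct one. The sketch below is only a heuristic under that assumption. The helpers hasMultibyteUtf8 and pickCorrect are hypothetical names, and hasMultibyteUtf8 is a simplified stand-in for detectUTF8 that also requires the whole string to be valid. It uses mb_convert_encoding() for the decoding step because utf8_decode() is deprecated as of PHP 8.2.

```php
<?php
// Simplified stand-in for detectUTF8() (an assumption, not the original):
// the whole string is valid UTF-8 and has at least one non-ASCII byte.
function hasMultibyteUtf8(string $s): bool
{
    return mb_check_encoding($s, 'UTF-8')
        && preg_match('/[\x80-\xFF]/', $s) === 1;
}

// Hypothetical helper: undo one layer of double encoding if there is one.
function pickCorrect(string $s): string
{
    if (!hasMultibyteUtf8($s)) {
        return $s; // pure ASCII or not UTF-8 at all: nothing to undo
    }
    // Same conversion utf8_decode() performs (UTF-8 to ISO-8859-1).
    $decoded = mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8');
    // Still multibyte UTF-8 after one decode: treat the input as double encoded.
    return hasMultibyteUtf8($decoded) ? $decoded : $s;
}

// Sample data, assumed for illustration only.
$single = "caf\xC3\xA9";         // "café" in UTF-8
$double = "caf\xC3\x83\xC2\xA9"; // the UTF-8 bytes above, encoded to UTF-8 again

var_dump(pickCorrect($single) === $single);
var_dump(pickCorrect($double) === $single);
```

A correctly encoded string that happens to look like mojibake (for example the literal text "Ã©") would be misjudged by this check, so it is worth reviewing the results by hand for a sample of rows.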

PHP: 2 strings - which one is UTF-8 and which is not?
