(Re)converting corrupted UTF-8 input in PHP?

Asked: 2016-06-30 12:32:43

Tags: php unicode encoding utf-8

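My PHP script receives external JSON data from elsewhere; unfortunately, somewhere along the way the UTF-8 characters in that data get corrupted.

For example, I should receive the string " 40.80 – Origin:", but instead I get something like " 40.80 â Origin:". Inspecting the characters around the corrupted one with hexdump and utfinfo.pl, I get: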

$ echo " – O" | perl utfinfo.pl 
Got 4 uchars
Char: ' ' u: 32 [0x0020] b: 32 [0x20] n: SPACE [Basic Latin]
Char: '–' u: 8211 [0x2013] b: 226,128,147 [0xE2,0x80,0x93] n: EN DASH [General Punctuation]
Char: ' ' u: 32 [0x0020] b: 32 [0x20] n: SPACE [Basic Latin]
Char: 'O' u: 79 [0x004F] b: 79 [0x4F] n: LATIN CAPITAL LETTER O [Basic Latin]

$ echo " – O" | hexdump -C
00000000  20 e2 80 93 20 4f 0a                              | ... O.|

$ echo " â O" | perl utfinfo.pl 
Got 6 uchars
Char: ' ' u: 32 [0x0020] b: 32 [0x20] n: SPACE [Basic Latin]
Char: 'â' u: 226 [0x00E2] b: 195,162 [0xC3,0xA2] n: LATIN SMALL LETTER A WITH CIRCUMFLEX [Latin-1 Supplement]
Char: '' u: 128 [0x0080] b: 194,128 [0xC2,0x80] n: <control> [Latin-1 Supplement]
Char: '' u: 147 [0x0093] b: 194,147 [0xC2,0x93] n: <control> [Latin-1 Supplement]
Char: ' ' u: 32 [0x0020] b: 32 [0x20] n: SPACE [Basic Latin]
Char: 'O' u: 79 [0x004F] b: 79 [0x4F] n: LATIN CAPITAL LETTER O [Basic Latin]

$ echo " â O" | hexdump -C
00000000  20 c3 a2 c2 80 c2 93 20  4f 0a                    | ...... O.|

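So, basically, the UTF-8 byte sequence for the en dash, 0xE2,0x80,0x93, has somehow become 0xC3,0xA2 0xC2,0x80 0xC2,0x93. (It looks like I could simply drop the 0xC2 from the last two pairs, but I can't see how to turn 0xC3,0xA2 back into 0xE2 for the first byte.)

Anyway, I figured I could re-convert back to UTF-8 using some of PHP's built-in functions, so I wrote this small test script, test_utf8.php: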

<?php
# 40.80  – Origin:
$tstr = "40.80  â Origin:";
echo "$tstr\n";
print(mb_detect_encoding ($tstr) . "\n"); // UTF-8 here

$tstrB = mb_convert_encoding($tstr, "UTF-8");
echo "$tstrB\n";

$tstrC = iconv('ASCII', 'UTF-8//IGNORE', $tstr);
echo "$tstrC\n";

$tstrD = utf8_encode($tstr);
echo "$tstrD\n";

?>

...unfortunately, it doesn't work. This is the terminal output when I run it through the PHP CLI:

$ php test_utf8.php
40.80  â Origin:
UTF-8
40.80  â Origin:
PHP Notice:  iconv(): Detected an illegal character in input string in /path/to/test_utf8.php on line 10

40.80  â Origin:

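...that is, every attempt still gives me a corrupted string. (Note that mb_detect_encoding, for some reason, detects this string as UTF-8.)

So, how can I convert this string back to proper UTF-8?

EDIT: (Un)fortunately, SO strips out the bad characters, so you cannot reconstruct this example just by copy-pasting :( but hopefully the hexdumps provide enough information? If not, I have reposted the above as a Github Gist, where the characters seem to be preserved in the raw version...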

1 Answer:

Answer 0 (score: 0)

I think I got it, thanks to "Convert utf8-characters to iso-88591 and back in PHP":

  

utf8_decode - Converts a string with ISO-8859-1 characters encoded with UTF-8 to single-byte ISO-8859-1

So, I tried adding this to the script:
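That description fits the byte pattern in the question if one assumes the data was double-encoded: each byte of the original UTF-8 sequence was read as an ISO-8859-1 character and then encoded to UTF-8 a second time. The following is only a sketch to check that assumption (it needs the mbstring extension, and the variable names are made up for illustration):

```php
<?php
// Correct UTF-8 bytes for " – O" (EN DASH is E2 80 93), as in the first hexdump.
$good = "\x20\xE2\x80\x93\x20\x4F";

// Assumption: somewhere upstream these bytes were treated as ISO-8859-1
// and converted to UTF-8 once more.
$bad = mb_convert_encoding($good, 'UTF-8', 'ISO-8859-1');

// Same bytes as the second hexdump (without the trailing newline).
echo bin2hex($bad), "\n"; // 20c3a2c280c293204f

// The corrupted string is itself valid UTF-8, which would explain
// why mb_detect_encoding reports "UTF-8" for it.
var_dump(mb_check_encoding($bad, 'UTF-8')); // bool(true)
```

If that is what happened, reversing it means mapping each code point in the range U+0000..U+00FF back to a single byte, which is what utf8_decode does.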

$tstrF = utf8_decode($tstr);
echo "$tstrF\n";

...and that prints 40.80 – Origin:
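Two caveats, with a sketch of how they might be handled. utf8_decode is deprecated as of PHP 8.2, and mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8') performs the same conversion. Also, applying the conversion to a string that is already correct would damage it, so it seems safer to guard the call. The helper name fix_double_utf8 below is hypothetical, and the function assumes exactly one round of double encoding:

```php
<?php
/**
 * Hypothetical helper: undo one round of "UTF-8 bytes read as ISO-8859-1
 * and re-encoded as UTF-8". Returns the input unchanged when it does not
 * look double-encoded.
 */
function fix_double_utf8(string $s): string
{
    // Not valid UTF-8 at all: this helper cannot safely do anything.
    if (!mb_check_encoding($s, 'UTF-8')) {
        return $s;
    }
    // A code point above U+00FF cannot have come from a single byte,
    // so the string is not (purely) double-encoded.
    if (preg_match('/[^\x{00}-\x{FF}]/u', $s) === 1) {
        return $s;
    }
    // Map each code point U+0000..U+00FF back to the byte it came from.
    $candidate = mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8');
    // Keep the result only if those bytes form valid UTF-8 again.
    return mb_check_encoding($candidate, 'UTF-8') ? $candidate : $s;
}

// Corrupted input, written with byte escapes so it survives copy-pasting.
$bad = "40.80 \xC3\xA2\xC2\x80\xC2\x93 Origin:";
$fixed = fix_double_utf8($bad);
echo $fixed, "\n";          // 40.80 – Origin:
echo bin2hex($fixed), "\n"; // the dash is e28093 again

// A string that is already correct is left alone.
$ok = "40.80 \xE2\x80\x93 Origin:";
var_dump(fix_double_utf8($ok) === $ok); // bool(true)
```

The guard is a heuristic: a short string made only of Latin-1 characters could in principle pass both checks by accident, so fixing the encoding at the source of the JSON data remains the more reliable option.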