Question

I am trying to detect the character encoding of a string, but I cannot get the correct result. For example:

$str = "&euro; &sbquo; &fnof; &bdquo; &hellip;" ;
$str = mb_convert_encoding($str, 'Windows-1252' ,'HTML-ENTITIES') ;
// Now $str should be a Windows-1252-encoded string.
// Let's detect its encoding:
echo mb_detect_encoding($str,'Windows-1252, ISO-8859-1, UTF-8') ;

This code outputs ISO-8859-1, but it should be Windows-1252.

What is wrong here?

Edit
Updated the example in response to @raina77ow.

$str = "&euro;&sbquo;&fnof;&bdquo;&hellip;" ; // no white-spaces
$str = mb_convert_encoding($str, 'Windows-1252' ,'HTML-ENTITIES') ;
$str = "Hello $str" ; // let's add some ascii characters
echo mb_detect_encoding($str,'Windows-1252, ISO-8859-1, UTF-8') ;

Again I get the wrong result.

Answer 1

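The problem with Windows-1252 in PHP is that it will almost never be detected: as soon as the text contains any character outside the range 0x80 to 0x9f, it is not detected as Windows-1252.

This means that if your string contains a normal ASCII letter such as "A", or even a space character, PHP will say it is not valid Windows-1252 and, in your case, fall back to the next candidate encoding, which is ISO-8859-1. This is a PHP bug, see https://bugs.php.net/bug.php?id=64667.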
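As a workaround, a small heuristic can stand in for mb_detect_encoding when the only candidates are UTF-8, Windows-1252, and ISO-8859-1. This is a minimal sketch, not a general detector: guess_encoding is a hypothetical helper name, and it assumes that bytes 0x80 to 0x9f (C1 control codes in ISO-8859-1, printable characters in Windows-1252) do not occur in genuine ISO-8859-1 text.

```php
<?php
// Hypothetical helper, not part of PHP or mbstring. A heuristic guess that
// assumes the input is UTF-8, Windows-1252, or ISO-8859-1 and nothing else.
function guess_encoding(string $bytes): string
{
    // Pure 7-bit input is byte-identical in all three encodings.
    if (!preg_match('/[\x80-\xFF]/', $bytes)) {
        return 'ASCII';
    }
    // Well-formed multi-byte UTF-8 rarely occurs by accident in legacy text.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return 'UTF-8';
    }
    // 0x80-0x9F are control codes in ISO-8859-1 but printable characters
    // in Windows-1252, so their presence suggests Windows-1252.
    if (preg_match('/[\x80-\x9F]/', $bytes)) {
        return 'Windows-1252';
    }
    return 'ISO-8859-1';
}

// "Hello" followed by the euro sign and friends as raw Windows-1252 bytes.
$str = "Hello \x80\x82\x83\x84\x85";
echo guess_encoding($str), PHP_EOL;
```

For the string from the question this reports Windows-1252, because the bytes are not valid UTF-8 and fall inside 0x80 to 0x9f. Text that uses only characters above 0x9f is the same in both single-byte encodings, so the function reports ISO-8859-1 for it.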

Answer 2

Strings encoded as ISO-8859-1 and CP-1252 have different byte representations:

<?php
$str = "&euro; &sbquo; &fnof; &bdquo; &hellip;" ;
foreach (array('Windows-1252', 'ISO-8859-1') as $encoding)
{
    $new = mb_convert_encoding($str, $encoding, 'HTML-ENTITIES');
    printf('%15s: %s detected: %10s explicitly: %10s',
        $encoding,
        bin2hex($new), // hex dump of the converted bytes, zero-padded per byte
        mb_detect_encoding($new),
        mb_detect_encoding($new, array('ISO-8859-1', 'Windows-1252'))
    );
    echo PHP_EOL;
}

Result:

Windows-1252: 802082208320842085 detected:            explicitly: ISO-8859-1
  ISO-8859-1: 3f203f203f203f203f detected:      ASCII explicitly: ISO-8859-1

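...yet from what we see here, there seems to be a problem with the second parameter of mb_detect_encoding. Using mb_detect_order instead of the parameter yields very similar results.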
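To see why the distinction matters even though detection fails, one can decode the same byte under both encodings with mb_convert_encoding. This is only an illustrative sketch; the variable names are made up for the example.

```php
<?php
// The single byte 0x80, as produced by the Windows-1252 conversion above.
$byte = "\x80";

// Read as Windows-1252, 0x80 is the euro sign (U+20AC).
$asCp1252 = mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');

// Read as ISO-8859-1, 0x80 is the C1 control character U+0080.
$asLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');

echo bin2hex($asCp1252), PHP_EOL; // UTF-8 bytes of U+20AC
echo bin2hex($asLatin1), PHP_EOL; // UTF-8 bytes of U+0080
```

The first line prints e282ac and the second c280, so a wrong guess of ISO-8859-1 turns the euro sign into an invisible control character when the text is converted to UTF-8.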

Detecting the correct character encoding in PHP?
