Question

I fetch a feed from a third-party site, and sometimes I have to apply utf8_decode and at other times utf8_encode to get the desired visible output.

If I apply the same function twice by mistake, or use the wrong one, I get something even uglier, and that is what I want to avoid.

How can I detect which of the two needs to be applied to a string?

Update

The content is actually returned as UTF-8, but some parts inside it are not.

Answer 1

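I can't say I can rely on mb_detect_encoding(). I had some weird false positives with it a while back.

The most universal way I have found, which works well in every case, is: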

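// The /u modifier makes PCRE treat $string as UTF-8; an empty pattern
// matches any valid string, and preg_match() returns false on invalid UTF-8.
if (preg_match('!!u', $string))
{
   // this is utf-8
}
else
{
   // definitely not utf-8
}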
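
For reuse, the check above could be wrapped in a small function. This is only a sketch: the name is_valid_utf8 is made up for illustration, and the byte sequences are written as hex escapes so the example does not depend on how the file itself is saved.

```php
<?php
// Hypothetical helper name; wraps the preg_match() check from this answer.
function is_valid_utf8(string $string): bool
{
    // With /u, preg_match() returns 1 for any valid UTF-8 subject
    // (the empty pattern always matches) and false for invalid UTF-8.
    return preg_match('//u', $string) === 1;
}

var_dump(is_valid_utf8("Chr\xC3\xA9tiens")); // UTF-8 bytes for "é": bool(true)
var_dump(is_valid_utf8("Chr\xE9tiens"));     // ISO-8859-1 byte for "é": bool(false)
```

Note that a pure ASCII string also passes, since ASCII is a subset of UTF-8.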

Answer 2

You can use

mb_detect_encoding - Detect character encoding

The charset may also be available in the HTTP response headers or in the response data itself.

Example:

var_dump(
    mb_detect_encoding(
        file_get_contents('http://stackoverflow.com/questions/4407854')
    ),
    $http_response_header
);

Output (codepad):

string(5) "UTF-8"
array(9) {
  [0]=>
  string(15) "HTTP/1.1 200 OK"
  [1]=>
  string(33) "Cache-Control: public, max-age=11"
  [2]=>
  string(38) "Content-Type: text/html; charset=utf-8"
  [3]=>
  string(38) "Expires: Fri, 10 Dec 2010 10:40:07 GMT"
  [4]=>
  string(44) "Last-Modified: Fri, 10 Dec 2010 10:39:07 GMT"
  [5]=>
  string(7) "Vary: *"
  [6]=>
  string(35) "Date: Fri, 10 Dec 2010 10:39:55 GMT"
  [7]=>
  string(17) "Connection: close"
  [8]=>
  string(21) "Content-Length: 34119"
}
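
To use the header rather than guess, the charset can be pulled out of the Content-Type line. A minimal sketch, assuming the headers arrive as an array of strings like the dump above; the function name charset_from_headers is hypothetical.

```php
<?php
// Hypothetical helper: scans header lines (e.g. $http_response_header)
// and returns the declared charset in upper case, or null if none is found.
function charset_from_headers(array $headers): ?string
{
    foreach ($headers as $header) {
        // Matches e.g. "Content-Type: text/html; charset=utf-8"
        if (preg_match('/^Content-Type:.*?charset=["\']?([\w.:-]+)/i', $header, $m)) {
            return strtoupper($m[1]);
        }
    }
    return null;
}

var_dump(charset_from_headers([
    'HTTP/1.1 200 OK',
    'Content-Type: text/html; charset=utf-8',
])); // string(5) "UTF-8"
```

The header only states what the server claims to send, so it may still be worth validating the body separately.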

Answer 3

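function str_to_utf8 ($str) {
    // Undo one layer of UTF-8 encoding (UTF-8 -> ISO-8859-1).
    $decoded = utf8_decode($str);
    // If the result is no longer valid UTF-8, $str was encoded only once: keep it.
    if (mb_detect_encoding($decoded , 'UTF-8', true) === false)
        return $str;
    // Otherwise $str was double-encoded and $decoded is the repaired text.
    return $decoded;
}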

var_dump(str_to_utf8("« Chrétiens d'Orient » : la RATP fait marche arrière"));
//string '« Chrétiens d'Orient » : la RATP fait marche arrière' (length=56)
var_dump(str_to_utf8("Â« ChrÃ©tiens d'Orient Â» : la RATP fait marche arriÃ¨re"));
//string '« Chrétiens d'Orient » : la RATP fait marche arrière' (length=56)
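
Since utf8_decode() is deprecated as of PHP 8.2, the same idea can be sketched with mb_convert_encoding(). This is a variant, not the function above: the name fix_double_utf8 is made up, it assumes the mbstring extension is loaded, and it adds a round-trip check so that text outside ISO-8859-1 (which the conversion would turn into "?") is returned untouched.

```php
<?php
// Hypothetical variant of str_to_utf8() without utf8_decode().
function fix_double_utf8(string $str): string
{
    // Input that is not valid UTF-8 at all is outside the scope of this repair.
    if (!mb_check_encoding($str, 'UTF-8')) {
        return $str;
    }
    // Undo one layer of encoding (UTF-8 -> ISO-8859-1).
    $decoded = mb_convert_encoding($str, 'ISO-8859-1', 'UTF-8');
    // If re-encoding does not give back the input, the conversion was lossy
    // (characters outside ISO-8859-1), so the input was not double-encoded.
    if (mb_convert_encoding($decoded, 'UTF-8', 'ISO-8859-1') !== $str) {
        return $str;
    }
    // Double-encoded input decodes to valid UTF-8; single-encoded does not.
    return mb_check_encoding($decoded, 'UTF-8') ? $decoded : $str;
}

// "Chr" + double-encoded "é" + "tiens" is repaired to single-encoded UTF-8.
var_dump(fix_double_utf8("Chr\xC3\x83\xC2\xA9tiens") === "Chr\xC3\xA9tiens"); // bool(true)
```

Like the original, this is a heuristic: a correctly encoded string that happens to consist of characters such as "Ã©" would be "repaired" by mistake.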

Answer 4

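A feed (I assume you mean some kind of XML-based feed) should have an attribute in its header that tells you what the encoding is. If it doesn't, you are out of luck, because there is no reliable way to identify the encoding.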
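
For an XML feed, that attribute is the encoding pseudo-attribute of the XML declaration. A rough sketch of reading it with a regular expression; the function name xml_declared_encoding is hypothetical, and it assumes the declaration sits at the very start of the document, optionally after a UTF-8 byte order mark.

```php
<?php
// Hypothetical helper: returns the encoding named in the XML declaration
// (upper-cased), or null when the declaration or the attribute is missing.
function xml_declared_encoding(string $xml): ?string
{
    $pattern = '/^(?:\xEF\xBB\xBF)?\s*<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']/';
    if (preg_match($pattern, $xml, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

var_dump(xml_declared_encoding('<?xml version="1.0" encoding="iso-8859-1"?><rss/>'));
// string(10) "ISO-8859-1"
```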

Answer 5

Encoding autodetection is not bulletproof, but you can try mb_detect_encoding(). See also mb_check_encoding().
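
One way to use mb_check_encoding() for the original problem is to convert only when the string is not already valid UTF-8, which makes the operation safe to apply twice. A minimal sketch: the function name ensure_utf8 is made up, it needs the mbstring extension, and the ISO-8859-1 fallback is an assumption about what the feed contains.

```php
<?php
// Hypothetical helper: leaves valid UTF-8 alone, otherwise converts from
// an assumed fallback encoding (ISO-8859-1 here is a guess, not detected).
function ensure_utf8(string $str, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($str, 'UTF-8')) {
        return $str; // already valid UTF-8, so do not encode again
    }
    return mb_convert_encoding($str, 'UTF-8', $fallback);
}

$once  = ensure_utf8("Chr\xE9tiens"); // ISO-8859-1 input
$twice = ensure_utf8($once);          // second call is a no-op
var_dump($once === "Chr\xC3\xA9tiens", $once === $twice); // bool(true) bool(true)
```

If only some parts of the content are not UTF-8, the same function could be applied to each part separately, for example per feed item.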

How to detect whether utf8_decode or utf8_encode has to be applied to a string?
