Question

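I'm building a web scraper that fetches data from websites around the world, so it has to handle many different languages and encodings.

Right now I'm using the function below, which works in 99% of cases. The remaining 1% is giving me a headache.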

function convertEncoding($str) {
    return iconv(mb_detect_encoding($str), "UTF-8", $str);
}

Answer 1

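Rather than blindly trying to detect the encoding, you should first check whether the downloaded page declares its charset. The charset can be set in the HTTP response header, for example: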

Content-Type:text/html; charset=utf-8

or in the HTML as a meta tag, for example:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

Only if neither is available should you try to guess the encoding with mb_detect_encoding() or some other method.
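
The lookup order described above could be sketched roughly as follows. The names findDeclaredCharset and convertToUtf8 are hypothetical, the regular expressions are simplifying assumptions that will not cover every way a page can declare its charset, and the sketch assumes the declared charset is one that mbstring recognizes:

```php
<?php
// Hypothetical helper: returns the charset a page declares, or null if none is found.
// $contentType is the raw value of the HTTP Content-Type header (null if absent).
function findDeclaredCharset(?string $contentType, string $html): ?string
{
    // 1. Prefer the charset parameter of the HTTP Content-Type header.
    if ($contentType !== null
        && preg_match('/charset\s*=\s*"?([\w.:-]+)/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }

    // 2. Otherwise look for a charset declaration inside a <meta> tag.
    //    Assumption: this simple pattern is enough for both the http-equiv
    //    form and the shorter <meta charset="..."> form.
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)/i', $html, $m)) {
        return strtoupper($m[1]);
    }

    // 3. Nothing declared: the caller has to fall back to guessing.
    return null;
}

// Hypothetical wrapper: converts a downloaded page body to UTF-8.
function convertToUtf8(string $body, ?string $contentType = null): string
{
    $from = findDeclaredCharset($contentType, $body);
    if ($from === null) {
        // Last resort: guess. The candidate list is an assumption.
        $from = mb_detect_encoding($body, ['UTF-8', 'ISO-8859-1'], true) ?: 'UTF-8';
    }
    return mb_convert_encoding($body, 'UTF-8', $from);
}
```

For example, convertToUtf8($body, 'text/html; charset=ISO-8859-1') would convert from ISO-8859-1 without any guessing, while a page with neither declaration falls through to mb_detect_encoding().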

Answer 2

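Because some charsets are subsets of others, it is impossible to detect a string's charset with 100% accuracy. Set the charset explicitly whenever you can, and don't mix iconv and mbstring functions. I'd suggest using a function like this one and passing in the source charset wherever possible: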

function convertEncoding($str, $from = 'auto', $to = 'UTF-8') {
    // Guess the source encoding only when the caller does not supply one.
    if ($from == 'auto') {
        $from = mb_detect_encoding($str);
    }
    return mb_convert_encoding($str, $to, $from);
}
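
One caveat with the function above: mb_detect_encoding() returns false when none of its candidate encodings match, and passing that result straight to mb_convert_encoding() does not give a meaningful conversion. A possible guard is sketched below. The name convertEncodingSafe, the extra $candidates parameter and its default list are assumptions for illustration, not part of the original answer:

```php
<?php
// Hypothetical variant of convertEncoding() that handles a failed detection.
// Returns null when the source encoding cannot be determined.
function convertEncodingSafe(
    string $str,
    string $from = 'auto',
    string $to = 'UTF-8',
    array $candidates = ['UTF-8', 'ISO-8859-1']
): ?string {
    if ($from === 'auto') {
        // Strict mode rejects candidates for which $str is not valid.
        $detected = mb_detect_encoding($str, $candidates, true);
        if ($detected === false) {
            return null; // Detection failed; let the caller decide what to do.
        }
        $from = $detected;
    }
    return mb_convert_encoding($str, $to, $from);
}
```

Passing the charset explicitly, as in convertEncodingSafe($str, 'ISO-8859-1'), skips detection entirely, which is the case the answer recommends aiming for.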

Answer 3

You can try utf8_encode($str), which converts from ISO-8859-1 to UTF-8.

http://www.php.net/manual/en/function.utf8-encode.php#89789

Alternatively, you can use the Content-Type meta tag

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

from the header of the scraped content.

How to convert any character encoding to UTF-8 in PHP

3 answers: