Question

I'm trying to write my own web crawler using cURL in PHP.

[...]
mb_internal_encoding('UTF-8');
mb_language('uni');
$this->_curl = curl_init();
curl_setopt($this->_curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($this->_curl, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($this->_curl, CURLOPT_FOLLOWLOCATION, false);
curl_setopt($this->_curl, CURLOPT_MAXREDIRS, 0);
curl_setopt($this->_curl, CURLOPT_TIMEOUT, 10);
curl_setopt($this->_curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; de; rv:1.9.2.10) Gecko/20100914 Firefox/3.6.10');
curl_setopt($this->_curl, CURLOPT_HEADER, true);
curl_setopt($this->_curl, CURLOPT_RETURNTRANSFER, true);
$header = array(
            "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language: de-de,de;q=0.8,en-us;q=0.5,en;q=0.3",
            "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7",
            "Keep-Alive: 115",
            "Connection: keep-alive",
);
curl_setopt($this->_curl, CURLOPT_HTTPHEADER, $header);
curl_setopt($this->_curl, CURLOPT_URL, $url);
curl_setopt($this->_curl, CURLOPT_POST, false);
curl_setopt($this->_curl, CURLOPT_POSTFIELDS, array());
curl_setopt($this->_curl, CURLOPT_HTTPGET, true);
$page = curl_exec($this->_curl);
[...]

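The problem is the charset of the websites. As you can see at

http://blog.163.com/drewes_4711/blog/static/179317021201151624826557/

there is a header "Content-Type: ...;charset=GBK", so I could use mb_convert_encoding($content, "UTF-8", "GBK"). But what should I do with

http://tech.hexun.com/2011-06-21/130756909.html

Here it seems to be the same charset, but it is not given in the HTTP header. So I have lots of problems with German umlauts, Chinese, and other Asian languages... Is there any module or code snippet available to determine the charset of any HTML page downloaded with cURL?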

Answer 1

The second link contains:

<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />

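All the data before that looks like plain ASCII. So if the HTTP headers give no clue, you could just parse the page (assuming plain ASCII rather than UTF-8, which might break) until you find that meta tag.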
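A minimal sketch of that lookup order, in case it helps: check the HTTP Content-Type header first, then scan the start of the body for a meta tag. The function name detect_declared_charset and the 4096-byte scan window are my own assumptions, not anything from a library, and the regexes are deliberately loose rather than a full HTML parser.

```php
<?php
// Hypothetical helper (name and 4096-byte window are assumptions):
// returns the declared charset in upper case, or null if none is declared.
function detect_declared_charset(string $headers, string $body): ?string
{
    // 1. HTTP header, e.g. "Content-Type: text/html;charset=GBK"
    if (preg_match('/^Content-Type:[^\r\n]*charset\s*=\s*["\']?([A-Za-z0-9._:-]+)/mi', $headers, $m)) {
        return strtoupper($m[1]);
    }

    // 2. <meta> tag near the top of the body. This matches both
    //    <meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
    //    and the shorter form <meta charset="utf-8">.
    $head = substr($body, 0, 4096);
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._:-]+)/i', $head, $m)) {
        return strtoupper($m[1]);
    }

    // Nothing declared anywhere.
    return null;
}
```

Since the question sets CURLOPT_HEADER to true, $page holds headers and body together. They can be separated with curl_getinfo($this->_curl, CURLINFO_HEADER_SIZE): substr($page, 0, $size) gives the headers and substr($page, $size) gives the body.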

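Obviously there is no guarantee this will work. If the server doesn't send the encoding and the page doesn't have that meta tag either, you're out of luck. There is no general way to detect the encoding of arbitrary data.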
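For that last case, about the best you can do is a heuristic guess. Here is a small sketch, assuming the mbstring extension is loaded: keep the bytes as UTF-8 if they validate as UTF-8, otherwise fall back to a default you choose for the sites you crawl. The function name guess_undeclared_charset and the 'GBK' default are assumptions for illustration, and the result can still be wrong.

```php
<?php
// Hypothetical last-resort helper: pick a charset when none was declared.
// $default is an assumption the caller makes about the sites being crawled.
function guess_undeclared_charset(string $body, string $default = 'GBK'): string
{
    // Valid UTF-8 (which includes pure ASCII) is treated as UTF-8.
    // Anything that fails validation is assumed to be in $default.
    return mb_check_encoding($body, 'UTF-8') ? 'UTF-8' : $default;
}
```

The result can then be passed as the source encoding to mb_convert_encoding($body, 'UTF-8', $charset). This only separates UTF-8 from everything else; it cannot tell two legacy encodings such as GBK and ISO-8859-1 apart.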
