Question

I have this code:

request({ url: 'http://www.myurl.com/' }, function(error, response, html) {
  if (!error && response.statusCode == 200) {
    console.log($('title', html).text());
  }
});

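However, the sites I crawl can use different charsets (utf8, iso-8859-1, etc.). How can I detect the charset and always encode/decode the HTML to the correct encoding (utf8)?

Thanks, and sorry for my English ;)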

Answer 1

A website can declare its content encoding either in the Content-Type HTTP header or in a Content-Type meta tag inside the returned HTML, for example:

<meta http-equiv="Content-Type" content="text/html; charset=latin1"/>

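You can use the charset module to check both of these automatically. Not every website or server specifies an encoding, though, so you will need to fall back to detecting the charset from the data itself. The jschardet module can help with that.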
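To make the "check both places" step concrete, here is a minimal sketch of the idea using only plain JavaScript. The helper name sniffCharset is hypothetical and is not the charset module's API; it only illustrates looking in the header first and in the start of the HTML second.

```javascript
// Hypothetical helper (not part of any library): look for a charset in the
// Content-Type header first, then in a <meta> tag near the top of the HTML.
// Returns the lowercased charset name, or null if neither declares one.
function sniffCharset(headers, html) {
  const pattern = /charset\s*=\s*["']?([\w-]+)/i;

  const fromHeader = pattern.exec(headers['content-type'] || '');
  if (fromHeader) return fromHeader[1].toLowerCase();

  // Only scan the beginning of the document; the declaration is expected in <head>.
  const fromMeta = pattern.exec(html.slice(0, 1024));
  return fromMeta ? fromMeta[1].toLowerCase() : null;
}
```

When this returns null, that is the case where detection from the data itself is needed.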

Once you have determined the charset, you can use the iconv module to do the actual conversion. Here is a complete example:

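var charset = require('charset');
var jschardet = require('jschardet');
var Iconv = require('iconv').Iconv;

request({url: 'http://www.myurl.com/', encoding: 'binary'}, function(error, response, html) {
    // Check the header and meta tag first, then fall back to detection from the data.
    var enc = charset(response.headers, html);
    enc = (enc || jschardet.detect(html).encoding).toLowerCase();
    var buf = Buffer.from(html, 'binary');
    if (enc == 'utf8' || enc == 'utf-8') {
        html = buf.toString('utf-8');  // already UTF-8: just decode the raw bytes
    } else {
        html = new Iconv(enc, 'UTF-8//TRANSLIT//IGNORE').convert(buf).toString('utf-8');
    }
    console.log($('title', html).text());
});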

Answer 2

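First, you can send an Accept-Charset header, which asks the site not to send data in other charsets (servers are free to ignore it, so you still need to check the response).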
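As a small sketch of what that could look like with the request module from the question: the options object below sets the header and also asks for the raw bytes, so the body can be decoded later. The URL is the placeholder from the question, and the choice of utf-8 as the only accepted charset is an assumption.

```javascript
// Request options that ask the server for UTF-8.
const options = {
  url: 'http://www.myurl.com/',           // placeholder URL from the question
  headers: { 'Accept-Charset': 'utf-8' },
  encoding: null                          // with request, null returns the body as a raw Buffer
};

// Usage (not executed here, since it needs the request module and network access):
// request(options, function (error, response, body) { /* body is a Buffer */ });
```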

Once you receive the response, you can look at the Content-Type header, find the charset entry, and process the body accordingly.

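Another hack (which I have used in the past) for when the content encoding is unknown: try decoding with every possible encoding and stick with the one that doesn't throw an exception (I did this in Python, though).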
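A rough Node.js version of that hack might look like the sketch below. It relies on the built-in TextDecoder with the fatal option, and assumes a Node build with full ICU so that labels such as iso-8859-1 are available. The helper name decodeWithFirstValid and the candidate list are hypothetical. Note that single-byte encodings accept almost any byte sequence, so strict encodings such as utf-8 should come first in the list.

```javascript
// Hypothetical helper: return the first candidate encoding that decodes the
// bytes without an error, together with the decoded text, or null if none fit.
function decodeWithFirstValid(bytes, candidates) {
  for (const encoding of candidates) {
    try {
      // fatal: true makes decode() throw on invalid byte sequences
      // instead of silently inserting replacement characters.
      const text = new TextDecoder(encoding, { fatal: true }).decode(bytes);
      return { encoding, text };
    } catch (err) {
      // Invalid in this encoding (or the label is unsupported); try the next one.
    }
  }
  return null;
}

// The byte 0xE9 on its own is not valid UTF-8, but it is "é" in ISO-8859-1.
const result = decodeWithFirstValid(
  Buffer.from([0x63, 0x61, 0x66, 0xe9]),
  ['utf-8', 'iso-8859-1']
);
console.log(result);
```

This only proves that the bytes are valid in the chosen encoding, not that the choice is right, so it is best treated as a last resort after the header and meta tag checks.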

How do I encode/decode charset encodings in NodeJS?
