Question

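I'm using this code to download a webpage (with the request library) and decode its contents (with the iconv-lite library). The loader function finds some elements in the page body and returns them as a JavaScript object.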

request.get({url: url, encoding: null}, function(error, response, body) {
  // encoding: null makes request return the body as a raw Buffer
  // if the webpage exists, process it; otherwise send a 'not found' error
  if (!error && response.statusCode === 200) {
    body = iconv.decode(body, "iso-8859-1");
    const $ = cheerio.load(body);
    async function show() {
      var data = await loader.getDay($, date, html_tags, thumbs, res, image_thumbnail_size);
      res.send(JSON.stringify(data));
    }
    show();
  } else {
    res.status(404);
    res.send(JSON.stringify({"error":"No content for this date."}));
  }
});

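The page is encoded in ISO-8859-1, and its content looks normal on the site, with no broken characters. When I don't use iconv-lite, some characters, e.g. ü, come out garbled. Now that I use the library as in the code above, most characters look fine, but some, e.g. š, show up as an empty box, even though the website displays them without any problem.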

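I'm sure this isn't a cheerio issue, because when I print the output with res.send(body); or res.send(JSON.stringify({"body":body})); the empty-box characters are still there. Maybe it's an Express issue? Is there a way to fix this?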

EDIT: I copied the empty-box character into Google and it changed to Âš; maybe that's important.

Also, I tried changing Express's output with res.charset, but that didn't help.

Answer 1

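I used this website: https://validator.w3.org/nu/?doc=https%3A%2F%2Fapod.nasa.gov%2Fapod%2Fap170813.html to check whether the page I'm scraping really is encoded in ISO-8859-1, and it turned out to be Windows-1252. I changed the encoding in my API (var encoding = 'windows-1252') and now it works fine.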
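To see why the encoding name matters, here is a minimal sketch using only Node built-ins. It assumes the page stores š as the single byte 0x9A, which is where Windows-1252 places it. The helpers decodeCp1252 and decodeLatin1 are hypothetical names written for illustration and are not part of iconv-lite; in the handler from the question the actual fix is just the encoding name passed to iconv.decode.

```javascript
// Windows-1252 differs from ISO-8859-1 only in the byte range 0x80-0x9F.
// Code points for those 32 bytes (unassigned bytes keep their C1 control value,
// as in the WHATWG Encoding Standard's windows-1252 index).
const CP1252_HIGH = [
  0x20ac, 0x0081, 0x201a, 0x0192, 0x201e, 0x2026, 0x2020, 0x2021,
  0x02c6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008d, 0x017d, 0x008f,
  0x0090, 0x2018, 0x2019, 0x201c, 0x201d, 0x2022, 0x2013, 0x2014,
  0x02dc, 0x2122, 0x0161, 0x203a, 0x0153, 0x009d, 0x017e, 0x0178,
];

// Hypothetical helper (not part of iconv-lite): decode a Buffer as Windows-1252.
function decodeCp1252(buf) {
  let out = '';
  for (const byte of buf) {
    const inHighRange = byte >= 0x80 && byte <= 0x9f;
    out += String.fromCharCode(inHighRange ? CP1252_HIGH[byte - 0x80] : byte);
  }
  return out;
}

// Strict ISO-8859-1 maps every byte N straight to code point U+00NN.
// Node's built-in 'latin1' Buffer encoding behaves this way.
function decodeLatin1(buf) {
  return buf.toString('latin1');
}

const raw = Buffer.from([0x9a]); // assumed byte for "š" on the scraped page

const viaLatin1 = decodeLatin1(raw); // U+009A, an invisible control character
const viaCp1252 = decodeCp1252(raw); // U+0161, "š"

// Sending U+009A as UTF-8 produces the bytes C2 9A; a tool that reads those
// bytes as Windows-1252 shows "Âš", which is consistent with the question's EDIT.
const mojibake = decodeCp1252(Buffer.from(viaLatin1, 'utf8'));

console.log(viaLatin1.codePointAt(0).toString(16)); // 9a
console.log(viaCp1252); // š
console.log(mojibake); // Âš
```

Decoded as ISO-8859-1, the byte becomes the control character U+009A, which browsers typically draw as an empty box. Decoded as Windows-1252, the same byte becomes š. Characters such as ü sit outside 0x80-0x9F, so both encodings agree on them, which is why most of the page looked fine.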

iconv-lite doesn't decode everything correctly, even though I use the correct encoding
