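I want to get product images from Google Shopping, and I am using cheerio and Node.js to do it. For example, if I browse to the link for xbox (https://www.google.com/search?tbm=shop&hl=de-de&tbs=vw:l&q=xbox) and inspect an image in the developer tools, I get a very long base64 string. But when I do the same thing in cheerio, the string I get looks like this: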
data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==
In addition, when I open the image in Chrome, it is encoded as webp.
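As a quick sanity check, the short string can be decoded to see what it actually contains. The following is only a sketch: `describeGifDataUri` is a hypothetical helper name, and it assumes the value is a GIF data URI exactly as pasted above. It reads the signature and the width and height fields from the GIF header.

```javascript
// The string cheerio returned, copied from the question.
const dataUri =
  'data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==';

// Hypothetical helper: split a base64 data URI and read the GIF header fields.
function describeGifDataUri(uri) {
  const match = /^data:([^;,]+);base64,(.*)$/.exec(uri);
  if (!match) {
    throw new Error('not a base64 data URI');
  }
  const bytes = Buffer.from(match[2], 'base64');
  return {
    mime: match[1],
    byteLength: bytes.length,
    // Bytes 0-5 of a GIF hold the ASCII signature and version.
    signature: bytes.toString('ascii', 0, 6),
    // Bytes 6-7 and 8-9 hold width and height, little-endian.
    width: bytes.readUInt16LE(6),
    height: bytes.readUInt16LE(8),
  };
}

const info = describeGifDataUri(dataUri);
console.log(info);
```

Run with Node, this reports a 43-byte `GIF89a` image of 1x1 pixels, which suggests the value is a tiny placeholder rather than a truncated copy of the product picture.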
I have already tried changing the User-Agent with the following code:
const customHeaderRequest = request.defaults({
  headers: {
    'User-Agent':
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
  },
});
But I still get the shortened image data. Is there anything else I can try to get the whole image string with cheerio?
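Edit: how to reproduce the error:
First, create a custom header request with the given user agent. Then pass it the URL mentioned above (xbox on Google Shopping); this should return a body variable containing the HTML, which you can pass to cheerio. (I also built in a little code that saves a dump of the body, so that the server's response can be compared with what cheerio extracts from it.) The script then looks for the given class, which contains the images with the base64 data. On the console you should now see the malformed data.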
const cheerio = require('cheerio');
const request = require('request');
const fs = require('fs');

const customHeaderRequest = request.defaults({
  headers: {
    'User-Agent':
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
  },
});

const url = 'https://www.google.com/search?tbm=shop&hl=de-de&tbs=vw:l&q=xbox';

customHeaderRequest.get(url, function (err, resp, body) {
  // stop early if the request itself failed
  if (err) {
    return console.log(err);
  }

  // write the fetched body to a file for further inspection
  fs.writeFile('dump.html', body, function (err) {
    if (err) {
      return console.log(err);
    }
    console.log('The file was saved!');
  });

  const $ = cheerio.load(body);
  $('.sh-dlr__list-result .sh-dlr__content').each((index, value) => {
    // look for the image elements inside each result entry
    $(value).find('.TL92Hc').each(function (idx, ele) {
      // print the src attribute of the image
      console.log($(ele).attr('src'));
    });
  });
});
If I run this script, it produces the following output:
C:\mydir>node reproduce_error.js
data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==
data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==
data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==
data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==
data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==
data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==
data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==
data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==
data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==
data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==
data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==
data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==
data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==
data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==
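To compare the server's response with what cheerio extracts, one option is to scan the saved dump.html for every base64 image string, wherever it appears in the markup. This is a rough diagnostic sketch, not a fix: `findDataUris` is a hypothetical helper, the inline sample (including its short webp string) is made up for illustration, and the pattern assumes the base64 data appears unescaped in the HTML.

```javascript
const fs = require('fs');

// Hypothetical helper: list every base64 image data URI found in a string of HTML.
// Assumption: the base64 text is not escaped (for example with \x3d for "=").
function findDataUris(html) {
  const pattern = /data:(image\/[a-z0-9.+-]+);base64,([A-Za-z0-9+\/]+=*)/gi;
  const found = [];
  let match;
  while ((match = pattern.exec(html)) !== null) {
    found.push({ mime: match[1], base64Length: match[2].length });
  }
  return found;
}

// Made-up sample so the sketch runs without dump.html: one placeholder GIF in
// an img tag and one short, illustrative webp string inside a script tag.
const sample =
  '<img src="data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==">' +
  '<script>var s = "data:image/webp;base64,UklGRhoAAABXRUJQ";</script>';
const sampleResult = findDataUris(sample);
console.log(sampleResult);

// If the dump written by the reproduction script exists, report on it as well.
if (fs.existsSync('dump.html')) {
  const uris = findDataUris(fs.readFileSync('dump.html', 'utf8'));
  const longest = uris.reduce((max, u) => Math.max(max, u.base64Length), 0);
  console.log('data URIs in dump.html:', uris.length, 'longest base64 length:', longest);
}
```

If the dump contains only 60-character GIF strings, the long image data is not in the raw response at all; if longer strings show up elsewhere in the file, they are present in the response but not in the `src` attributes that the selector reads.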