I'm building a Node.js web scraper and trying to write it in a functional style. The code is below:
const Promise = require('bluebird');
const fetch = require('node-fetch');
const cheerio = require('cheerio');
const scrapeUri = uri => fetch(uri); // how should i pass the uri from here
const fetchURIs = URIs => Promise.all(URIs.map(scrapeUri));
const getBodies = pages => Promise.all(pages.map(page => page.text()));
const toSource = source => cheerio.load(source);
const shouldScrape = ($) => {
  const shouldIndex = $('meta[name="robots"]').attr('content');
  if (['noindex', 'nofollow'].indexOf(shouldIndex) !== -1) {
    return false;
  }
  return true;
};
const objectifyContent = ($) => { // to be accessed here
  return {
    meta: {
      index_timestamp: new Date(),
      title: $('title').html(),
      // TODO: this will totally fail in some instances, need to pass uri from initial instance
      uri: $('link[rel="canonical"]').attr('href'),
      description: $('meta[name="description"]').attr('content'),
    },
  };
};
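In objectifyContent, what is a pure way to access the uri that was originally passed to scrapeUri, instead of reading the canonical link to get the page's URL? I know I could set a variable and let it be captured through scope, but I'd like to know whether there is a cleaner, more functional approach in Node.js.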
The caller would look like:
fetchURIs(myUris).then(values => getBodies(values).then(sources => res.send(sources.map(toSource).filter(shouldScrape).map(objectifyContent))));
Answer 0 (score: 0)
Modify scrapeUri so that it passes the URI along through the promise, and adjust the downstream handlers accordingly:
const scrapeUri = uri => fetch(uri).then(
webpage => [uri, webpage]
)
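To make "adjust the downstream handlers" concrete, here is a minimal sketch of how the rest of the pipeline might look once the URI travels with the page. It is one possible arrangement, not the only one. It uses a plain object ({ uri, body }) rather than the two-element array, and the names scrapePage, withDom, scrape, fetchFn and parse are hypothetical, introduced only for this illustration. fetchFn is assumed to behave like node-fetch (it resolves to a response with a text() method) and parse is assumed to behave like cheerio.load.

```javascript
// Sketch only: fetchFn and parse are injected so nothing here relies on shared scope.
// fetchFn is assumed to behave like node-fetch, parse like cheerio.load.

// Fetch one page and keep the URI next to its body.
const scrapePage = fetchFn => uri =>
  fetchFn(uri)
    .then(response => response.text())
    .then(body => ({ uri, body }));

// Fetch all pages in parallel.
const scrapeAll = fetchFn => uris =>
  Promise.all(uris.map(scrapePage(fetchFn)));

// Replace the raw body with a query function, still carrying the URI.
const withDom = parse => ({ uri, body }) => ({ uri, $: parse(body) });

// Same robots check as in the question, reading $ from the record.
const shouldScrape = ({ $ }) => {
  const robots = $('meta[name="robots"]').attr('content');
  return !['noindex', 'nofollow'].includes(robots);
};

// The URI now comes from the record instead of the canonical link.
const objectifyContent = ({ uri, $ }) => ({
  meta: {
    index_timestamp: new Date(),
    title: $('title').html(),
    uri,
    description: $('meta[name="description"]').attr('content'),
  },
});

// Whole pipeline: URIs in, array of content objects out.
const scrape = (fetchFn, parse) => uris =>
  scrapeAll(fetchFn)(uris).then(pages =>
    pages.map(withDom(parse)).filter(shouldScrape).map(objectifyContent));
```

With the modules from the question, the caller would then be something like scrape(fetch, cheerio.load)(myUris).then(results => res.send(results)). Every step receives and returns a plain value, so no variable has to be shared across scopes.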