Question

I am trying to build a web crawler using the simplecrawler npm package with Electron.

In main.js:

// Name matches the siteCrawler function defined in crawler.js
siteCrawler(
    inputUrl,
    (details) => ipcRenderer.send('Data', details),
    (error) => ipcRenderer.send('Error', error),
    () => ipcRenderer.send('Done'),
);

In crawler.js:

function siteCrawler(url, dataCb, errorCb, doneCb) {
    // initialised crawler...

    // On fetch complete, I get each queueItem and the response Buffer
    const dataDetails = { url: queueItem.url, depth: queueItem.depth }; // and many other details
    insertDataDetails(dataDetails); // write to the DB
    extractMoreDetails(responseBuffer);
    // uses cheerio to extract h1, h2, etc. and adds those to other tables as well

    dataCb(dataDetails);
}

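So in the scenario above, all the data goes to its respective table (dataDetails, h1Details, etc.), but only dataDetails is returned to main through the dataCb callback, and from there it is sent to the front end via an IPC call.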

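The problem I am facing is that the crawler may traverse thousands of URLs within a given domain, and I want dynamic pagination over live data (dataCb arrives like a live feed). Pagination, however, is something I would have to do by querying the DB directly rather than through callbacks and IPC calls.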

The same problem applies to h1Data and h2Data.

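Since the DB is still being populated while rendering happens, how do I make this efficient and make it look real-time on the React side? If I use callbacks, I have no control over the amount of data sent over IPC, and the callbacks send only dataDetails, not the h1 and h2 details.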
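To make the callback concern concrete, here is a minimal sketch of one way the volume of IPC traffic could be bounded: buffer the results and forward them in batches. `createBatcher`, `maxItems`, and `intervalMs` are hypothetical names introduced for illustration; they are not part of simplecrawler or Electron.

```javascript
// Hypothetical helper: buffers crawl results and forwards them in batches,
// so at most one message is sent per `maxItems` results or per `intervalMs`.
function createBatcher(send, { maxItems = 50, intervalMs = 250 } = {}) {
  let buffer = [];
  let timer = null;

  // Send whatever is buffered and cancel any pending timed flush.
  function flush() {
    if (timer !== null) {
      clearTimeout(timer);
      timer = null;
    }
    if (buffer.length === 0) return;
    const batch = buffer;
    buffer = [];
    send(batch);
  }

  // Queue one item; flush immediately when the batch is full,
  // otherwise make sure a timed flush is scheduled.
  function push(item) {
    buffer.push(item);
    if (buffer.length >= maxItems) {
      flush();
    } else if (timer === null) {
      timer = setTimeout(flush, intervalMs);
    }
  }

  return { push, flush };
}
```

Under these assumptions, `batcher.push` would be passed to siteCrawler as dataCb, the `send` function would wrap `ipcRenderer.send('Data', batch)`, and `batcher.flush()` would be called once more before sending 'Done' so the final partial batch is not lost.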

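If I use DB calls, the DB might be empty at render time because the crawler has only just started. Even if the DB already has data, more rows will be added after rendering, because the crawler is still running in the background.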
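A rough sketch of how DB-backed pagination might cope with rows arriving after render, assuming rows are append-only and queried in insertion order. The helper names and the SQL in the comment are assumptions for illustration, not an existing schema or API.

```javascript
// Translate a 1-based page number into LIMIT/OFFSET values, e.g. for a query
// such as: SELECT * FROM dataDetails ORDER BY id LIMIT ? OFFSET ?
// (the `id` column and the query shape are assumptions).
function pageWindow(page, pageSize) {
  return { limit: pageSize, offset: (page - 1) * pageSize };
}

// Decide what the UI has to do when the row count grows from oldTotal to
// newTotal. Assumes rows are only ever appended, so a page that was already
// full cannot change; only the page count does.
function onTotalChanged(page, pageSize, oldTotal, newTotal) {
  const { offset } = pageWindow(page, pageSize);
  const pageWasFull = oldTotal >= offset + pageSize;
  return {
    pageCount: Math.max(1, Math.ceil(newTotal / pageSize)),
    refetchRows: !pageWasFull && newTotal > oldTotal,
  };
}
```

With this, the crawler side would only need to send a small "total changed" notification over IPC, and the React side would re-query the visible page only when `refetchRows` is true. The same approach would apply to the h1 and h2 tables.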

Any help, insights, or suggestions are welcome. I am also happy to change the approach.

I am using electron, sqlite3, simplecrawler, cheerio, and react (npm packages).

Building a web crawler using Electron, simplecrawler, and sqlite

0 answers: