Question

I have created a scraper using puppeteer and Node.js (Express). The idea is that when the server receives an HTTP request, my app starts scraping the page.

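The problem is that my app sometimes receives multiple HTTP requests at once. The scraping process then starts over and over again until no more requests come in. How can I start scraping for only one request and queue the other requests until the first scraping process has finished?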

Currently, I have tried node-request-queue with the code below, but without any luck.

var express = require("express");
var app = express();
var reload = require("express-reload");
var bodyParser = require("body-parser");
const router = require("./routes");
const RequestQueue = require("node-request-queue");

app.use(bodyParser.urlencoded({ extended: true }));
app.use(bodyParser.json());

var port = process.env.PORT || 8080;

app.use(express.static("public")); // static assets eg css, images, js

let rq = new RequestQueue(1);

rq.on("resolved", res => {})
  .on("rejected", err => {})
  .on("completed", () => {});

rq.push(app.use("/wa", router));

app.listen(port);
console.log("Magic happens on port " + port);

Answer 1

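You can use puppeteer-cluster (disclaimer: I am the author). You can set up a cluster with a pool of only one worker, so the jobs given to the cluster are executed one after another.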

As you did not say what your puppeteer script should do, in this code example I extract the title of a page (given via /wa?url=...) and provide the result in the response.

const express = require('express');
const { Cluster } = require('puppeteer-cluster');

const app = express();

// await is only valid inside an async function in a CommonJS script
(async () => {
    // setup the cluster with only one worker in the pool
    const cluster = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_CONTEXT,
        maxConcurrency: 1,
    });

    // define your task (in this example we extract the title of the given page)
    await cluster.task(async ({ page, data: url }) => {
        await page.goto(url);
        return await page.evaluate(() => document.title);
    });

    // Listen for the request
    app.get('/wa', async function (req, res) {
        // cluster.execute will run the job with the workers in the pool. As there is only one worker
        // in the pool, the jobs will be run sequentially
        const result = await cluster.execute(req.query.url);
        res.end(result);
    });

    app.listen(process.env.PORT || 8080);
})();

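This is a minimal example. You might want to catch any errors in your listener. For more information, check out the more complex example of a screenshot server using express in the repository.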
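As a hedged sketch of the error handling mentioned above: the listener can wrap cluster.execute in try/catch, so a failed job produces an error response instead of a request that never completes. makeTitleHandler is a hypothetical helper name, and the 400/500 status codes are my own assumption. The cluster argument only needs an execute(url) method that returns a promise, which is how the example above uses it.

```javascript
// Hypothetical helper: builds an Express-style handler around a cluster object.
// `cluster` only needs an execute(url) method that returns a promise.
function makeTitleHandler(cluster) {
  return async function (req, res) {
    const url = req.query.url;
    if (!url) {
      // Assumption: requests without a url parameter are rejected up front.
      res.status(400).end("Missing url parameter");
      return;
    }
    try {
      // The job is queued and awaited like in the minimal example.
      const title = await cluster.execute(url);
      res.end(title);
    } catch (err) {
      // A job that throws (navigation error, timeout, ...) rejects the promise.
      res.status(500).end("Error: " + err.message);
    }
  };
}
```

It would be registered with something like app.get('/wa', makeTitleHandler(cluster)) in place of the inline listener.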

Answer 2

node-request-queue was created for the request package, which is different from express.

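You can build the queue with p-queue, the simplest promise queue library. It has concurrency support and reads more clearly than the other libraries. Later on you can easily switch from promises to a robust queue such as bull.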

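This is how you create a queue:

const PQueue = require("p-queue");
const queue = new PQueue({ concurrency: 1 });

This is how you add an async function to the queue; it returns the resolved data if you await it:

queue.add(() => scrape(url));

So instead of adding the route to the queue, just remove the other lines around it and keep the router as is:

// here goes one route
app.use('/wa', router);

Make sure to include the queue inside the routes, not the other way around. Create only one queue, in your routes file or wherever you run the scraper.

The scraper itself lives in its own file; you can put whatever you want there, mine is just a working dummy. This is what goes in one of your router files: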

const routes = require("express").Router();

const PQueue = require("p-queue");
// create a new queue, and pass how many you want to scrape at once
const queue = new PQueue({ concurrency: 1 });

// our scraper function lives outside route to keep things clean
// the dummy function returns the title of provided url
const scrape = require('../scraper');

async function queueScraper(url) {
  return queue.add(() => scrape(url));
}

routes.post("/", async (req, res) => {
  const result = await queueScraper(req.body.url);
  res.status(200).json(result);
});

module.exports = routes;
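To illustrate why a queue with concurrency 1 makes the scrapes run one after another, here is a dependency-free sketch of the same idea. It is not how p-queue is implemented; SerialQueue, sleep and fakeScrape are hypothetical names, and fakeScrape only stands in for the real scrape(url).

```javascript
// Minimal serial queue: each task starts only after the previous one settles.
// Same idea as a promise queue with concurrency 1, not p-queue's actual code.
class SerialQueue {
  constructor() {
    // Promise that settles when the most recently queued task is done.
    this.tail = Promise.resolve();
  }

  add(task) {
    // Start the task once everything queued before it has finished.
    const result = this.tail.then(() => task());
    // Swallow the error on the internal chain so a failed task
    // does not block the tasks queued after it.
    this.tail = result.catch(() => {});
    // The caller still sees the task's own result or rejection.
    return result;
  }
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical stand-in for scrape(url): records when it starts and ends.
async function fakeScrape(url, log) {
  log.push("start " + url);
  await sleep(20);
  log.push("end " + url);
  return "title of " + url;
}
```

Adding several fakeScrape calls to one SerialQueue at the same time records each "end" entry before the next "start" entry, which is the behaviour the route above relies on.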

Result using curl:

Here is my git repo, which contains working code with a sample queue.

Warning

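If you use any such queue, you will run into problems once you are handling 100 results at the same time: requests to your API will keep timing out because 99 other URLs are still waiting in the queue. So you will have to learn more about real queues and concurrency later on.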
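One common way around those timeouts, sketched here under my own assumptions, is to answer the request immediately with a job id and let the client ask for the result later. submitJob and getJob are hypothetical helpers, and the in-memory Map only works within a single process; a persistent queue would replace it in a real setup.

```javascript
const crypto = require("crypto");

// In-memory job store. Assumption: one process, results may be lost on restart.
const jobs = new Map();

// Hypothetical helper: registers a job and returns its id right away.
// `run` is any function returning a promise,
// for example () => queue.add(() => scrape(url)).
function submitJob(run) {
  const id = crypto.randomBytes(8).toString("hex");
  jobs.set(id, { status: "pending" });
  // Promise.resolve().then(run) also turns a synchronous throw into a rejection.
  Promise.resolve()
    .then(run)
    .then(
      (result) => jobs.set(id, { status: "done", result }),
      (err) => jobs.set(id, { status: "failed", error: err.message })
    );
  return id;
}

// Hypothetical helper: returns the current state of a job.
function getJob(id) {
  return jobs.get(id) || { status: "unknown" };
}
```

A POST route could then respond with the id from submitJob, and a second route could return getJob(id), so no HTTP request has to stay open while 99 other URLs are processed.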

Once you understand how queues work, the other answers about cluster-puppeteer, rabbitMQ, bull queue and so on will help you :).

