Question

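I am trying to scrape data from pages using cheerio and request, in the following order:

1) Go to URL 1a (http://example.com/0)
2) Extract URL 1b (http://example2.com/52)
3) Go to URL 1b
4) Extract some data and save it
5) Go to URL 1a + 1 (http://example.com/1, let's call it 2a)
6) Extract URL 2b (http://example2.com/693)
7) Go to URL 2b
8) Extract some data and save it, and so on...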

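I'm struggling to work out how to do this. (Note: for this task I am only familiar with Node.js and cheerio/request, even though that may not be elegant, so I'm not looking for alternative libraries or languages, sorry.) I think I'm missing something, because I can't even picture how this would work.

Edit

Let me try this another way. Here is the first part of the code: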

    var request = require('request'),
        cheerio = require('cheerio');

    request('http://api.trove.nla.gov.au/result?key=6k6oagt6ott4ohno&zone=book&l-advformat=Thesis&sortby=dateDesc&q=+date%3A[2000+TO+2014]&l-availability=y&l-australian=y&n=1&s=0', function(error, response, html) {
        if (!error && response.statusCode == 200) {
            // Parse the response as XML rather than HTML
            var $ = cheerio.load(html, {
                xmlMode: true
            });

            var id = $('work').attr('id');
            // "total" is an attribute of <records>, not <record>
            var total = $('records').attr('total');
        }
    });

The first page returned looks like this:

<response>
  <query>date:[2000 TO 2014]</query>
  <zone name="book">
    <records s="0" n="1" total="69977" next="/result?l-advformat=Thesis&sortby=dateDesc&q=+date%3A%5B2000+TO+2014%5D&l-availability=y&l-australian=y&n=1&zone=book&s=1">
      <work id="189231549" url="/work/189231549">
        <troveUrl>http://trove.nla.gov.au/work/189231549</troveUrl>
        <title>
        Design of physiological control and magnetic levitation systems for a total artificial heart
        </title>
        <contributor>Greatrex, Nicholas Anthony</contributor>
        <issued>2014</issued>
        <type>Thesis</type>
        <holdingsCount>1</holdingsCount>
        <versionCount>1</versionCount>
        <relevance score="0.001961126">vaguely relevant</relevance>
        <identifier type="url" linktype="fulltext">http://eprints.qut.edu.au/65642/</identifier>
      </work>
    </records>
  </zone>
</response>

The URL above needs its s parameter incremented (s=0, s=1, and so on) "total" times. The 'id' needs to be fed into the URL of the second request, below:

request('http://api.trove.nla.gov.au/work/' + id + '?key=6k6oagt6ott4ohno&reclevel=full', function(error, response, html) {

    if (!error && response.statusCode == 200) {
        var $ = cheerio.load(html, {
          xmlMode: true
        });

        //extract data here etc.

    }
});

For example, when the first request returns id="189231549", the second page returned looks like this:

<work id="189231549" url="/work/189231549">
  <troveUrl>http://trove.nla.gov.au/work/189231549</troveUrl>
  <title>
    Design of physiological control and magnetic levitation systems for a total artificial heart
  </title>
  <contributor>Greatrex, Nicholas Anthony</contributor>
  <issued>2014</issued>
  <type>Thesis</type>
  <subject>Total Artificial Heart</subject>
  <subject>Magnetic Levitation</subject>
  <subject>Physiological Control</subject>
  <abstract>
    Total Artificial Hearts are mechanical pumps which can be used to replace the failing natural heart. This novel study developed a means of controlling a new design of pump to reproduce physiological flow bringing closer the realisation of a practical artificial heart. Using a mathematical model of the device, an optimisation algorithm was used to determine the best configuration for the magnetic levitation system of the pump. The prototype device was constructed and tested in a mock circulation loop. A physiological controller was designed to replicate the Frank-Starling like balancing behaviour of the natural heart. The device and controller provided sufficient support for a human patient while also demonstrating good response to various physiological conditions and events. This novel work brings the design of a practical artificial heart closer to realisation.
  </abstract>
  <language>English</language>
  <holdingsCount>1</holdingsCount>
  <versionCount>1</versionCount>
  <tagCount>0</tagCount>
  <commentCount>0</commentCount>
  <listCount>0</listCount>
  <identifier type="url" linktype="fulltext">http://eprints.qut.edu.au/65642/</identifier>
</work>

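So my question now is how to tie these two parts (the loops) together to get the result (downloading and parsing about 70,000 pages)?

I don't know how to code this in JavaScript for Node.js. I'm new to JavaScript.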

Answer 1

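You can learn how to do this by studying existing well-known website copiers (closed source or open source).

For example, use the trial version of http://www.tenmax.com/teleport/pro/home.htm to scrape your pages, then try the same with http://www.httrack.com. That should give you a clear idea of how they do it (and how you could do it).

The key programming concepts are the lookup cache and the task queue.

Recursion is not a winning approach if your solution needs to scale to several Node.js worker processes and to many pages.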
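
As a rough illustration of the lookup cache and task queue idea, here is a minimal sketch in plain Node.js. It is not taken from either of the tools mentioned; `crawl`, `fetchPage` and `extractLinks` are hypothetical names, and the usage part replaces real HTTP requests with an in-memory object so the example runs on its own.

```javascript
// Minimal sketch of a crawler driven by a task queue and a lookup cache.
// fetchPage(url) and extractLinks(body) are hypothetical helpers supplied
// by the caller; nothing here performs a real HTTP request.
async function crawl(startUrls, fetchPage, extractLinks) {
  const queue = [...startUrls];     // task queue: URLs still to visit
  const seen = new Set(startUrls);  // lookup cache: URLs already queued
  const pages = new Map();          // url -> downloaded body

  while (queue.length > 0) {
    const url = queue.shift();
    const body = await fetchPage(url);
    pages.set(url, body);

    for (const link of extractLinks(body)) {
      if (!seen.has(link)) {        // skip anything already queued or visited
        seen.add(link);
        queue.push(link);
      }
    }
  }
  return pages;
}

// Usage with an in-memory "site": each key is a page, each value its links.
const fakeSite = { a: ['b', 'c'], b: ['c'], c: ['a'] };
const fetchCalls = [];
const crawlResult = crawl(
  ['a'],
  async (url) => { fetchCalls.push(url); return fakeSite[url]; },
  (body) => body
);
```

Because the loop is iterative and all state lives in the queue and the set, the same structure can later be moved into a shared store and consumed by several worker processes, which a recursive design does not allow as easily.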

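Edit: after the clarifying comments

Before you start re-architecting your scraping engine into something more scalable, as a new Node.js developer you can start with the synchronous-style alternative to Node.js callback hell provided by the wait.for package, created by @lucio-m-tato.

The following code worked for me with the links you provided: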

var request = require('request');
var cheerio = require('cheerio');
var wait = require("wait.for");

function requestWaitForWrapper(url, callback) {
  request(url, function(error, response, html) {
    if (error)
      callback(error, response);
    else if (response.statusCode == 200)
      callback(null, html);
    else
      callback(new Error("Status not 200 OK"), response);
  });
}

function readBookInfo(baseUrl, s) {
  var html = wait.for(requestWaitForWrapper, baseUrl + '&s=' + s.toString());
  var $ = cheerio.load(html, {
    xmlMode: true
  });

  return {
    s: s,
    id: $('work').attr('id'),
    total: parseInt($('records').attr('total'), 10)
  };
}

function readWorkInfo(id) {
  var html = wait.for(requestWaitForWrapper, 'http://api.trove.nla.gov.au/work/' + id.toString() + '?key=6k6oagt6ott4ohno&reclevel=full');
  var $ = cheerio.load(html, {
    xmlMode: true
  });

  return {
    title: $('title').text(),
    contributor: $('contributor').text()
  }
}

function main() {
  var baseBookUrl = 'http://api.trove.nla.gov.au/result?key=6k6oagt6ott4ohno&zone=book&l-advformat=Thesis&sortby=dateDesc&q=+date%3A[2000+TO+2014]&l-availability=y&l-australian=y&n=1';
  var baseInfo = readBookInfo(baseBookUrl, 0);

  for (var s = 0; s < baseInfo.total; s++) {
    var bookInfo = readBookInfo(baseBookUrl, s);
    var workInfo = readWorkInfo(bookInfo.id);
    console.log(bookInfo.id + ";" + workInfo.contributor + ";" + workInfo.title);
  }
}

wait.launchFiber(main);
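
On current Node.js versions the same sequential flow can also be written with async/await instead of fibers. The sketch below is not part of the original answer. `scrapeAll` and every helper in `deps` are hypothetical names supplied by the caller; in real use `fetchText` would wrap an HTTP client and the two parse functions would use cheerio in xmlMode, while the usage part substitutes fake in-memory pages so the example runs without network access.

```javascript
// Sequential two-step scrape: search page s -> work id -> work page.
// All helpers in `deps` are hypothetical and supplied by the caller:
//   fetchText(url)   -> Promise<string> resolving to the response body
//   searchUrl(s)     -> URL of the s-th search result page
//   workUrl(id)      -> URL of the full work record
//   parseSearch(xml) -> { id, total }
//   parseWork(xml)   -> object holding the extracted fields
async function scrapeAll(deps, onRecord) {
  let total = Infinity; // unknown until the first search page is parsed
  for (let s = 0; s < total; s++) {
    const search = deps.parseSearch(await deps.fetchText(deps.searchUrl(s)));
    total = search.total;
    if (search.id === undefined) break; // no work on this page: stop
    const work = deps.parseWork(await deps.fetchText(deps.workUrl(search.id)));
    onRecord({ s, id: search.id, ...work });
  }
}

// Usage with fake pages (JSON stands in for the XML responses).
const fakePages = {
  'search/0': { id: '11', total: 2 },
  'search/1': { id: '22', total: 2 },
  'work/11': { title: 'First thesis' },
  'work/22': { title: 'Second thesis' },
};
const fakeDeps = {
  fetchText: async (url) => JSON.stringify(fakePages[url]),
  searchUrl: (s) => 'search/' + s,
  workUrl: (id) => 'work/' + id,
  parseSearch: (text) => JSON.parse(text),
  parseWork: (text) => JSON.parse(text),
};
const scraped = [];
const scrapeDone = scrapeAll(fakeDeps, (record) => scraped.push(record));
```

Reading `total` from the first page inside the loop avoids the extra request for s = 0 that the fiber version makes before its loop starts.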

Answer 2

You can additionally use the async module to handle multiple requests and to iterate over multiple pages. Read more about async here: https://github.com/caolan/async.
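
The answer gives no code. To illustrate the kind of bounded-concurrency iteration such a library is typically used for, here is a sketch in plain Node.js. It deliberately does not use the async module's API; `runLimited` and `task` are names made up for this example, and the usage part runs a fake task instead of real requests.

```javascript
// Run task(i) for i = 0..count-1 with at most `limit` tasks in flight.
// Results are returned in index order.
async function runLimited(count, limit, task) {
  const results = new Array(count);
  let next = 0; // next index to hand out; workers interleave only at await

  async function worker() {
    while (next < count) {
      const i = next++;
      results[i] = await task(i);
    }
  }

  const workers = [];
  for (let w = 0; w < Math.min(limit, count); w++) {
    workers.push(worker());
  }
  await Promise.all(workers);
  return results;
}

// Usage: a fake task that records how many copies run at the same time.
let inFlight = 0;
let maxInFlight = 0;
const limitedDone = runLimited(10, 3, async (i) => {
  inFlight++;
  maxInFlight = Math.max(maxInFlight, inFlight);
  await new Promise((resolve) => setTimeout(resolve, 5));
  inFlight--;
  return i * 2;
});
```

For the roughly 70,000 pages in the question, a small limit keeps the load on the remote API modest while still being faster than strictly one request at a time.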

Incremental and non-incremental URLs in Node.js with cheerio and request
