Question

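I'm looking for an example that requests a web page, waits for its JavaScript to render (that is, to modify the DOM), and then grabs the page's HTML.

This should be a simple example with an obvious PhantomJS use case. I can't find a decent one; the documentation seems to be all about command-line usage.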

Answer 1

From your comments, I'd guess you have two options:

Try to find a phantomjs node module - https://github.com/amir20/phantomjs-node
Run phantomjs as a child process inside node - http://nodejs.org/api/child_process.html

Edit:

It seems phantomjs itself suggests the child process as a way of interacting with node; see the FAQ - http://code.google.com/p/phantomjs/wiki/FAQ
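The child-process option might look roughly like the sketch below on the Node side. The helper name `runAndCapture` is made up for illustration, and the commented usage assumes a `phantomjs` binary on the PATH plus a hypothetical script file `render.js` that prints the page HTML to stdout.

```javascript
const { spawnSync } = require('child_process');

// Run a command synchronously and return everything it wrote to stdout.
// Throws if the process could not be started or exited with a non-zero code.
function runAndCapture(command, args) {
  const result = spawnSync(command, args, {
    encoding: 'utf8',
    maxBuffer: 10 * 1024 * 1024 // rendered pages can be large
  });
  if (result.error) {
    throw result.error;
  }
  if (result.status !== 0) {
    throw new Error(command + ' exited with code ' + result.status + ': ' + result.stderr);
  }
  return result.stdout;
}

// Hypothetical usage (render.js is an assumed PhantomJS script, not part of this answer):
// const html = runAndCapture('phantomjs', ['render.js', 'http://example.com']);
```

The synchronous call keeps the sketch short; in a server you would more likely use the asynchronous `spawn` so the event loop is not blocked while the page renders.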

Edit:

Sample PhantomJS script for getting the page's HTML markup:

var page = require('webpage').create();
page.open('http://www.google.com', function (status) {
    if (status !== 'success') {
        console.log('Unable to access network');
    } else {
        // The function passed to evaluate runs inside the page and
        // returns the markup of the current DOM
        var p = page.evaluate(function () {
            return document.getElementsByTagName('html')[0].innerHTML;
        });
        console.log(p);
    }
    phantom.exit();
});
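The script above reads the markup as soon as the page has loaded, so content that JavaScript adds afterwards may be missing. One way to wait is to poll until some condition holds, in the spirit of the `waitfor.js` example that ships with PhantomJS. What follows is only a sketch: the helper name `waitFor`, its parameter order, and the `#content` selector in the commented usage are all assumptions.

```javascript
// Poll testFn every intervalMs milliseconds. Call onReady() once testFn
// returns true, or onTimeout() if timeoutMs elapses first.
function waitFor(testFn, onReady, onTimeout, timeoutMs, intervalMs) {
  var start = Date.now();
  var timer = setInterval(function () {
    if (testFn()) {
      clearInterval(timer);
      onReady();
    } else if (Date.now() - start >= timeoutMs) {
      clearInterval(timer);
      onTimeout();
    }
  }, intervalMs);
}

// Hypothetical use inside the page.open callback of a PhantomJS script:
// waitFor(
//   function () {
//     return page.evaluate(function () {
//       // '#content' is an assumed selector for the element the page's JS creates
//       return document.querySelector('#content') !== null;
//     });
//   },
//   function () { console.log(page.content); phantom.exit(); },
//   function () { console.log('Timed out waiting for render'); phantom.exit(1); },
//   5000, 100
// );
```

Polling for a concrete condition tends to be more reliable than a fixed `setTimeout` delay, since the page can take a variable amount of time to finish rendering.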

Answer 2

With v2 of phantomjs-node, it's pretty easy to print the HTML after the page has been processed.

var phantom = require('phantom');

phantom.create().then(function(ph) {
  ph.createPage().then(function(page) {
    page.open('https://stackoverflow.com/').then(function(status) {
      console.log(status);
      page.property('content').then(function(content) {
        console.log(content);
        page.close();
        ph.exit();
      });
    });
  });
});

This prints the output as the browser would have rendered it.

Edit 2019:

You can use async/await:

const phantom = require('phantom');

(async function() {
  const instance = await phantom.create();
  const page = await instance.createPage();
  await page.on('onResourceRequested', function(requestData) {
    console.info('Requesting', requestData.url);
  });

  const status = await page.open('https://stackoverflow.com/');
  const content = await page.property('content');
  console.log(content);

  await instance.exit();
})();

Or, if you just want to test, you can use npx:

npx phantom@latest https://stackoverflow.com/

Answer 3

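I've used two different approaches in the past, including the page.evaluate() method of querying the DOM that Declan mentioned. The other way I've passed information out of the web page is to write it to console.log() from there, and in the phantomjs script use: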

page.onConsoleMessage = function (msg, line, source) {
  console.log('console [' + source + ':' + line + ']> ' + msg);
};

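I might also trap the variable msg in onConsoleMessage and search it for some encapsulated data. It depends on how you want to use the output.

Then, in the Node.js script, you have to scan the output of the PhantomJS script: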

var spawn = require('child_process').spawn;

// args: arguments for the phantomjs binary (script path, URL, ...)
// done: callback invoked as done(err, output); both names are illustrative
var yourfunc = function (args, done) {
  var phantom = spawn('phantomjs', args);
  var output = '';
  phantom.stdout.setEncoding('utf8');
  phantom.stdout.on('data', function (data) {
    // This handler fires one or more times, so collect the chunks and
    // parse them for whatever info you're expecting from the browser
    output += data;
  });
  phantom.stderr.on('data', function (data) {
    // do something with error data
  });
  // 'close' fires after the stdio streams have ended, so output is complete
  phantom.on('close', function (code) {
    if (code !== 0) {
      done(new Error('phantomjs exited with code ' + code));
    } else {
      // clean exit: hand the collected output to the passed-in callback
      done(null, output);
    }
  });
};
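Because the data handler can fire several times, a payload may arrive split across chunks. One possible way to deal with that is to have the PhantomJS side wrap its data in markers and buffer on the Node side until a complete payload shows up. The marker strings and the `createCollector` helper below are invented for this sketch, not part of PhantomJS or Node.

```javascript
// Hypothetical markers the PhantomJS script would print around its payload.
var START = '<<<DATA';
var END = 'DATA>>>';

// Returns an object that buffers stdout chunks and extracts marked payloads.
function createCollector() {
  var buffer = '';
  return {
    // Append one chunk and return every payload that is now complete.
    push: function (chunk) {
      buffer += chunk;
      var payloads = [];
      while (true) {
        var s = buffer.indexOf(START);
        if (s === -1) break;
        var e = buffer.indexOf(END, s + START.length);
        if (e === -1) break; // payload not finished yet, wait for more chunks
        payloads.push(buffer.slice(s + START.length, e));
        buffer = buffer.slice(e + END.length); // drop what was consumed
      }
      return payloads;
    }
  };
}

// Hypothetical wiring into the stdout handler:
// var collector = createCollector();
// phantom.stdout.on('data', function (data) {
//   collector.push(data).forEach(function (payload) { /* handle payload */ });
// });
```

Output that never contains a start marker stays in the buffer here; a longer-running process would want to trim it.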

Hope that helps.

Answer 4

Why not just use this?

var page = require('webpage').create();
page.open("http://example.com", function (status)
{
    if (status !== 'success') 
    {
        console.log('FAIL to load the address');            
    } 
    else 
    {
        console.log('Success in fetching the page');
        console.log(page.content);
    }
    phantom.exit();
});

Answer 5

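A late update, in case anyone else stumbles on this question:

A GitHub project developed by a colleague of mine aims to help with exactly this: https://github.com/vmeurisse/phantomCrawl.

It is still fairly young and certainly lacks some documentation, but the provided example should help with basic crawling.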

Answer 6

Here's an old version I use that runs node, express and phantomjs and saves the page as a .png. You could adapt it fairly quickly to get the HTML instead.

https://github.com/wehrhaus/sitescrape.git

Save and render a webpage with PhantomJS and node.js
