Question

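I am trying to extract all the text content of a page (because it doesn't work with Simpledomparser).

I tried modifying this simple example from the manual: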

var page = require('webpage').create();
console.log('The default user agent is ' + page.settings.userAgent);
page.settings.userAgent = 'SpecialAgent';
page.open('http://www.httpuseragent.org', function (status) {
    if (status !== 'success') {
        console.log('Unable to access network');
    } else {
        var ua = page.evaluate(function () {
            return document.getElementById('myagent').textContent;
        });
        console.log(ua);
    }
    phantom.exit();
});

I tried changing

return document.getElementById('myagent').textContent;

to

return document.textContent;

but this does not work.

What is the right way to do this?

Answer 1

This version of the script should return the entire content of the page:

var page = require('webpage').create();
page.settings.userAgent = 'SpecialAgent';
page.open('http://www.httpuseragent.org', function (status) {
    if (status !== 'success') {
        console.log('Unable to access network');
    } else {
        var ua = page.evaluate(function () {
            return document.getElementsByTagName('html')[0].outerHTML;
        });
        console.log(ua);
    }
    phantom.exit();
});

Answer 2

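There are several ways to retrieve the page content as a string:

page.content gives the complete source, including the markup (<html>) and the doctype (<!DOCTYPE html>),
document.documentElement.outerHTML (via page.evaluate) gives the complete source, including the markup (<html>), but without the doctype,
document.documentElement.textContent (via page.evaluate) gives the cumulative text content of the complete document, including inline CSS & JavaScript, but without markup,
document.documentElement.innerText (via page.evaluate) gives the cumulative text content of the complete document, excluding inline CSS & JavaScript and without markup.

document.documentElement can be swapped for an element or query of your choice.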
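
The four options above can be compared side by side with a small sketch. The helper name extractAll is hypothetical and not part of the PhantomJS API, and the URL is simply the one from the question. The guard on the global phantom object is an assumption made so that the helper can also be loaded outside PhantomJS.

```javascript
// Hypothetical helper (not part of the PhantomJS API): gathers the four
// string variants from an already loaded page object.
function extractAll(page) {
    return {
        // Full source, including markup and doctype.
        content: page.content,
        // Full source including markup, but without the doctype.
        outerHTML: page.evaluate(function () {
            return document.documentElement.outerHTML;
        }),
        // Cumulative text, including inline CSS and JavaScript.
        textContent: page.evaluate(function () {
            return document.documentElement.textContent;
        }),
        // Cumulative text without inline CSS and JavaScript.
        innerText: page.evaluate(function () {
            return document.documentElement.innerText;
        })
    };
}

// Run the demo only under PhantomJS, where the global `phantom` exists.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.open('http://www.httpuseragent.org', function (status) {
        if (status !== 'success') {
            console.log('Unable to access network');
        } else {
            var result = extractAll(page);
            console.log(result.innerText);
        }
        phantom.exit();
    });
}
```

Printing result.content or result.outerHTML instead shows the difference the doctype makes; which variant is appropriate depends on whether markup is wanted.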

Answer 3

To extract the text content of a web page, you can try return document.body.textContent;, but I'm not sure whether the result will be usable.
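
For context, textContent is null on the document node itself, which is why the attempt in the question returns nothing, whereas document.body is an element and does yield text. A minimal sketch of this approach follows. The whitespace clean-up is an assumption about what might make the raw result easier to use, and normalizeWhitespace is a hypothetical helper name, not a PhantomJS function.

```javascript
// Hypothetical helper: collapse runs of whitespace (spaces, tabs, newlines)
// into single spaces and strip leading/trailing whitespace.
function normalizeWhitespace(text) {
    return text.replace(/\s+/g, ' ').trim();
}

// Run the demo only under PhantomJS, where the global `phantom` exists.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.open('http://www.httpuseragent.org', function (status) {
        if (status !== 'success') {
            console.log('Unable to access network');
        } else {
            // Read the text of the <body> element inside the page context.
            var text = page.evaluate(function () {
                return document.body.textContent;
            });
            console.log(normalizeWhitespace(text));
        }
        phantom.exit();
    });
}
```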

Answer 4

I came across this question while trying to solve a similar problem, and ended up adapting the solution from a related question, as follows:

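var fs = require('fs');

// Open the HTML file for reading.
var file_h = fs.open('header.html', 'r');
var header = "";

// Append every line, including the first one, until the end of the file.
// (The earlier version read the first line before the loop and discarded it.)
while (!file_h.atEnd()) {
    header += file_h.readLine();
}
console.log(header);

file_h.close();
phantom.exit();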

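This gave me a string containing the HTML file that was read in, which was good enough for my purposes. Hopefully it helps others who come across this question.

The question seems ambiguous (is it the entire content of the file, or just the text, a.k.a. the "strings"?), so this is one possible solution.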

Extracting HTML and text with PhantomJS