From the docs:
So I tried this code:
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('http://personalitycore.com/a.html');
  let p = (await page.$x('/html/body/p'))[0]
  console.log("Var[p] Class: " + p.constructor.name)
  console.log("Var[p] Tag: " + await p.evaluate(e => e.tagName, p))
  let spans = await p.$x('/*')
  for (var i = 0; i < spans.length; i++) {
    console.log("Var[spans] Tag: " + await spans[i].evaluate(e => e.tagName, spans[i]))
    console.log("Var[spans] Text: " + await spans[i].evaluate(e => e.textContent, spans[i]))
  }
  await browser.close();
})();
The HTML at http://personalitycore.com/a.html is:
<head>
</head>
<body>
<p>
text_node1
<span>span_node1</span>
text_node2
<span>span_node2</span>
</p>
</body>
Result:
/usr/local/bin/node example.js
Var[p] Class: ElementHandle
Var[p] Tag: P
Var[spans] Tag: HTML
Var[spans] Text:
text_node1
span_node1
text_node2
span_node2
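I'm confused. According to the docs, p is an ElementHandle, and evaluating the XPath /* on it should give [TextNode, Span, TextNode, Span].
But it returns the whole page, with the tag HTML!
So, my question: how do I evaluate /* on p?

Answer 0 (score: 1)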
You only need to add the context-node notation (a dot) to the XPath: './*'. Without it, '/*' means "all child elements of the document", that is, the html element.
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch();
const html = `
<!doctype html>
<html>
<head>
</head>
<body>
<p>
text_node1
<span>span_node1</span>
text_node2
<span>span_node2</span>
</p>
</body>
</html>`;

try {
  const page = await browser.newPage();
  await page.goto('http://personalitycore.com/a.html');
  const [p] = await page.$x('/html/body/p');
  console.log("Var[p] Class: " + p.constructor.name);
  console.log("Var[p] Tag: " + await p.evaluate(e => e.tagName, p));
  const spans = await p.$x('./*');
  for (let i = 0; i < spans.length; i++) {
    console.log("Var[spans] Tag: " + await spans[i].evaluate(e => e.tagName, spans[i]));
    console.log("Var[spans] Text: " + await spans[i].evaluate(e => e.textContent, spans[i]));
  }
} catch (err) {
  console.error(err);
} finally {
  await browser.close();
}
Output:
Var[p] Class: ElementHandle
Var[p] Tag: P
Var[spans] Tag: SPAN
Var[spans] Text: span_node1
Var[spans] Tag: SPAN
Var[spans] Text: span_node2
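
If XPath strings come from elsewhere and are often written in absolute form, the "add a dot" rule above can be applied mechanically before the expression is passed to p.$x. The following is only a minimal sketch: toContextRelativeXPath is a hypothetical helper name, not part of Puppeteer, and it assumes a simple location path (it does not handle unions with | or parenthesized expressions).

```javascript
// Hypothetical helper, not a Puppeteer API: rewrites an absolute XPath
// so that it is evaluated relative to the context node instead of the
// document root.
function toContextRelativeXPath(expr) {
  const trimmed = expr.trim();
  // A leading "/" anchors the path at the document root; prefixing "."
  // anchors it at the context node. Paths that are already relative
  // (for example "./*" or "span") are returned unchanged.
  return trimmed.startsWith('/') ? '.' + trimmed : trimmed;
}

console.log(toContextRelativeXPath('/*'));     // ./*
console.log(toContextRelativeXPath('//span')); // .//span
console.log(toContextRelativeXPath('./*'));    // ./*
console.log(toContextRelativeXPath('span'));   // span
```

With this, p.$x(toContextRelativeXPath('/*')) would be the same call as p.$x('./*') in the answer. Note also that in XPath * matches element nodes only, which is why the output lists just the two spans; the text nodes the question expected would be matched by './node()' or './text()' instead.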