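I am using node.js and puppeteer to get some data. ... But the data is rendered without any row or cell elements such as td. If I copy the outer HTML of the target, I get the following: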
<pre>
<strong>Date Filed File ID Code Company Name</strong>
<hr>
08-24-2018 <a href="/117-index.html">ABC/A</a> <a href="url;id=777">777</a> Company A
08-24-2018 <a href="/007-index.html">ABC/A</a> <a href="url;id=612">612</a> Company B
08-24-2018 <a href="/750-index.html">ABC/A</a> <a href="url;id=619">619</a> Company C
<hr>
</pre>
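How can I get the data from these 4 columns (column 1: Date Filed, column 2: File, column 3: ID Code, column 4: Company Name)?
In the dev tools I see it displayed like this: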
<pre>
<strong>Date Filed File ID Code Company Name</strong>
<hr>
08-24-2018
<a href="/117-index.html">ABC/A</a>
<a href="url;id=777">777</a>
Company A 08-24-2018
<a href="/007-index.html">ABC/A</a>
<a href="url;id=612">612</a>
Company B 08-24-2018
<a href="/750-index.html">ABC/A</a>
<a href="url;id=619">619</a>
Company C
<hr>
</pre>
..., and when I click on it, it looks like this:
<pre>
<strong>Date Filed File ID Code Company Name</strong>
<hr>
08-24-2018
<a href="/117-index.html">ABC/A</a>
<a href="url;id=777">777</a>
Company A
08-24-2018
<a href="/007-index.html">ABC/A</a>
<a href="url;id=612">612</a>
Company B
08-24-2018
<a href="/750-index.html">ABC/A</a>
<a href="url;id=619">619</a>
Company C
<hr>
</pre>
When I log the number of links, I get 6 ....
app.js
const puppeteer = require('puppeteer');
const fs = require('fs-extra');
(async function main() {
  try {
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();
    await page.goto('url', {waitUntil: 'load'});
    const table = await page.waitForSelector('body table');
    const rows = await page.$$('body table pre a');
    console.log(rows.length);
    ...
  } catch (e) {
    console.log('our error', e);
  }
})();
But how do I get this data properly?
Edit: OuterHTML
const pre = await page.$('body table pre');
const preVal = await page.evaluate( pre => pre.outerHTML, pre );
console.log(preVal);
<pre><strong>Date Filed File ID Code Company Name</strong><hr>08-24-2018 <a href="/117-index.html">ABC</a> <a href="url;id=777">777</a> Company A
08-24-2018 <a href="/007-index.html">ABC</a> <a href="url;id=612">612</a> Company B
08-24-2018 <a href="/750-index.html">ABC</a> <a href="url;id=619">619</a> Company C
<hr></pre>
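Answer 0: (score: 2)
Using the first snippet, you can extract the data with the following approach. Note that it splits each row on runs of two or more spaces, so it assumes the columns in the page's actual markup are separated that way: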
const result = await page.evaluate( () =>
{
  // Take the text between the two <hr> tags and process it one line (row) at a time.
  return document.getElementsByTagName( 'pre' )[0].innerHTML.split( '<hr>' )[1].trim().split( '\n' ).map( element =>
  {
    const parser = new DOMParser();
    // Columns are assumed to be separated by two or more spaces.
    const cells = element.trim().split( / {2,}/ );
    // Insert the text of each link right after its raw <a> cell.
    cells.splice( 2, 0, parser.parseFromString( cells[1], 'text/html' ).getElementsByTagName( 'a' )[0].textContent );
    cells.splice( 4, 0, parser.parseFromString( cells[3], 'text/html' ).getElementsByTagName( 'a' )[0].textContent );
    return {
      'date_filed'   : cells[0],
      'file'         : cells[1],
      'file_text'    : cells[2],
      'id_code'      : cells[3],
      'id_code_text' : cells[4],
      'company_name' : cells[5]
    };
  });
});
console.log( result[0].date_filed ); // 08-24-2018
console.log( result[1].date_filed ); // 08-24-2018
console.log( result[2].date_filed ); // 08-24-2018
console.log( result[0].file ); // <a href="/117-index.html">ABC/A</a>
console.log( result[1].file ); // <a href="/007-index.html">ABC/A</a>
console.log( result[2].file ); // <a href="/750-index.html">ABC/A</a>
console.log( result[0].file_text ); // ABC/A
console.log( result[1].file_text ); // ABC/A
console.log( result[2].file_text ); // ABC/A
console.log( result[0].id_code ); // <a href="url;id=777">777</a>
console.log( result[1].id_code ); // <a href="url;id=612">612</a>
console.log( result[2].id_code ); // <a href="url;id=619">619</a>
console.log( result[0].id_code_text ); // 777
console.log( result[1].id_code_text ); // 612
console.log( result[2].id_code_text ); // 619
console.log( result[0].company_name ); // Company A
console.log( result[1].company_name ); // Company B
console.log( result[2].company_name ); // Company C
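If the spacing between columns turns out to be inconsistent, a variation that may be worth trying is to match each row with a regular expression instead of splitting on spaces. The following is only a minimal sketch, not part of the approach above: the function name `parsePreRows`, the `sample` string, and the `file_href` / `id_code_href` keys are made up for illustration, and it assumes every data row has a date, exactly two links, and then the company name.

```javascript
// Hypothetical helper: turns the innerHTML of the <pre> element into one
// object per data row. Assumes each row is "date <a>..</a> <a>..</a> name".
function parsePreRows(preInnerHtml) {
  // Keep only the part between the first and the second <hr>.
  const body = preInnerHtml.split(/<hr\s*\/?>/i)[1] || '';

  // Groups: date, file href, file text, id href, id text, company name.
  const rowPattern = /^(\S+)\s+<a\b[^>]*href="([^"]*)"[^>]*>([^<]*)<\/a>\s+<a\b[^>]*href="([^"]*)"[^>]*>([^<]*)<\/a>\s+(.*)$/;

  return body
    .split('\n')
    .map(line => line.trim())
    .filter(line => line.length > 0)
    .map(line => {
      const match = rowPattern.exec(line);
      if (match === null) {
        return null; // not a data row, skip it
      }
      const [, dateFiled, fileHref, fileText, idHref, idText, companyName] = match;
      return {
        date_filed: dateFiled,
        file_href: fileHref,
        file_text: fileText,
        id_code_href: idHref,
        id_code_text: idText,
        company_name: companyName.trim()
      };
    })
    .filter(row => row !== null);
}

// Sample input modelled on the markup shown in the question.
const sample = [
  '<strong>Date Filed File ID Code Company Name</strong><hr>08-24-2018 <a href="/117-index.html">ABC/A</a> <a href="url;id=777">777</a> Company A',
  '08-24-2018 <a href="/007-index.html">ABC/A</a> <a href="url;id=612">612</a> Company B',
  '08-24-2018 <a href="/750-index.html">ABC/A</a> <a href="url;id=619">619</a> Company C',
  '<hr>'
].join('\n');

const rows = parsePreRows(sample);
console.log(rows.length);          // 3
console.log(rows[0].id_code_text); // 777
console.log(rows[2].company_name); // Company C
```

In Puppeteer, the input string could be obtained with something like `await page.$eval('body table pre', pre => pre.innerHTML)` and then passed to the helper in Node. Keep in mind that `innerHTML` escapes characters such as `&` as `&amp;`, so company names containing them would need decoding afterwards.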