Node.js puppeteer - Fetching data that has no wrapping elements

Date: 2018-08-25 14:47:01

Tags: node.js puppeteer

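I'm using node.js and puppeteer to fetch some data. ...But the data is rendered without row elements such as th or td. If I copy the outer HTML of the target, I get the following: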

<pre>
    <strong>Date Filed   File        ID Code     Company Name</strong>
    <hr>
    08-24-2018   <a href="/117-index.html">ABC/A</a>      <a href="url;id=777">777</a>   Company A
    08-24-2018   <a href="/007-index.html">ABC/A</a>      <a href="url;id=612">612</a>   Company B
    08-24-2018   <a href="/750-index.html">ABC/A</a>      <a href="url;id=619">619</a>   Company C
    <hr>
</pre>

How can I get the data from these 4 columns (column 1: Date Filed, column 2: File, column 3: ID Code, and column 4: Company Name)?

In dev tools, I see it displayed like this:

<pre>
    <strong>Date Filed File ID Code Company Name</strong>
    <hr>
    08-24-2018   
    <a href="/117-index.html">ABC/A</a>      

    <a href="url;id=777">777</a>   
    Company A 08-24-2018   
    <a href="/007-index.html">ABC/A</a>      

    <a href="url;id=612">612</a>   
    Company B 08-24-2018   
    <a href="/750-index.html">ABC/A</a>      

    <a href="url;id=619">619</a>   
    Company C
    <hr>
</pre>

... and when I click on it, it looks like this:

<pre>
    <strong>Date Filed File ID Code Company Name</strong>
    <hr>
    08-24-2018&nbsp;&nbsp;&nbsp;   
    <a href="/117-index.html">ABC/A</a>      
    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
    <a href="url;id=777">777</a>   
    &nbsp;&nbsp;&nbsp;Company A 
    08-24-2018&nbsp;&nbsp;&nbsp;   
    <a href="/007-index.html">ABC/A</a>      
    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
    <a href="url;id=612">612</a>   
    &nbsp;&nbsp;&nbsp;Company B 
    08-24-2018&nbsp;&nbsp;&nbsp;
    <a href="/750-index.html">ABC/A</a>      
    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
    <a href="url;id=619">619</a>   
    &nbsp;&nbsp;&nbsp;Company C
    <hr>
</pre>

When I log the number of links, I get 6.

app.js:

const puppeteer = require('puppeteer');
const fs = require('fs-extra');

(async function main() {
  try {

    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();

    await page.goto('url', {waitUntil: 'load'});

    const table = await page.waitForSelector('body table');
    const rows = await page.$$('body table pre a');
    console.log(rows.length);

    ...


  } catch (e) {
    console.log('our error', e);
  }

})();

But how do I fetch this data properly?

Edit: OuterHTML

const pre = await page.$('body table pre');
const preVal = await page.evaluate( pre => pre.outerHTML, pre );
console.log(preVal);

<pre><strong>Date Filed   File        ID Code     Company Name</strong><hr>08-24-2018   <a href="/117-index.html">ABC</a>      <a href="url;id=777">777</a>   Company A
08-24-2018   <a href="/007-index.html">ABC</a>      <a href="url;id=612">612</a>   Company B
08-24-2018   <a href="/750-index.html">ABC</a>      <a href="url;id=619">619</a>   Company C
<hr></pre>

1 answer:

Answer 0 (score: 2):

With the first snippet, you can extract the data using the following:

const result = await page.evaluate( () =>
{
    // Take the markup between the two <hr> elements and split it into one string per row.
    return document.getElementsByTagName( 'pre' )[0].innerHTML.split( '<hr>' )[1].trim().split( '\n' ).map( element =>
    {
        const parser = new DOMParser();

        // Columns are separated by runs of two or more spaces.
        const cells  = element.trim().split( / {2,}/ );

        // Insert the link text of each anchor right after its raw HTML cell.
        cells.splice( 2, 0, parser.parseFromString( cells[1], 'text/html' ).getElementsByTagName( 'a' )[0].textContent );
        cells.splice( 4, 0, parser.parseFromString( cells[3], 'text/html' ).getElementsByTagName( 'a' )[0].textContent );

        return {
            'date_filed'   : cells[0],
            'file'         : cells[1],
            'file_text'    : cells[2],
            'id_code'      : cells[3],
            'id_code_text' : cells[4],
            'company_name' : cells[5]
        };
    });
});

console.log( result[0].date_filed );   // 08-24-2018
console.log( result[1].date_filed );   // 08-24-2018
console.log( result[2].date_filed );   // 08-24-2018

console.log( result[0].file );         // <a href="/117-index.html">ABC/A</a>
console.log( result[1].file );         // <a href="/007-index.html">ABC/A</a>
console.log( result[2].file );         // <a href="/750-index.html">ABC/A</a>

console.log( result[0].file_text );    // ABC/A
console.log( result[1].file_text );    // ABC/A
console.log( result[2].file_text );    // ABC/A

console.log( result[0].id_code );      // <a href="url;id=777">777</a>
console.log( result[1].id_code );      // <a href="url;id=612">612</a>
console.log( result[2].id_code );      // <a href="url;id=619">619</a>

console.log( result[0].id_code_text ); // 777
console.log( result[1].id_code_text ); // 612
console.log( result[2].id_code_text ); // 619

console.log( result[0].company_name ); // Company A
console.log( result[1].company_name ); // Company B
console.log( result[2].company_name ); // Company C
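
If you would rather keep the parsing on the Node side (for example, to unit-test it without a browser), a rough sketch could look like the one below. It assumes each row has the shape shown in the question (date, two links, company name) and that the links serialize with a double-quoted href. The function name parsePreRows and the key names file_href and id_code_href are my own, not part of Puppeteer.

```javascript
// Hypothetical helper (not a Puppeteer API): turns the innerHTML of the <pre>
// block into one object per data row. Assumes the row layout from the question.
function parsePreRows(preInnerHtml) {
  // date, file link (href + text), ID link (href + text), then the company name
  const rowPattern = new RegExp(
    '^\\s*(\\d{2}-\\d{2}-\\d{4})\\s+' +
    '<a\\b[^>]*\\bhref="([^"]*)"[^>]*>([^<]*)</a>\\s+' +
    '<a\\b[^>]*\\bhref="([^"]*)"[^>]*>([^<]*)</a>\\s+' +
    '(.*?)\\s*$'
  );

  return preInnerHtml
    .split(/\r?\n|<hr>/)                        // a newline or an <hr> ends a row
    .map(line => line.replace(/&nbsp;/g, ' '))  // treat &nbsp; padding as spaces
    .map(line => rowPattern.exec(line))
    .filter(match => match !== null)            // skips the header and blank lines
    .map(([, dateFiled, fileHref, fileText, idHref, idText, companyName]) => ({
      date_filed: dateFiled,
      file_href: fileHref,
      file_text: fileText,
      id_code_href: idHref,
      id_code_text: idText,
      company_name: companyName   // other HTML entities (e.g. &amp;) stay encoded
    }));
}

// Sample input copied from the outerHTML in the question (first two rows).
const sampleHtml =
  '<strong>Date Filed   File        ID Code     Company Name</strong><hr>' +
  '08-24-2018   <a href="/117-index.html">ABC</a>      <a href="url;id=777">777</a>   Company A\n' +
  '08-24-2018   <a href="/007-index.html">ABC</a>      <a href="url;id=612">612</a>   Company B\n' +
  '<hr>';

console.log(parsePreRows(sampleHtml));
```

With Puppeteer, the input string could come from something like `await page.$eval('body table pre', el => el.innerHTML)`. Since each row carries two links, this also explains why counting the `a` elements for three rows gives 6.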