Scrape text as rendered in the browser

Time: 2019-02-15 23:01:14

Tags: javascript web-scraping innertext

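I am trying to extract text from HTML using the innerText property, for example: console.log(document.getElementById('row').innerText)

However, the output is not laid out the way I see it in the browser.

The difference arises because the table elements in the first case have inline-block styling (see below).

How can I fix this so that I get the text in the same layout as it is displayed in the browser?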

Case 1: Input:

Promise

Expected output:

Promise

Actual output:


Case 2: Input:

<html>
   <body id='test'>
      <table style="display: inline-block">
         <tr>
            <td>1</td>
         </tr>
         <tr>
            <td>2</td>
         </tr>
      </table>
      <table style="display: inline-block">
         <tr>
            <td>3</td>
         </tr>
         <tr>
            <td>4</td>
         </tr>
      </table>
   </body>
</html>

Expected output:

1 3
2 4

Actual output:

1
2
3
4

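1 Answer:

Answer 0: (score: 0)

Although it seems there should be a simpler way, the DOM has no notion of visual order, so you may have to transpose the values manually, for example: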

    // Populates domOrder from the DOM (note: these example selectors are fragile)
    const domOrder = [];
    // Uses the spread operator to get an array of tables
    [...document.querySelectorAll("table")]
      .filter(table => table.style.display === "inline-block")
      .forEach(table => {
        // table.rows yields the <tr> elements even when the browser wraps
        // them in an implicit <tbody>, so no splitting on newlines is needed
        domOrder.push([...table.rows].map(tr => tr.innerText.trim()));
      });
    // Populates visibleOrder by transposing domOrder:
    // each visible line takes one value from every table
    const visibleOrder = (domOrder[0] || []).map((_, rowNum) =>
      domOrder.map(tableRows => tableRows[rowNum])
    );
    console.log(visibleOrder);

    <!-- HTML used by the script above -->
    <table style="display: inline-block">
       <tr><td>1</td></tr>
       <tr><td>2</td></tr>
    </table>
    <table style="display: inline-block">
       <tr><td>3</td></tr>
       <tr><td>4</td></tr>
    </table>
    <table style="display: inline-block">
       <tr><td>5</td></tr>
       <tr><td>6</td></tr>
    </table>

See also more robust examples of matrix transposition.
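
The transposition step can also be separated from the DOM access, which makes it easier to test on its own. The following is only a minimal sketch: the `transpose` helper and its `fill` parameter are hypothetical names introduced here, not part of the original answer, and the sketch assumes each table's cell texts have already been collected into one array per table (as `domOrder` is above).

```javascript
// Hypothetical helper (not from the original answer): transposes a 2D array.
// Short rows are padded with `fill` so tables of unequal height
// do not leave undefined holes in the result.
function transpose(matrix, fill = "") {
  // Width of the result is the length of the longest input row
  const width = Math.max(0, ...matrix.map(row => row.length));
  return Array.from({ length: width }, (_, col) =>
    matrix.map(row => (col < row.length ? row[col] : fill))
  );
}

// One inner array per table, in DOM order (assumed to be collected already)
const domOrder = [["1", "2"], ["3", "4"], ["5", "6"]];
const visibleOrder = transpose(domOrder);

// Joins each visual line with spaces to approximate the rendered layout
const text = visibleOrder.map(line => line.join(" ")).join("\n");
console.log(text);
```

Joining with a single space is an assumption about how the side-by-side tables should be separated; a tab or another delimiter may suit the target format better.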