Question

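How can I loop through all headings in an HTML document and wrap each one in a div with a unique id, using node.js?

I can't use a plain regex replace, because the div ids have to be unique.

Cheerio seems to be the best framework for web scraping in node.js, but I don't see a way to handle this use case with it.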

Answer 1

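Well, as I understand it, you want to wrap every heading (h1-h6) in a div whose ID is also stored in an array (or something similar).

You can certainly use cheerio (see the solution at the bottom), but I think this can be done with a RegEx with about the same effort.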

// I define the HTML in a simple constant for now.
// Use it for both solutions.
const html = `
<!doctype html>
<html>
  <head>
    <meta charset="utf-8" />
    <title>Text</title>
  </head>

  <body>
    <div class="content">
      <h1>Hello world</h1>

      <p>Lorem Ipsum</p>

      <h2>This is a small HTML example</h2>
    </div>
  </body>
</html>
`;

First, the RegEx solution:

// Use html-constant from above!
function convertHeadlines( html ) {
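  // Match h1-h6, allow attributes on the opening tag, and require the same closing tag (\1).
  // The original, simpler pattern is explained at https://regex101.com/r/jNjbXh/1
  const r = /<(h[1-6])\b[^>]*>[\s\S]*?<\/\1>/gi;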
  const ids = [];
  // Replace every match and wrap it with a new DIV.
  const output = html.replace( r, ( match ) => {
    const newId = `headline${ ids.length + 1 }`;
    ids.push( newId );
    return `<div id="${ newId }">${ match }</div>`;
  } );

  return {
    ids,
    output,
  };
}

const result = convertHeadlines( html );
console.log( result );

This produces an object that gives you all the IDs and the new HTML.
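If you would rather have descriptive IDs than headline1, headline2, and so on, the same replace-with-a-callback idea can be adapted. The following is only a rough sketch: `slugify` and `wrapHeadlinesWithSlugIds` are names I made up for illustration, and the slug rules (ASCII letters and digits only, `-2`, `-3` suffixes on collisions) are assumptions, not anything the question asked for.

```javascript
// Hypothetical helper: strip tags from the heading content and build a URL-friendly slug.
function slugify( headingHtml ) {
  return headingHtml
    .replace( /<[^>]+>/g, '' )      // drop nested tags such as <em>
    .trim()
    .toLowerCase()
    .replace( /[^a-z0-9]+/g, '-' )  // collapse everything else into "-"
    .replace( /^-+|-+$/g, '' );     // trim leading/trailing "-"
}

// Hypothetical variant of the RegEx approach: IDs come from the heading text,
// and a numeric suffix is appended when a slug has already been used.
function wrapHeadlinesWithSlugIds( html ) {
  const pattern = /<(h[1-6])\b[^>]*>([\s\S]*?)<\/\1>/gi;
  const used = new Set();
  const ids = [];
  const output = html.replace( pattern, ( match, tag, inner ) => {
    // Fall back to a generic base if the text yields an empty slug (e.g. non-ASCII text).
    const base = slugify( inner ) || 'headline';
    let id = base;
    let n = 2;
    while ( used.has( id ) ) {
      id = `${ base }-${ n }`;
      n += 1;
    }
    used.add( id );
    ids.push( id );
    return `<div id="${ id }">${ match }</div>`;
  } );
  return { ids, output };
}

const sample = '<h1>Hello world</h1><p>x</p><h2>Hello world</h2>';
console.log( wrapHeadlinesWithSlugIds( sample ).ids );
// → [ 'hello-world', 'hello-world-2' ]
```

The `Set` only tracks IDs generated in this run, so IDs that already exist elsewhere in the document are not taken into account.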

And here is the solution using cheerio, which follows a similar approach:

// Use html-constant from above!
const cheerio = require( 'cheerio' );
function convertHeadlinesWithCheerio( html ) {
  const $ = cheerio.load( html );
  const headlines = $( 'h1, h2, h3, h4, h5, h6' );
  const ids = [];
  headlines.each( function ( i, elem ) {
    const newId = `headline${ ids.length + 1 }`;
    ids.push( newId );
    $( this ).wrap( `<div id="${ newId }"></div>` );
  } );

  return {
    ids,
    output: $.html(),
  };
}

// Named differently from `result` above so both snippets can run in the same file.
const cheerioResult = convertHeadlinesWithCheerio( html );
console.log( cheerioResult );
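One caveat with either approach: the generated IDs are only unique among themselves, so they could still clash with IDs that were already in the HTML. A quick sanity check on the output might look like the sketch below. `findDuplicateIds` is a made-up helper, and it scans with a regular expression rather than a real parser, so treat it as a rough check only (it would also pick up text that merely looks like an id attribute).

```javascript
// Hypothetical helper: return every id value that appears more than once in the markup.
function findDuplicateIds( html ) {
  const seen = new Set();
  const duplicates = new Set();
  // The leading \s keeps attributes such as data-id from matching.
  for ( const m of html.matchAll( /\sid\s*=\s*(["'])(.*?)\1/gi ) ) {
    const id = m[ 2 ];
    if ( seen.has( id ) ) {
      duplicates.add( id );
    }
    seen.add( id );
  }
  return [ ...duplicates ];
}

// Example: a wrapper div that reuses an id already present on the heading.
console.log( findDuplicateIds( '<div id="headline1"><h1 id="headline1">A</h1></div>' ) );
// → [ 'headline1' ]
```

If this returns a non-empty array for the `output` of either function, picking a different prefix than `headline` is probably the simplest fix.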
