Replacing HTML text with Cheerio in Node.js

Asked: 2018-05-23 10:48:26

Tags: html node.js cheerio

I want to replace every occurrence of a word in structured HTML with a piece of markup.

For example, given HTML like this:

<p>
    Lorem ipsum dolor sit amet, consectetur adipiscing elit. Fusce porttitor, magna nec sollicitudin varius, ligula nisi finibus nulla, vel posuere libero erat eu tortor.
</p>
<p>
    <ul>
        <li>Lorem</li>
        <li>ipsum</li>
        <li>dolor</li>
        <li>sit</li>
        <li>amet</li>
    </ul>
</p>
<p>
    Lorem <b>ipsum</b> <span><em>dolor</em></span> sit amet, consectetur adipiscing elit.
</p>

I want to replace every occurrence of the word "ipsum" with this markup:

<a href="https://www.google.com/search?q=ipsum">ipsum</a>

I tried a very simple solution:

const $ = cheerio.load(lorem_ipsum_html);
let words = $.text().trim().split(' ');
for (let t in words) {
    let res = words[t];
    if (words[t] == 'ipsum') res = '<a href="https://www.google.com/search?q=ipsum">ipsum</a>';
    $.html().replace(words[t], res);
}
return $.html();  

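The function returns the HTML unchanged, even though the replacement looks valid. On top of that, I also tried porting a couple of jQuery implementations, such as: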

Replace text with HTML element

Using .replace to replace text with HTML?

No luck.

3 answers:

Answer 0 (score: 0)

1 - Load the body with cheerio

var $ = cheerio.load(body);

2 - Use this recursive function to replace the target in every element and its children

function replacer($, text) {
    if ($(text).children().length) {
        $(text).children().each(function (itm) {
            return replacer($, $(this));
        });
    }
    else {
        var value = $(text).text();
        value = value.replace(/ipsum/g, '<a href="https://www.google.com/search?q=ipsum">ipsum</a>');
        return $(text).text(value);
    }
}

3 - Use this to convert the cheerio DOM nodes back to HTML

return $.html(bb);

4 - Replace every &lt;, &gt; and &quot; with the actual character

f(b).replace(/&lt;/g,'<').replace(/&gt;/g, '>').replace(/&quot;/g, '"')
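One caveat with step 4, shown here as a minimal sketch in plain JavaScript: un-escaping the whole output also un-escapes entities that were legitimately escaped in the source text. The `serialized` string is a made-up example of what such output could look like, not actual cheerio output.

```javascript
// Hypothetical serialized output: "&lt;b&gt;" was escaped text in the source document,
// while the escaped anchor comes from calling .text() with markup in the value.
const serialized =
  'use &lt;b&gt; for bold: &lt;a href=&quot;https://www.google.com/search?q=ipsum&quot;&gt;ipsum&lt;/a&gt;';

// Same chain of replacements as in step 4.
const unescaped = serialized
  .replace(/&lt;/g, '<')
  .replace(/&gt;/g, '>')
  .replace(/&quot;/g, '"');

console.log(unescaped);
// use <b> for bold: <a href="https://www.google.com/search?q=ipsum">ipsum</a>
```

The anchor is restored, but the literal `<b>` that was meant to be displayed as text has become a real tag, so this step is only safe when the input contains no escaped markup of its own.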

I hope this helps. Just adapt the code to your needs.

var b = `<p>
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Fusce porttitor, magna nec sollicitudin varius, ligula nisi finibus nulla, vel posuere libero erat eu tortor.
</p>
<p>
<ul>
    <li>Lorem</li>
    <li>ipsum</li>
    <li>dolor</li>
    <li>sit</li>
    <li>amet</li>
</ul>
</p>
<p>
Lorem <b>ipsum</b> <span><em>dolor</em></span> sit amet, consectetur adipiscing elit.
</p>`;

var cheerio = require('cheerio');

function replacer($, text) {
  if ($(text).children().length) {
    $(text).children().each(function(itm) {
      return replacer($, $(this));
    });
  } else {
    var value = $(text).text();
    value = value.replace(/ipsum/g, '<a href="https://www.google.com/search?q=ipsum">ipsum</a>');
    return $(text).text(value);
  }
}

function f(body) {
  var $ = cheerio.load(body);
  var bb = $("p").each(function(itm) {
    return replacer($, $(this));
  });
  return $.html(bb);
}

console.log(f(b).replace(/&lt;/g, '<').replace(/&gt;/g, '>').replace(/&quot;/g, '"'))

Output:

<p>
  Lorem <a href="https://www.google.com/search?q=ipsum">ipsum</a> dolor sit amet, consectetur adipiscing elit. Fusce porttitor, magna nec sollicitudin varius, ligula nisi finibus nulla, vel posuere libero erat eu tortor.
</p>
<p>
  <ul>
    <li>Lorem</li>
    <li><a href="https://www.google.com/search?q=ipsum">ipsum</a></li>
    <li>dolor</li>
    <li>sit</li>
    <li>amet</li>
  </ul>
</p>
<p>
  Lorem <b><a href="https://www.google.com/search?q=ipsum">ipsum</a></b> <span><em>dolor</em></span> sit amet, consectetur adipiscing elit.

Answer 1 (score: 0)

I ended up with this (not very clean) solution. It is not the best thing in the world, but it works. There is still room for improvement.

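let $ = cheerio.load(lorem_ipsum_html);
const link = '<a href="https://www.google.com/search?q=ipsum">ipsum</a>';
// Replace every whole-word "ipsum" in the serialized HTML in one pass, then re-parse.
// Replacing word by word would also hit the "ipsum" inside an href inserted earlier,
// and redeclaring $ with `let` inside a loop body would throw a ReferenceError.
// Caveat: this also matches "ipsum" inside tag names or attributes of the input.
$ = cheerio.load($.html().replace(/\bipsum\b/g, link));
return $.html();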

In this case the HTML structure stays the same and the anchor tags are simply injected in the right places.

<p>
    Lorem <a href="https://www.google.com/search?q=ipsum">ipsum</a> dolor sit amet, consectetur adipiscing elit. Fusce porttitor, magna nec sollicitudin varius, ligula nisi finibus nulla, vel posuere libero erat eu tortor.
</p>
<p>
    <ul>
        <li>Lorem</li>
        <li><a href="https://www.google.com/search?q=ipsum">ipsum</a></li>
        <li>dolor</li>
        <li>sit</li>
        <li>amet</li>
    </ul>
</p>
<p>
    Lorem <b><a href="https://www.google.com/search?q=ipsum">ipsum</a></b> <span><em>dolor</em></span> sit amet, consectetur adipiscing elit.
</p>
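
For context on why the loop in the question returned unchanged HTML: `$.html()` returns a new string on every call, and `String.prototype.replace` returns yet another new string without modifying anything, so the result has to be kept and parsed again. A minimal sketch, with a plain string standing in for `$.html()` (the `html` and `updated` names are illustrative):

```javascript
const html = '<p>Lorem ipsum dolor</p>';
const link = '<a href="https://www.google.com/search?q=ipsum">ipsum</a>';

// The return value is discarded, so nothing changes.
html.replace('ipsum', link);
console.log(html); // <p>Lorem ipsum dolor</p>

// Keeping the return value is what makes the replacement visible.
const updated = html.replace('ipsum', link);
console.log(updated);
```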

Answer 2 (score: 0)

Clean solution:

Here is code that does it by iterating over all DOM text nodes:

const $ = require('cheerio').load(inputHtml);

// Recursively collect every text node below a cheerio selection.
const getTextNodes = (elem) =>
    elem.contents().toArray()
        .filter(el => el !== undefined) // skip undefined entries (unclear why they appear)
        .reduce((acc, el) =>
            acc.concat(el.type === 'text' ? [el] : getTextNodes($(el))), []);

const replaceRegex = /ipsum/g;
const replacementTag = `<a href="https://www.google.com/search?q=ipsum">ipsum</a>`;

// $.html(node) serializes the text node; replaceWith() parses the result as HTML,
// so the injected <a> becomes a real element instead of escaped text.
getTextNodes($(`html`))
    .filter(node => $.html(node).match(replaceRegex))
    .forEach(node => $(node).replaceWith($.html(node).replace(replaceRegex, replacementTag)));

console.log($.html());

Output:

<html><head></head><body><p>
    Lorem <a href="https://www.google.com/search?q=ipsum">ipsum</a> dolor sit amet, consectetur adipiscing elit. Fusce porttitor, magna nec sollicitudin varius, ligula nisi finibus nulla, vel posuere libero erat eu tortor.
</p>
<p>
    </p><ul>
        <li>Lorem</li>
        <li><a href="https://www.google.com/search?q=ipsum">ipsum</a></li>
        <li>dolor</li>
        <li>sit</li>
        <li>amet</li>
    </ul>
<p></p>
<p>
    Lorem <b><a href="https://www.google.com/search?q=ipsum">ipsum</a></b> <span><em>dolor</em></span> sit amet, consectetur adipiscing elit.
</p></body></html>
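
Note that `/ipsum/g` also matches inside longer words such as "lipsum". If only whole words should be linked, the regex and the replacement could be built from the word, as in this sketch. `escapeRegExp`, `wordRegex`, and `linkFor` are hypothetical helper names, not part of cheerio, and the sketch assumes the word consists of letters, digits, or underscores so that `\b` behaves as expected.

```javascript
// Escape regex metacharacters so the word is matched literally.
const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Match the word only when it stands alone.
const wordRegex = (word) => new RegExp(`\\b${escapeRegExp(word)}\\b`, 'g');

// Build the search link; encodeURIComponent keeps the query string valid.
const linkFor = (word) =>
  `<a href="https://www.google.com/search?q=${encodeURIComponent(word)}">${word}</a>`;

const text = 'ipsum lipsum ipsum.';
// A function replacement avoids special handling of "$" patterns in the link string.
const result = text.replace(wordRegex('ipsum'), () => linkFor('ipsum'));
console.log(result);
```

`wordRegex('ipsum')` and `linkFor('ipsum')` could then stand in for `replaceRegex` and `replacementTag` in the code above.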

Original answer here