Question

I'm trying to use jsdom to get a description from an article. The article's HTML is:

<p><img src="http://localhost/bibi_cms/cms/app/images/upload_photo/1506653694941.png" 
style="width: 599.783px; height: 1066px;"></p>
<p>testestestestestestestest<br></p>

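Here is my Node.js code for getting the description from the content. It seems to take the text from the first p tag, so it prints an empty string. I only want the content of a p tag that doesn't contain an image. Can anyone help me solve this?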

const dom = new JSDOM(results[i].content.toString());
if (dom.window.document.querySelector("p") !== null)
results[i].description = dom.window.document.querySelector("p").textContent;

Answer 1

Ideally you would test against Node.TEXT_NODE, but for some reason that threw an error for me in Node.js (I'm using gulp here only for testing purposes):

const gulp = require("gulp");
const fs = require('fs');

const jsdom = require("jsdom");
const { JSDOM } = jsdom;

const html = 'yourHTML.html';

gulp.task('default', ['getText']);

gulp.task('getText', function () {

  var dirty;
  dirty = fs.readFileSync(html, 'utf8');

  const dom = new JSDOM(dirty);
  const pList = dom.window.document.querySelectorAll("p");

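  pList.forEach(function (el, index, list) {

    // firstElementChild is null when the paragraph contains only text
    const first = el.firstElementChild;
    console.log("p.firstElementChild.nodeName : " + (first ? first.nodeName : "(none)"));

    // skip paragraphs whose first child element is an image
    if (!first || first.nodeName !== "IMG") {
      console.log(el.textContent);
    }
  });

  return;
})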

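So the key is the test

el.firstElementChild.nodeName !== "IMG"

which works if you know that either an img tag or text comes right after the opening p tag. In your case, the firstElementChild.nodeName of the paragraph you want is actually a br tag, but I assume a br won't always be there at the end of the text.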
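
For paragraphs where you can't rely on the first child element, here is a minimal sketch of the text-node test mentioned at the top of this answer. The helper names hasDirectText and firstParagraphWithText are made up for illustration, and the constant 3 is the value the DOM standard assigns to Node.TEXT_NODE. One plausible cause of the error I ran into (an assumption, since I didn't keep the message) is that Node is not a global in Node.js; with jsdom the constant is available as dom.window.Node.TEXT_NODE.

```javascript
// Numeric value of Node.TEXT_NODE in the DOM standard (hard-coded here so the
// sketch does not depend on a global Node object, which Node.js does not have).
const TEXT_NODE = 3;

// Hypothetical helper: true if the element has at least one direct
// text-node child containing something other than whitespace.
function hasDirectText(el) {
  return Array.from(el.childNodes).some(
    (node) => node.nodeType === TEXT_NODE && node.nodeValue.trim() !== ""
  );
}

// Hypothetical helper: trimmed text of the first <p> that has direct text,
// or "" if no paragraph qualifies.
function firstParagraphWithText(document) {
  for (const p of document.querySelectorAll("p")) {
    if (hasDirectText(p)) {
      return p.textContent.trim();
    }
  }
  return "";
}
```

With jsdom you would call it as firstParagraphWithText(dom.window.document). Note that it only looks at direct children, so a paragraph whose text is wrapped entirely in another element (for example a b tag) would be skipped.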

You can also test for an empty string, à la:

  if (el.textContent.trim() !== "") {}  // you may want to trim() that for spaces
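
Applied to the code in the question, the empty-string test could look roughly like this. It's a sketch rather than a drop-in fix: extractDescription is a hypothetical helper name, and it assumes the description should be the text of the first p that has any non-whitespace text.

```javascript
// Hypothetical helper: returns the trimmed text of the first <p> whose
// textContent is not empty, or "" if every paragraph is empty.
function extractDescription(document) {
  for (const p of document.querySelectorAll("p")) {
    const text = p.textContent.trim();
    if (text !== "") {
      return text;
    }
  }
  return "";
}
```

In the question's loop that would be results[i].description = extractDescription(dom.window.document). A paragraph that holds only an img has an empty textContent, so it gets skipped. A paragraph that holds an image plus some text would still be picked up.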
