Question

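I am building a Chrome extension that adds links to web pages based on matches against certain regular expressions. I use jQuery to get all the text nodes inside the body tag, as shown below, then match the regular expressions and add links where necessary: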

$('*', 'body').contents().filter(function() {
  return this.nodeType === 3
}).each(function() {
  regexMatchFn($(this), $(this).text());
});

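This works for text that is wrapped in a tag somewhere inside the body of the HTML page. However, on some of the pages I am testing, the text sits directly in the body without a wrapping tag, and the approach above does not capture it.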

Here is an example of markup that gives me trouble:

<body>
  text-not-captured
  <p>text-captured</p>
  <p>text-captured</p>
</body>

What is the best way to capture the uncaptured text in a scenario like this?

Answer 1

Actually, just doing this:

$("body").text()

will get all the text inside the body, without the tags.

Be careful, though: this also includes the text inside <script> tags, which is probably not what you want.

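If you want to leave out the contents of script tags, you can do this instead ($.parseHTML discards scripts unless its keepScripts argument is true):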

var all = $("body").html();
console.log($($.parseHTML(all)).text());
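
Both snippets above return a single string, which is enough for searching but not for inserting links next to the matched text. As a rough sketch of a jQuery-free alternative, the function below walks the tree and returns the text nodes themselves, including text that sits directly in the body, while skipping script and style elements. The names `collectTextNodes` and `SKIPPED_TAGS` are hypothetical, and the function only assumes that each node exposes the standard `nodeType`, `nodeName`, `nodeValue` and `childNodes` properties.

```javascript
// Node type values from the DOM spec (Node.ELEMENT_NODE and Node.TEXT_NODE).
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Elements whose text should not be treated as page content (an assumption).
const SKIPPED_TAGS = new Set(['SCRIPT', 'STYLE']);

// Hypothetical helper: returns every non-blank text node under `root`,
// including text nodes that are direct children of `root` itself.
function collectTextNodes(root) {
  const found = [];
  for (const child of Array.from(root.childNodes || [])) {
    if (child.nodeType === TEXT_NODE) {
      // Ignore whitespace-only nodes such as indentation between tags.
      if (child.nodeValue.trim() !== '') {
        found.push(child);
      }
    } else if (
      child.nodeType === ELEMENT_NODE &&
      !SKIPPED_TAGS.has(String(child.nodeName).toUpperCase())
    ) {
      // Recurse into ordinary elements, in document order.
      found.push(...collectTextNodes(child));
    }
  }
  return found;
}
```

In the extension this could be called as `collectTextNodes(document.body)`, passing each returned node and its `nodeValue` to the `regexMatchFn` from the question.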

Answer 2

I am not sure exactly what you are looking for; is this what you had in mind?

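I may be a bit rusty on jQuery's add(), but I think you just need to get the contents of the body element and then filter for text nodes. After that, you can add() all the other elements except script tags: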

$('body').contents().filter(function() {
  return this.nodeType === 3 && this.wholeText.replace(/\s+/g, '') !== ''
}).add('body *:not(script)').each(function() {
  console.log($(this).text());
});

<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>

<body>
  text-not-captured
  <p>text-captured</p>
  <p>text-captured</p>
</body>
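
Once a text node is in hand, the remaining step from the question is turning regex matches into links. The sketch below covers only the splitting part: it breaks a string into plain and matched segments, so that each matched segment can later be wrapped in a link. The helper name `splitByPattern` and the `isMatch` field are hypothetical, and the pattern used in the test call is just an illustration.

```javascript
// Hypothetical helper: split `text` into plain and matched segments so the
// caller can turn each matched segment into a link element.
function splitByPattern(text, pattern) {
  // matchAll requires the global flag; add it while keeping the other flags.
  const flags = pattern.flags.includes('g') ? pattern.flags : pattern.flags + 'g';
  const re = new RegExp(pattern.source, flags);
  const segments = [];
  let last = 0;
  for (const m of text.matchAll(re)) {
    if (m[0] === '') continue; // empty matches produce nothing to link
    if (m.index > last) {
      segments.push({ text: text.slice(last, m.index), isMatch: false });
    }
    segments.push({ text: m[0], isMatch: true });
    last = m.index + m[0].length;
  }
  // Keep whatever plain text follows the final match.
  if (last < text.length) {
    segments.push({ text: text.slice(last), isMatch: false });
  }
  return segments;
}
```

Each segment with `isMatch` set to true would become a link, each other segment a plain text node, and the resulting list would replace the original text node (for example with jQuery's replaceWith()).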

Answer 3

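This code uses a regular expression to select only the text that sits directly in the body tag. It includes neither the text inside script tags nor the text in child elements. I don't know the rest of your code, but this should help.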

// remove the scripts from the page  
$("body > script").remove();

// regex match only text in the body tag
var requiredText = document.body.innerHTML.match(/(\w+)(?![^<]*>|[^<>]*<\/)/igm);

console.dir(requiredText);

Example fiddle: https://jsfiddle.net/mikeferrari/wrfwo5mu/
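
To see what the pattern above returns without a browser, here is a small sketch that applies it to a plain string. The `sampleHtml` string is an assumption standing in for document.body.innerHTML, built from the markup in the question. Note that the result is a list of words rather than nodes, and that `\w+` splits the text at hyphens.

```javascript
// Stand-in for document.body.innerHTML, using the markup from the question.
const sampleHtml = [
  '',
  '  text-not-captured',
  '  <p>text-captured</p>',
  '  <p>text-captured</p>',
  ''
].join('\n');

// Same pattern as above: a run of word characters that is neither inside a
// tag nor followed by a closing tag with no other tag in between.
const topLevelWords = sampleHtml.match(/(\w+)(?![^<]*>|[^<>]*<\/)/igm) || [];

console.log(topLevelWords); // [ 'text', 'not', 'captured' ]
```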

jQuery - Find text that is not wrapped in a tag

3 answers: