Question

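Update: for those who marked this question as a duplicate: the text I am searching for may be contained in a single element, or it may be spread across 100 elements. I don't know which before I search. All I know is that the words in the pattern I'm searching for come from this HTML. So I need a search that skips (but remembers) the HTML/JavaScript that may be interleaved with the text I'm looking for.

I hope this explanation helps in finding an answer to my question.

*********** End of update ***************

I'm looking for a library or a piece of code that can search for and locate arbitrary plain text within an HTML document (start/stop offsets or markers).

Example:

Pattern to find: "text that I'm looking for"
HTML document: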

<html>...<p>text that <b>I'm</b/> <span>looking
   for<div>...</div>...</p>

Resulting match:

text that <b>I'm</b/> <span>looking for

Does anyone know of such a utility? Thanks.

Answer 1

Edit: I did some actual programming. The algorithm accepts HTML tags between characters, and HTML tags and whitespace between words.

const haystack = '<html>This, <b>that</b>, and\nthe<i>other</i>.</html>';
const needle = 'This, that, and the other.';

// Escape regex metacharacters so needle characters such as '.' match literally
const escapeRegex = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Make a regex from the needle...
let regex = '';

// ...split the needle into words...
const words = needle.trim().split(/\s+/);
for (let i = 0; i < words.length; i++) {
  const word = words[i];

  // ...allow HTML tags after each character except the last one in a word
  // ([^>] is used instead of . so that a tag may span a line break)...
  for (let j = 0; j < word.length - 1; j++) {
    regex += escapeRegex(word.charAt(j)) + '(<[^>]+>)*';
  }
  regex += escapeRegex(word.charAt(word.length - 1));

  // ...allow a mixture of whitespace and HTML tags after each word except the last one
  if (i < words.length - 1) regex += '(\\s|<[^>]+>)+';
}

// Find the match, if any
const matches = haystack.match(regex);
console.log(matches);

// Report results
if (matches) {
  const match = matches[0];
  const offset = matches.index;

  console.log('Found match!');
  console.log('Offset: ' + offset);
  console.log('Length: ' + match.length);
  console.log(match);
}
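
Since the question asks for start/stop offsets, the same idea can be wrapped in a function that returns them. The following is only a minimal sketch, not a tested library: `findInHtml` and `escapeRegex` are hypothetical names introduced here, and it assumes that tags never contain a literal `>` inside attribute values. It does not treat comments, script bodies, or HTML entities specially.

```javascript
// Hypothetical helper: escape regex metacharacters so they match literally.
function escapeRegex(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: find plain-text `needle` in `html`, tolerating tags
// between characters and tags/whitespace between words.
// Returns { start, end, match } (end is exclusive), or null if nothing matches.
function findInHtml(html, needle) {
  const TAG = '<[^>]+>'; // assumption: no '>' inside attribute values
  const words = needle.trim().split(/\s+/).filter((w) => w.length > 0);
  if (words.length === 0) return null;

  // Within a word: optional tags between consecutive characters.
  const wordPatterns = words.map((word) =>
    Array.from(word).map(escapeRegex).join('(?:' + TAG + ')*')
  );
  // Between words: at least one whitespace character or tag.
  const pattern = wordPatterns.join('(?:\\s|' + TAG + ')+');

  const m = new RegExp(pattern).exec(html);
  if (!m) return null;
  return { start: m.index, end: m.index + m[0].length, match: m[0] };
}

// Usage with markup modelled on the example from the question.
const html =
  "<html><p>text that <b>I'm</b> <span>looking\n   for<div></div></p></html>";
const result = findInHtml(html, "text that I'm looking for");
console.log(result);
```

The returned offsets index into the original HTML string, so `html.slice(result.start, result.end)` gives back the matched span including any interleaved tags.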
