Question

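I have a plugin that looks through a page's HTML and replaces text. With the current implementation, however, text inside script tags is also caught by the search, which breaks the scripts on the affected pages.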

var pageText = document.body.innerHTML;
document.body.innerHTML = pageText.replace(regextgoeshere);

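I did my best to filter this out through my regex pattern, but I need to figure out how to skip all script tags.

Is there a way to skip all script tags when fetching the innerHTML?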
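
To make the failure concrete, here is a minimal sketch that works on a plain string instead of a live page. The markup and the pattern are made up for illustration; they are not taken from the plugin.

```javascript
// Hypothetical page markup: the word "color" appears in visible text
// and inside a script.
var pageText = '<p>color</p><script>el.style.color = "red";</script>';

// Replacing across the whole serialized HTML also rewrites the script source.
var replaced = pageText.replace(/color/g, 'colour');

console.log(replaced);
```

The script now assigns to `el.style.colour`, which is not what it was written to do. That is the kind of breakage described above.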

Answer 1

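I think we tend to think in terms of elements and forget about nodes! This problem, however, is best solved by thinking in nodes.

Alex from Australia has the best solution: http://blog.alexanderdickson.com/javascript-replacing-text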

function myRecursiveSearch(node /* , ..... */) {

    // Elements whose contents must never be rewritten.
    var excludeElements = ['script', 'style', 'iframe', 'canvas'];

    var child = node.firstChild;

    if (child == null)
        return;

    do {
        switch (child.nodeType) {

        case 1: // element node
            if (excludeElements.indexOf(child.tagName.toLowerCase()) > -1) {
                continue; // jumps to the while condition, i.e. the next sibling
            }

            myRecursiveSearch(child /* , ..... */);
            break;

        case 3: // text node
            child.nodeValue = doTranslit(child.nodeValue /* , ..... */);
            break;

        }

    } while (child = child.nextSibling);

}


function doTranslit(strtext /* , ..... */) {
    // ..... (the actual replacement logic goes here)
    return strtext;
}
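
For readers who want something they can paste and adapt, here is a minimal sketch of the same node-walking idea with the elided parameters filled in by assumption. The name `replaceInTextNodes` and its `pattern` and `replacement` parameters are hypothetical and do not come from the linked post.

```javascript
// Tag names whose contents are left untouched.
var EXCLUDED = ['script', 'style', 'iframe', 'canvas'];

// Hypothetical helper: replaces `pattern` with `replacement` in every text
// node under `node`, without descending into excluded elements.
function replaceInTextNodes(node, pattern, replacement) {
    for (var child = node.firstChild; child; child = child.nextSibling) {
        if (child.nodeType === 1) { // element node
            if (EXCLUDED.indexOf(child.tagName.toLowerCase()) === -1) {
                replaceInTextNodes(child, pattern, replacement);
            }
        } else if (child.nodeType === 3) { // text node
            child.nodeValue = child.nodeValue.replace(pattern, replacement);
        }
    }
}

// In a browser: replaceInTextNodes(document.body, /foo/g, 'bar');
```

Because only `nodeValue` of text nodes is assigned, the elements themselves are never re-parsed or recreated, unlike an assignment to `innerHTML`.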

Answer 2

Edit: I misunderstood your requirements.

If you want something more sophisticated, you can try Douglas Crockford's walkTheDOM function:

function walkTheDOM(node, func) {
    func(node);
    node = node.firstChild;
    while (node) {
        walkTheDOM(node, func);
        node = node.nextSibling;
    }
}

You can use the tagName property of node to skip <script> elements.
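
A minimal sketch of how the tagName check could be combined with walkTheDOM. Since walkTheDOM always descends into children, the check here is made on each text node's parent. The helper name `replaceOutsideScripts` and its parameters are assumptions for illustration.

```javascript
// Crockford's walker: calls func on the node and on all of its descendants.
function walkTheDOM(node, func) {
    func(node);
    node = node.firstChild;
    while (node) {
        walkTheDOM(node, func);
        node = node.nextSibling;
    }
}

// Hypothetical helper: rewrite text nodes unless their parent is a <script>.
function replaceOutsideScripts(root, pattern, replacement) {
    walkTheDOM(root, function (node) {
        if (node.nodeType !== 3) return; // only text nodes carry the text
        var parent = node.parentNode;
        if (parent && parent.tagName && parent.tagName.toLowerCase() === 'script') {
            return; // leave script source alone
        }
        node.nodeValue = node.nodeValue.replace(pattern, replacement);
    });
}

// In a browser: replaceOutsideScripts(document.body, /foo/g, 'bar');
```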

Answer 3

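Perhaps your best option is to use querySelectorAll and negate the unwanted elements, then replace textContent rather than innerHTML. By using innerHTML you risk breaking the document markup.

This is a cross-browser solution.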

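// Select every element except those whose text should not be touched.
var matches = document.querySelectorAll("*:not(html):not(head):not(script):not(meta):not(link)");
console.log(matches);
[].forEach.call(matches, function(elem) {
  // Assigning textContent/innerText replaces all of an element's children,
  // so only rewrite elements that have no child elements.
  if (elem.children.length > 0) return;
  // Prefer innerText where available, otherwise fall back to textContent.
  var text = ('innerText' in elem) ? 'innerText' : 'textContent';
  elem[text] = elem[text].replace("this", "works");
});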

http://jsfiddle.net/m6qhuesv/

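Note 1: the HTML, HEAD, META and LINK tags do not allow their textContent to be modified.

Note 2: innerText is an IE-proprietary feature (it also works in Chrome). The W3C defines textContent as the official property.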

Answer 4

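Not tested, but you could try this.

var pageText = document.body.innerHTML;
// Strip every <script>...</script> block; [\s\S] also matches newlines.
var mypagewithoutScriptTag = pageText.replace(/<script\b[^>]*>[\s\S]*?<\/script>/gi, '');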
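
If the goal is to keep the scripts rather than strip them, one possible variation is to split the HTML string on script blocks and run the replacement only on the parts in between. This is a sketch under assumptions, not a tested solution: `replaceOutsideScriptBlocks` and its parameters are hypothetical names, and it still treats HTML as a string, so tag names and attributes outside scripts can still match the pattern.

```javascript
// Matches a whole <script>...</script> block, including multi-line bodies.
var SCRIPT_BLOCK = /(<script\b[^>]*>[\s\S]*?<\/script>)/gi;

// Hypothetical helper: apply the replacement only to the parts of the HTML
// string that lie outside script blocks.
function replaceOutsideScriptBlocks(html, pattern, replacement) {
    // Because the regex has a capturing group, split() keeps the script
    // blocks in the result, at the odd indices.
    return html.split(SCRIPT_BLOCK).map(function (part, i) {
        return i % 2 === 1 ? part : part.replace(pattern, replacement);
    }).join('');
}

console.log(replaceOutsideScriptBlocks(
    '<p>foo</p><script>var foo;</script><b>foo</b>', /foo/g, 'bar'));
```

Writing the result back to document.body.innerHTML rebuilds the whole body, so event listeners attached to the old nodes are lost. The node-based answers avoid that.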

How to access innerHTML but ignore <script> tags
