Question

I need to remove all page content from an HTML file and keep only the HTML markup.

Can this be done with a regular expression or with JavaScript?

Before:

<html>
<head>
<title>Ask a Question - Stack Overflow</title>
<link rel="shortcut icon" href="//cdn.sstatic.net/stackoverflow/img/favicon.ico">
<script type="text/javascript">
document.write("Code remains un-touched");
</script>
</head>
<body class="ask-page new-topbar">
<div id="first">ONE</div>
<div id="sec">TWO</div>
<div id="third">THREE</div>
</body>
</html>

After:

<html>
<head>
<title></title>
<link rel="shortcut icon" href="//cdn.sstatic.net/stackoverflow/img/favicon.ico">
<script type="text/javascript">
document.write("Code remains un-touched");
</script>
</head>
<body class="ask-page new-topbar">
<div id="first"></div>
<div id="sec"></div>
<div id="third"></div>
</body>
</html>

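Update: I need to use the HTML markup afterwards. Once the page content has been removed, the HTML should be displayed. In the end, it is the HTML code I am interested in.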

Answer 1

A simple recursive function would work:

(function removeTextNodes(el) {
  Array.apply([], el.childNodes).forEach(function (child) {
    if (child.nodeType === 3 && el.nodeName !== 'SCRIPT') {
      // remove the text node
      el.removeChild(child);
    }
    else if (child.nodeType === 1) {
      // call recursive for child nodes
      removeTextNodes(child);
    }
  });
})(document.documentElement);

To quote Amadan: just use document.documentElement.outerHTML to get the HTML as a string.
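
As a rough sketch of the same idea in reusable form, the function below takes any root node and returns how many text nodes it removed. The name stripTextNodes and the skipTags parameter are illustrative and not part of the answer above; skipping STYLE as well as SCRIPT is an added assumption, on the grounds that stylesheet rules are stored as text nodes too.

```javascript
// Hypothetical helper: removes text nodes below `root`, leaving the markup intact.
// `skipTags` names elements whose text is code rather than page content (assumption).
function stripTextNodes(root, skipTags) {
  skipTags = skipTags || ['SCRIPT', 'STYLE'];
  var removed = 0;
  // Normalize case: nodeName is uppercase in HTML documents but not in XML ones.
  var parentName = String(root.nodeName).toUpperCase();
  // Copy childNodes first, because the live list changes as nodes are removed.
  Array.prototype.slice.call(root.childNodes).forEach(function (child) {
    if (child.nodeType === 3) {                 // 3 === Node.TEXT_NODE
      if (skipTags.indexOf(parentName) === -1) {
        root.removeChild(child);
        removed += 1;
      }
    } else if (child.nodeType === 1) {          // 1 === Node.ELEMENT_NODE
      removed += stripTextNodes(child, skipTags);
    }
  });
  return removed;
}
```

In a browser, calling stripTextNodes(document.documentElement) and then reading document.documentElement.outerHTML would give the stripped markup as a string. Whitespace-only text nodes (the line breaks between tags) are removed as well, so the result comes out on fewer lines than the "After" example in the question.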

Answer 2

I think something like this should work:

$('*').each(function() {
  $(this).contents().filter(function() {
    return this.nodeType == 3 && this.parentNode.nodeName != 'SCRIPT';
  }).remove();
});

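It iterates over all elements, looks at each element's child nodes, and removes the ones that are text nodes not inside a script element.

You can test it on this very page :P

(Yoshi's jQuery-free script is faster, but this one is shorter to write :P)

Edit: nodeName is uppercase. Oops.

Edit for the OP's edit: afterwards, this gets the source code: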

$('html')[0].outerHTML

You can display it with:

$('body').text($('html')[0].outerHTML)

Edit again: also, if you want it jQuery-free, you can use document.documentElement.outerHTML instead (which is faster and better). That works with Yoshi's solution too.
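
For completeness, the display step can also be written without jQuery. This is only a sketch: the name showSource is made up for illustration, and it assumes that setting textContent is an acceptable stand-in for jQuery's .text(), since both insert the string as literal text rather than parsing it as markup.

```javascript
// Hypothetical helper: replaces the page body with the document's own markup, shown as text.
// `doc` is expected to be a Document (for example, window.document).
function showSource(doc) {
  var html = doc.documentElement.outerHTML;
  // textContent inserts the string literally, so the tags are displayed, not rendered.
  doc.body.textContent = html;
  return html;
}
```

Run after the text nodes have been removed, showSource(document) would leave the page showing its own stripped HTML.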
