Question

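I'm trying to build a string of a web page's content without the HTML syntax (probably replacing it with a space, so the words aren't all joined together) or punctuation.

So say you have the code: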

    <body>
    <h1>Content:</h1>
    <p>paragraph 1</p>
    <p>paragraph 2</p>

    <script> alert("blah blah blah"); </script>

    This is some text<br />
    ....and some more
    </body>

I'd want to return the string:

    var content = "Content paragraph 1 paragraph 2 this is some text and this is some more";

Any ideas how to do this? Thanks.

Answer 1

You can use the innerText property (instead of innerHTML, which returns the HTML tags as well):

    var content = document.getElementsByTagName("body")[0].innerText;

However, note that this will include new lines as well, so if you want exactly what you specified in your question, you will need to remove them.
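To illustrate the clean-up step, here is a minimal sketch of how the innerText result could be reduced to a single line of words. The helper name `normalizeText` is made up for this example, and the punctuation pattern only treats ASCII letters, digits and whitespace as "keep", so accented or non-Latin text would need a different character class.

```javascript
// Hypothetical helper (not part of the answer above): turns innerText-style
// output into one line of space-separated words.
function normalizeText(text) {
  return text
    .replace(/[^\w\s]|_/g, ' ') // punctuation becomes a space (ASCII word characters are kept)
    .replace(/\s+/g, ' ')       // newlines, tabs and repeated spaces collapse to one space
    .trim();
}

// In a browser: var content = normalizeText(document.body.innerText);
// The sample string stands in for innerText so the sketch also runs outside a browser.
var sample = 'Content:\n\nparagraph 1\n\nparagraph 2\n\nThis is some text\n....and some more';
var content = normalizeText(sample);
```

Replacing punctuation with a space rather than deleting it keeps words apart in cases like "text....and".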

Answer 2

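Some browsers support the W3C DOM 3 Core textContent property, others support the MS/HTML5 innerText property (some support both). The content of the script element is probably unwanted, so a recursive traversal of the relevant part of the DOM tree seems best: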

    // Get the text within an element.
    // Doesn't do any normalising, returns a string
    // of text as found.
    function getTextRecursive(element) {
      var text = [];
      var el, els = element.childNodes;

      for (var i = 0, iLen = els.length; i < iLen; i++) {
        el = els[i];

        // May need to add other node types here
        // Exclude script element content
        if (el.nodeType == 1 && el.tagName && el.tagName.toLowerCase() != 'script') {
          // Recurse by name; arguments.callee throws in strict mode
          text.push(getTextRecursive(el));

        // If working with XML, add nodeType 4 to get text from CDATA nodes
        } else if (el.nodeType == 3) {

          // Deal with extra whitespace and returns in text here.
          text.push(el.data);
        }
      }
      return text.join('');
    }
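The traversal above leaves whitespace handling open. One possible variant, sketched below, joins the collected pieces with a space and collapses whitespace at the end. The names `collectText`, `el` and `txt` are made up for this example, skipping style elements is an extra assumption, and the plain-object tree only mimics the DOM properties the function reads (nodeType, tagName, childNodes, data) so it can be tried outside a browser.

```javascript
// Hypothetical variant of the recursive traversal: joins pieces with spaces
// and skips <script> and <style> content.
function collectText(node) {
  var parts = [];
  var children = node.childNodes || [];
  for (var i = 0; i < children.length; i++) {
    var child = children[i];
    if (child.nodeType === 1) {            // element node
      var tag = String(child.tagName).toLowerCase();
      if (tag !== 'script' && tag !== 'style') {
        parts.push(collectText(child));
      }
    } else if (child.nodeType === 3) {     // text node
      parts.push(child.data);
    }
  }
  // A space between pieces keeps words from adjacent elements apart
  return parts.join(' ').replace(/\s+/g, ' ').trim();
}

// Plain-object stand-ins for DOM nodes (illustration only)
function el(tagName, childNodes) { return { nodeType: 1, tagName: tagName, childNodes: childNodes }; }
function txt(data) { return { nodeType: 3, data: data }; }

var body = el('BODY', [
  el('H1', [txt('Content:')]),
  el('P', [txt('paragraph 1')]),
  el('SCRIPT', [txt('alert("blah blah blah");')]),
  txt('\n  This is some text'),
  el('BR', []),
  txt('\n  ....and some more\n')
]);
var result = collectText(body);
```

In a browser the call would be `collectText(document.body)`. Note that joining with a space also splits words that span inline elements, such as `<b>H</b>ello`, so this is a trade-off rather than a complete solution.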

Answer 3

You will need a striptags function in JavaScript and a regex to replace consecutive new lines with a single space.
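As a rough sketch of that idea, the two steps might look like the following. Both `stripTags` and `collapseWhitespace` are hypothetical names, and the tag pattern naively assumes that everything between "<" and ">" is a tag, which fails for attribute values containing ">" and does nothing about script content.

```javascript
// Hypothetical, naive tag stripper: anything between "<" and ">" becomes a space.
function stripTags(html) {
  return html.replace(/<[^>]*>/g, ' ');
}

// Collapses consecutive new lines (and any other whitespace) into a single space.
function collapseWhitespace(text) {
  return text.replace(/\s+/g, ' ').trim();
}

var html = '<h1>Content:</h1>\n<p>paragraph 1</p>\n\n<p>paragraph 2</p>';
var text = collapseWhitespace(stripTags(html));
```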

Answer 4

You can try using the replace statement below:

    var str = "..your HTML..";
    // Forward slashes must be escaped inside a regex literal
    var content = str.replace(/<\/?[a-zA-Z0-9]+>|<[a-zA-Z0-9]+\s*\/>|\r?\n/g, " ");

For the HTML you provided above, this will give you the following string in content:

   Content:   paragraph 1   paragraph 2    alert("blah blah blah");   This is some text  ....and some more
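The output above still contains the script body, which the question wants to leave out. One way to deal with that, sketched here under the assumption that the script tags are well formed, is to drop whole script blocks before running the replace. `removeScripts` is a hypothetical helper, and a regex cannot handle every HTML edge case.

```javascript
// Hypothetical helper: removes whole <script>...</script> blocks, body included.
function removeScripts(html) {
  return html.replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, ' ');
}

var str = '<p>paragraph 2</p>\n<script> alert("blah blah blah"); </script>\nThis is some text<br />';
var content = removeScripts(str)
  .replace(/<\/?[a-zA-Z0-9]+>|<[a-zA-Z0-9]+\s*\/>|\r?\n/g, ' ');
```

The result still has runs of spaces where tags and new lines used to be, so a final `replace(/\s+/g, ' ')` is needed if single spaces matter.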
