Extract only the text content from a web page

Date: 2015-09-28 14:49:55

Tags: javascript jquery html

I need to extract all the text content from a web page. I used 'document.body.textContent', but it also returns the JavaScript source. How can I make sure I get only the readable text?



function myFunction() {
  // textContent also returns the text inside <script> elements
  var str = document.body.textContent;
  alert(str);
}

<html>
<head>
  <title>Test Page for Text extraction</title>
  <script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js"></script>
</head>

<body>
  <p>I hope this works</p>
  <p>Test on this content to change the 5th word to a link</p>
  <button onclick="myFunction()">Try it</button>
</body>
</html>

1 answer:

Answer 0 (score: 3)

Before reading body.textContent, just remove the tags whose content you don't want.

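function myFunction() {
  // Remove every <script> element inside <body> so its source is not part of the text
  var bodyScripts = document.querySelectorAll("body script");
  for (var i = 0; i < bodyScripts.length; i++) {
    bodyScripts[i].remove();
  }
  var str = document.body.textContent;
  // Note: this replaces the whole page with the extracted text
  document.body.innerHTML = '<pre>' + str + '</pre>';
}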

<html>
<head>
  <title>Test Page for Text extraction</title>
  <script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js"></script>
</head>

<body>
  <p>I hope this works</p>
  <p>Test on this content to change the 5th word to a link</p>
  <button onclick="myFunction()">Try it</button>
</body>
</html>
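
The snippet above deletes the script elements from the live page. If you would rather leave the page untouched, one possible alternative is to walk the DOM tree yourself and skip the elements you don't want. This is only a rough sketch, not part of the original answer: the names collectReadableText and SKIPPED_TAGS are made up for illustration, and the list of skipped tags is an assumption you may need to adjust.

```javascript
// Hypothetical list of tags whose text is not meant to be read by a person.
var SKIPPED_TAGS = { SCRIPT: true, STYLE: true, NOSCRIPT: true, TEMPLATE: true };

// Hypothetical helper: returns the concatenated text of `node` and its
// descendants, ignoring anything inside the skipped tags. It only reads
// nodeType, nodeName, nodeValue and childNodes, so the page is not modified.
function collectReadableText(node) {
  // nodeType 3 is a text node: return its text as-is
  if (node.nodeType === 3) {
    return node.nodeValue;
  }
  // Anything that is not an element (nodeType 1), such as a comment,
  // and any skipped element contributes no text
  if (node.nodeType !== 1 || SKIPPED_TAGS[node.nodeName.toUpperCase()]) {
    return "";
  }
  var text = "";
  for (var i = 0; i < node.childNodes.length; i++) {
    text += collectReadableText(node.childNodes[i]);
  }
  return text;
}
```

In the test page, myFunction could then call alert(collectReadableText(document.body)) instead of reading document.body.textContent directly. Like textContent, this does not take CSS into account, so text hidden with display: none would still be included.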