Question

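I'm trying to find a way, using JavaScript or jQuery, to write a function that removes all the HTML tags from a page and gives me just the plain text of that page.

How can this be done? Any ideas?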

Answer 1

IE & WebKit:

document.body.innerText

Others:

document.body.textContent

(Suggested by Amr ElGarhy)

Most JS frameworks implement a cross-browser way to do this. It is usually implemented somewhat like this:

text = document.body.textContent || document.body.innerText;

It seems that WebKit keeps some formatting with textContent, whereas innerText strips everything.
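
A minimal sketch of that fallback as a reusable helper. The name `getPlainText` is hypothetical and not taken from any framework. It checks the property type rather than truthiness, so an element whose text is an empty string does not fall through to `undefined`:

```javascript
// Hypothetical helper: returns the text of an element-like object,
// preferring textContent and falling back to innerText.
function getPlainText(el) {
    if (typeof el.textContent === 'string') {
        return el.textContent;
    }
    if (typeof el.innerText === 'string') {
        return el.innerText;
    }
    return ''; // neither property is available
}

// In a browser: var text = getPlainText(document.body);
```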

Answer 2

It depends on how much formatting you want to keep. But with jQuery you can do it like this:

jQuery(document.body).text();
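
The string that comes back generally still carries the whitespace from the markup, so if you want less formatting it may need tidying. A rough sketch, assuming single spaces everywhere are acceptable; `collapseWhitespace` is a name made up for this example:

```javascript
// Hypothetical helper: squeeze every run of whitespace (spaces, tabs,
// newlines) into a single space and trim both ends.
function collapseWhitespace(text) {
    return text.replace(/\s+/g, ' ').trim();
}

// With jQuery loaded: collapseWhitespace(jQuery(document.body).text());
```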

Answer 3

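The only trouble with textContent or innerText is that they can jam the text from adjacent nodes together, without any whitespace between them.

If that matters, you can recurse through the body or another container, return the text in an array, and join the pieces with spaces or newlines.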

// Recursively collect the non-blank text nodes under `hoo` into an array.
document.deepText= function(hoo){
    var A= [], tx;
    if(hoo){
        hoo= hoo.firstChild;
        while(hoo!= null){
            if(hoo.nodeType== 3){   // 3 = text node
                tx= hoo.data || '';
                if(/\S/.test(tx)) A[A.length]= tx;   // skip whitespace-only nodes
            }
            else A= A.concat(document.deepText(hoo));   // descend into child nodes
            hoo= hoo.nextSibling;
        }
    }
    return A;
};
alert(document.deepText(document.body).join(' '));
// or, inside a function: return document.deepText(document.body).join('\n')
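
One thing to watch for: text nodes inside `<script>` or `<style>` elements in the container get collected too. A possible variant that skips them, written as a sketch. `visibleText` and the `SKIP` list are names made up here, and the code relies only on `nodeType`, `nodeName`, `data`, `firstChild` and `nextSibling`:

```javascript
// Element names whose text should not end up in the output (assumed list).
var SKIP = { SCRIPT: true, STYLE: true, NOSCRIPT: true };

// Hypothetical variant of deepText: collects non-blank text nodes under
// `node`, ignoring the contents of the elements named in SKIP.
function visibleText(node) {
    var out = [];
    for (var child = node.firstChild; child != null; child = child.nextSibling) {
        if (child.nodeType === 3) {                       // 3 = text node
            if (/\S/.test(child.data || '')) out.push(child.data);
        } else if (child.nodeType === 1 &&                // 1 = element
                   !SKIP[String(child.nodeName).toUpperCase()]) {
            out = out.concat(visibleText(child));
        }
    }
    return out;
}

// In a browser: visibleText(document.body).join(' ');
```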

Answer 4

I had to convert the rich text in an HTML email to plain text. The following worked for me in IE (obj is a jQuery object):

function getTextFromHTML(obj) {
    var ni = document.createNodeIterator(obj[0], NodeFilter.SHOW_TEXT, null, false);
    var nodeLine = ni.nextNode();   // go to first node of our NodeIterator
    var plainText = "";

    while (nodeLine) {
        plainText += nodeLine.nodeValue + "\n";
        nodeLine = ni.nextNode();
    }

    return plainText;
}
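
The loop above also appends whitespace-only text nodes, which tends to leave a lot of blank lines in the result. A sketch of one way to filter them out, assuming trimmed lines are acceptable. `collectText` is a hypothetical name, and it accepts any object with a `nextNode()` method, such as the NodeIterator created above:

```javascript
// Hypothetical helper: drains an iterator of text nodes and returns their
// trimmed, non-empty values joined by newlines.
function collectText(iterator) {
    var lines = [];
    var node;
    while ((node = iterator.nextNode())) {
        var value = (node.nodeValue || '').replace(/^\s+|\s+$/g, '');
        if (value !== '') lines.push(value);
    }
    return lines.join('\n');
}

// In a browser:
// collectText(document.createNodeIterator(obj[0], NodeFilter.SHOW_TEXT));
```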

Answer 5

Use htmlClean.

Answer 6

I would use:

<script language="javascript" type="text/javascript" src="http://code.jquery.com/jquery-1.4.2.js"></script>
<script type="text/javascript">
    jQuery.fn.stripTags = function() { return this.replaceWith( this.html().replace(/<\/?[^>]+>/gi, '') ); };
    jQuery('head').stripTags();

    $(document).ready(function() {
        $("img").each(function() {
            jQuery(this).remove();
        });
    });
</script>

This will not release any styles, but it will strip out all the tags.

Is that what you wanted?

[EDIT] Now edited to include removal of images. [/EDIT]
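
The regular expression used in `stripTags` can also be tried on a plain string, outside jQuery. A minimal sketch (the function name `stripTagsFromString` is made up). Note that it removes only the tags themselves, so the text inside `<script>` or `<style>` elements stays, and a `>` inside an attribute value would confuse it:

```javascript
// Hypothetical helper: removes anything that looks like an opening or
// closing tag from an HTML string, leaving the text between tags untouched.
function stripTagsFromString(html) {
    return html.replace(/<\/?[^>]+>/gi, '');
}

// Example: stripTagsFromString('<p>Hello <b>world</b></p>');
```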
