Question

I have a lab assignment about removing HTML tags. Here is my method for removing them:

public String getFilteredPageContents() {
    String str = getUnfilteredPageContents();
    String temp = "";
    boolean b = false;
    for(int i = 0; i<str.length(); i++) {
        if(str.charAt(i) == '&' || str.charAt(i) == '<') {
            b = true;
        }
        if(b == false) {
            temp += str.charAt(i);
        }
        if(str.charAt(i) == '>' || str.charAt(i) == ';') {
            b = false;
        }
    }
    return temp;
}

This is my text in its original form:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<html>

<head>
<meta http-equiv="Content-Type"
content="text/html; charset=iso-8859-1">
<meta name="GENERATOR" content="Microsoft FrontPage 2.0">
<title>A Shropshire Lad</title>
</head>

<body bgcolor="#008000" text="#FFFFFF" topmargin="10"
leftmargin="20">

<p align="center"><font size="6"><strong></strong></font>&nbsp;</p>
<div align="center"><center>

<pre><font size="7"><strong>A Shropshire Lad
</strong></font><strong>
by A.E. Housman
Published by Dover 1990</strong></pre>
</center></div>

<p><strong>This collection of sixty three poems appeared in 1896.
Many of them make references to Shrewsbury and Shropshire,
however, Housman was not a native of the county. The Shropshire
of his book is a mindscape in which he blends old ballad meters,
classical reminiscences and intense emotional experiences
&quot;recollected in tranquility.&quot; Although they are not
particularly to my taste, their style, simplicity and
timelessness are obvious even to me. Below are two short poems
which amused me, I hope you find them interesting too.</strong></p>

<hr size="8" width="80%" color="#FFFFFF">
<div align="left">

<pre><font size="5"><strong><u>
XIII</u></strong></font><font size="4"><strong>

When I was one-and-twenty
I heard a wise man say,
'Give crowns and pounds and guineas
But not your heart away;</strong></font></pre>
</div><div align="left">

<pre><font size="4"><strong>Give pearls away and rubies
But keep your fancy free.
But I was one-and-twenty,
No use to talk to me.</strong></font></pre>
</div><div align="left">

<pre><font size="4"><strong>When I was one-and-twenty
I heard him say again,
'The heart out of the bosom
Was never given in vain;
'Tis paid with sighs a plenty
And sold for endless rue'
And I am two-and-twenty,
And oh, 'tis true 'tis true.

</strong></font><strong></strong></pre>
</div>

<hr size="8" width="80%" color="#FFFFFF">

<pre><font size="5"><strong><u>LVI . The Day of Battle</u></strong></font><font
size="4"><strong>

'Far I hear the bugle blow
To call me where I would not go,
And the guns begin the song,
&quot;Soldier, fly or stay for long.&quot;</strong></font></pre>

<pre><font size="4"><strong>'Comrade, if to turn and fly
Made a soldier never die,
Fly I would, for who would not?
'Tis sure no pleasure to be shot.</strong></font></pre>

<pre><font size="4"><strong>'But since the man that runs away
Lives to die another day,
And cowards' funerals, when they come,
Are not wept so well at home,</strong></font></pre>

<pre><font size="4"><strong>'Therefore, though the best is bad,
Stand and do the best, my lad;
Stand and fight and see your slain,
And take the bullet in your brain.'</strong></font></pre>

<hr size="8" width="80%" color="#FFFFFF">
</body>
</html>

When I run my method on this text, I get:

 charset=iso-8859-1">

A Shropshire Lad







A Shropshire Lad

by A.E. Housman
Published by Dover 1990


This collection of sixty three poems appeared in 1896.
Many of them make references to Shrewsbury and Shropshire,
however, Housman was not a native of the county. The Shropshire
of his book is a mindscape in which he blends old ballad meters,
classical reminiscences and intense emotional experiences
recollected in tranquility. Although they are not
particularly to my taste, their style, simplicity and
timelessness are obvious even to me. Below are two short poems
which amused me, I hope you find them interesting too.
.
.
.

My question is: how do I get rid of the little piece of code, charset=iso-8859-1">, at the very beginning of the text? I can't get rid of it. Thanks...

Answer 1

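I can see that your intent is to remove things that look like <xxx> and &xxx;. You are using the variable b to remember whether you are currently skipping something.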

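Have you noticed that your algorithm also skips things of the form <xxx; and &xxx>? That is, & or < starts a skip and > or ; ends it, but nothing requires a < to be matched with a > or an & with a ;. So how could you change the code to remember which character started the skip?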

Another complication is that &xxx; content can be embedded inside <xxx> content, for example: <p title="&">

By the way, temp += str.charAt(i); will make your program very slow when the string is long, because each += builds a new string. Use a StringBuilder instead.
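
As a rough illustration of that last point, here is the loop from the question with only the accumulation changed to a StringBuilder. The class name BuilderDemo and the String parameter are assumptions made so the snippet stands alone; the filtering logic, including the bug being asked about, is deliberately left untouched.

```java
class BuilderDemo {
    // Same skipping logic as the question's getFilteredPageContents(),
    // but characters are appended to a StringBuilder instead of using +=.
    static String filter(String str) {
        StringBuilder temp = new StringBuilder();
        boolean skipping = false;
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            if (c == '&' || c == '<') {
                skipping = true;
            }
            if (!skipping) {
                temp.append(c); // no new String object per character
            }
            if (c == '>' || c == ';') {
                skipping = false;
            }
        }
        return temp.toString();
    }
}
```

The output is the same as before; only the cost of building it changes.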

Here is some code that solves your problem, or nearly does:

import java.util.Stack;

public String getFilteredPageContents() {
    String str = getUnfilteredPageContents();
    StringBuilder temp = new StringBuilder();

    // The closing character for each thing that we're inside
    Stack<Character> expectedClosing = new Stack<Character>();

    for(int i = 0; i<str.length(); i++) {
        char c = str.charAt(i);
        if(c == '<')
            expectedClosing.push('>');
        else if(c == '&')
            expectedClosing.push(';');

        // Is the current character going to close something?
        else if(!expectedClosing.empty() && c == expectedClosing.peek())
            expectedClosing.pop();

        else {
            // Only add to output if not currently inside something
            if(expectedClosing.empty())
                temp.append(c);
        }
    }
    return temp.toString();
}
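
One case behind the "or nearly" caveat seems to be the <p title="&"> example: the & inside the tag pushes a ; that never arrives, so the stack never empties and the rest of the text is dropped. A possible variant, sketched here under the assumption that nothing needs to nest, remembers only the single closing character currently awaited. The names TagStripper and strip are made up for this illustration.

```java
class TagStripper {
    static String strip(String str) {
        StringBuilder out = new StringBuilder();
        char expected = 0; // 0 means "not skipping anything"
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            if (expected != 0) {
                // Inside <...> or &...; : only the matching closer ends the skip
                if (c == expected) {
                    expected = 0;
                }
            } else if (c == '<') {
                expected = '>';
            } else if (c == '&') {
                expected = ';';
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints "Tom  Jerry" (two spaces, because &amp; is dropped, not decoded)
        System.out.println(strip("<p title=\"&\">Tom &amp; Jerry</p>"));
    }
}
```

This is still not a real HTML parser: a bare & in ordinary text swallows everything up to the next ;, a > inside an attribute value ends the tag early, and entities are removed rather than translated.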

Answer 2

I realize this is a school assignment, but is there any chance you could use a proper HTML parser, such as this one, to do the job?

Answer 3

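The most elegant way to solve this is probably to use regular expressions. With them you can search specifically for the tag structures and remove them from the output.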
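
A minimal sketch of what that could look like with java.util.regex follows. The pattern is only one guess at "tag structures" (anything from < to the next >, plus &xxx; entities), and the class and method names are invented for the example.

```java
import java.util.regex.Pattern;

class RegexStripper {
    // <[^>]*>      : a tag, from '<' up to the next '>' (may span lines)
    // &[^;\s]*;    : an entity such as &quot; or &nbsp;
    private static final Pattern TAG_OR_ENTITY =
            Pattern.compile("<[^>]*>|&[^;\\s]*;");

    static String strip(String html) {
        return TAG_OR_ENTITY.matcher(html).replaceAll("");
    }
}
```

Because the tag alternative only stops at >, a ; inside an attribute such as content="text/html; charset=iso-8859-1" no longer ends the match early. A pattern like this can still be fooled by a > inside an attribute value, by comments, or by script content.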

However, since you have already written a program and it works fine apart from the problem you mention, a quick & dirty solution may be enough.

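One thing I can think of is to apply a filter-like pass that scans the text output line by line and removes the leftovers if they are present. For example, read each line and check whether its last character is >. If it is, delete the line or replace it with an empty string. In normal text there should not be any > at the end of a sentence, so you should not run into much trouble there.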
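
That line-by-line pass might be sketched as below. It assumes the already filtered output is available as one string with \n line breaks, and LineFilter and dropLinesEndingWithBracket are illustrative names only.

```java
import java.util.ArrayList;
import java.util.List;

class LineFilter {
    // Drops every line whose last non-blank character is '>'.
    static String dropLinesEndingWithBracket(String text) {
        List<String> kept = new ArrayList<>();
        for (String line : text.split("\n")) {
            if (!line.trim().endsWith(">")) {
                kept.add(line);
            }
        }
        return String.join("\n", kept);
    }
}
```

Applied to the output shown in the question, this would remove the leading charset=iso-8859-1"> line, but it would also remove any genuine line of text that happens to end in >.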

Removing leftover HTML tags