Question

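I'm parsing HTML from a website and I'm almost done. I have the part of the text I need from the site, but occasionally some links are included in the HTML, and I want to get rid of them. I'm thinking of using the fact that all the elements I don't want start with '<' and, of course, end with '>'. Is there any way to do this? Here is what I have so far.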

for(int i = 0; i<desc.length();i++)
    {
        if(desc.charAt(i)==('<')){

        }
    }

desc is the string I want to trim.

Answer 1

I would try something like this:

StringBuilder sb = new StringBuilder();
boolean open = false;
for (char c : desc.toCharArray()) { // iterate over the characters
  if (c == '<') { // a less-than sign opens a tag
    open = true;
  } else if (open && c == '>') { // a greater-than sign closes the open tag
    open = false;
  } else if (!open) { // outside a tag: keep the character
    sb.append(c);
  }
}
System.out.println(sb.toString()); // print the text with the tags removed

Answer 2

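Manually parsing markup languages such as XML and HTML is generally considered a bad idea. However, if you just want to remove all the elements, I can see where a simple script might be useful.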

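I think it's worth mentioning that if you remove all of the HTML elements, you may end up with several pieces of text stuck together. Take a look at this code and see if it helps.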

public class RemoveHtmlElements {

    public static void main(String[] args) {

        String html = "<!DOCTYPE html><html><body><h1>My First Heading</h1>"
                + "<p>My first paragraph.</p></body></html>";

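        // Repeatedly cut out the first "<...>" span until none remain.
        int open = html.indexOf("<");
        while (open != -1) {
            int closed = html.indexOf(">", open);
            if (closed == -1) {
                break; // unmatched "<": stop rather than loop forever
            }
            // Insert a space so neighbouring text is not run together.
            html = html.substring(0, open) + " " + html.substring(closed + 1);
            open = html.indexOf("<");
        }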

        System.out.println(html);

    }

}

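This should strip anything enclosed in angle brackets from the HTML. It inserts a space in place of each removed element so that text is not accidentally run together.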
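
For comparison, the same idea (replace each tag with a space) can probably be written more compactly with a regular expression. This is only a minimal sketch, not a real HTML parser: the class name TagStripper and the method stripTags are made up for illustration, and it assumes that no tag contains a '>' inside an attribute value, comment, or script. Any stray '<' ... '>' pair in ordinary text would also be treated as a tag.

```java
import java.util.regex.Pattern;

// Hypothetical helper, named only for this example.
final class TagStripper {

    // "<", then any run of characters that are not ">", then ">".
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // One or more whitespace characters.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Replaces each tag with a space, then collapses the resulting
    // whitespace runs into single spaces and trims both ends.
    static String stripTags(String html) {
        String withoutTags = TAG.matcher(html).replaceAll(" ");
        return WHITESPACE.matcher(withoutTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<!DOCTYPE html><html><body><h1>My First Heading</h1>"
                + "<p>My first paragraph.</p></body></html>";
        System.out.println(stripTags(html)); // My First Heading My first paragraph.
    }
}
```

Collapsing the whitespace afterwards avoids the runs of spaces that appear when several tags sit next to each other. A '<' with no matching '>' is left untouched, since the pattern requires a closing bracket.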

Removing segments from a String
