Regex to strip HTML tags

Date: 2010-11-02 07:42:35

Tags: java html regex

I have this HTML input:

<font size="5"><p>some text</p>
<p> another text</p></font>

I want to use a regular expression to remove the HTML tags, so that the output is:

some text
another text

Can anyone suggest how to do this with a regular expression?

5 Answers:

Answer 0: (score: 37)

Since you asked, here's a quick and dirty solution:

String stripped = input.replaceAll("<[^>]*>", "");

Using regular expressions to process HTML is a pretty bad idea, though. The hack above won't handle things like:
  • <tag attribute=">">Hello</tag>
  • <script>if (a < b) alert('Hello>');</script>

A better approach is to use an HTML parser such as Jsoup. To remove all tags from a string, you can call Jsoup.parse(html).text().
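
To make the two failure cases above concrete, here is a minimal sketch that runs the quick pattern on them. The class name RegexStripPitfalls and the helper strip() are made up for illustration; only java.util.regex from the standard library is used.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of any library.
final class RegexStripPitfalls {

    // The quick-and-dirty pattern: '<', any run of non-'>' characters, then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    static String strip(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // The question's input: works, leaving the text and the line break.
        System.out.println(strip("<font size=\"5\"><p>some text</p>\n<p> another text</p></font>"));

        // A '>' inside an attribute value ends the match too early.
        // prints: ">Hello
        System.out.println(strip("<tag attribute=\">\">Hello</tag>"));

        // A '<' inside script code is mistaken for the start of a tag.
        // prints: if (a ');
        System.out.println(strip("<script>if (a < b) alert('Hello>');</script>"));
    }
}
```

In both broken cases the regex has no notion of quoting or of element content, which is exactly what a real parser adds.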

Answer 1: (score: 9)

Use an HTML parser. Here's a Jsoup example.

String input = "<font size=\"5\"><p>some text</p>\n<p>another text</p></font>";
String stripped = Jsoup.parse(input).text();
System.out.println(stripped);

Result:

some text another text

Or if you want to preserve the newlines:

String input = "<font size=\"5\"><p>some text</p>\n<p>another text</p></font>";
for (String line : input.split("\n")) {
    String stripped = Jsoup.parse(line).text();
    System.out.println(stripped);
}

Result:

some text
another text

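Jsoup offers more advantages as well. You can easily extract specific parts of the HTML document using the select() method, which accepts jQuery-like CSS selectors. It only requires the document to be semantically well-formed. The presence of the <font> tag, deprecated since 1998, is not a good sign in that respect, but if you know the HTML structure in detail beforehand, it is still doable.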

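Answer 2: (score: 4)

You can use an HTML parser called Jericho HTML Parser.

You can download it from here - http://jericho.htmlparser.net/docs/index.html

Jericho HTML Parser is a Java library allowing analysis and manipulation of parts of an HTML document, including server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It also provides high-level HTML form manipulation functions.

The presence of badly formatted HTML does not interfere with the parsing.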

Answer 3: (score: 3)

Starting from aioobe's code, I tried something more daring:

String input = "<font size=\"5\"><p>some text</p>\n<p>another text</p></font>";
String stripped = input.replaceAll("</?(font|p){1}.*?/?>", "");
System.out.println(stripped);

The code to strip every HTML tag would look like this:

public class HtmlSanitizer {

    private static String pattern;

    private final static String [] tagsTab = {"!doctype","a","abbr","acronym","address","applet","area","article","aside","audio","b","base","basefont","bdi","bdo","bgsound","big","blink","blockquote","body","br","button","canvas","caption","center","cite","code","col","colgroup","content","data","datalist","dd","decorator","del","details","dfn","dir","div","dl","dt","element","em","embed","fieldset","figcaption","figure","font","footer","form","frame","frameset","h1","h2","h3","h4","h5","h6","head","header","hgroup","hr","html","i","iframe","img","input","ins","isindex","kbd","keygen","label","legend","li","link","listing","main","map","mark","marquee","menu","menuitem","meta","meter","nav","nobr","noframes","noscript","object","ol","optgroup","option","output","p","param","plaintext","pre","progress","q","rp","rt","ruby","s","samp","script","section","select","shadow","small","source","spacer","span","strike","strong","style","sub","summary","sup","table","tbody","td","template","textarea","tfoot","th","thead","time","title","tr","track","tt","u","ul","var","video","wbr","xmp"};

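    static {
        // Build one alternation "a|A|abbr|ABBR|..." from the tag list.
        // Only all-lowercase and all-uppercase spellings are covered.
        StringBuffer tags = new StringBuffer();
        for (int i=0;i<tagsTab.length;i++) {
            tags.append(tagsTab[i].toLowerCase()).append('|').append(tagsTab[i].toUpperCase());
            if (i<tagsTab.length-1) {
                tags.append('|');
            }
        }
        // Optional '/', a listed tag name, then lazily everything up to the next '>'.
        pattern = "</?("+tags.toString()+"){1}.*?/?>";
    }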

    public static String sanitize(String input) {
        return input.replaceAll(pattern, "");
    }

    public final static void main(String[] args) {
        System.out.println(HtmlSanitizer.pattern);

        System.out.println(HtmlSanitizer.sanitize("<font size=\"5\"><p>some text</p><br/> <p>another text</p></font>"));
    }

}

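I wrote this to be Java 1.4 compliant, for some sad reasons, so feel free to use for-each and StringBuilder...

Advantages:

  • You can generate the list of tags you want to strip, which means you can keep the ones you want
  • You avoid stripping content that isn't an HTML tag
  • You keep the whitespace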
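
As a rough illustration of the first two points, the sketch below compares the two-tag pattern from the start of this answer with the quick pattern from answer 0. The class name SelectiveStripDemo and its helper methods are hypothetical, chosen only for this example.

```java
// Hypothetical demo class, not part of any library.
final class SelectiveStripDemo {

    // Pattern from the first snippet of this answer: strips only <font> and <p>.
    static final String LISTED = "</?(font|p){1}.*?/?>";

    // Quick pattern from answer 0: strips anything between '<' and '>'.
    static final String ANY = "<[^>]*>";

    static String stripListed(String s) {
        return s.replaceAll(LISTED, "");
    }

    static String stripAny(String s) {
        return s.replaceAll(ANY, "");
    }

    public static void main(String[] args) {
        // <b> is not in the list, so it survives.
        // prints: keep <b>bold</b> text
        System.out.println(stripListed("<p>keep <b>bold</b> text</p>"));

        // Text that merely contains '<' and '>' is left alone.
        // prints: 1 < 2 and 3 > 1
        System.out.println(stripListed("1 < 2 and 3 > 1"));

        // The quick pattern eats "< 2 and 3 >", leaving "1", two spaces, "1".
        System.out.println(stripAny("1 < 2 and 3 > 1"));
    }
}
```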

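Drawbacks:

  • You have to list every HTML tag you want to strip from the string. That can be a lot, for example if you want to strip everything.

If you see any other drawbacks, I'd be really glad to hear about them.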
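
One further drawback that seems worth pointing out: the tag name is matched as a prefix, so listing p also strips <pre> or <param>, and mixed-case spellings such as <Font> slip through. The sketch below shows this and one possible tightening, a word boundary after the name plus the case-insensitive flag. The class name TagBoundaryDemo and the BOUNDED pattern are my own assumptions, not part of the code above, and the attribute and script problems described in answer 0 still apply.

```java
// Hypothetical demo class, not part of any library.
final class TagBoundaryDemo {

    // Pattern style used in this answer: the tag name is only a prefix.
    static final String PREFIX = "</?(font|p){1}.*?/?>";

    // Assumed alternative: \b forces the name to end there, (?i) ignores case.
    static final String BOUNDED = "(?i)</?(?:font|p)\\b[^>]*>";

    static String stripPrefix(String s) {
        return s.replaceAll(PREFIX, "");
    }

    static String stripBounded(String s) {
        return s.replaceAll(BOUNDED, "");
    }

    public static void main(String[] args) {
        String html = "<p>intro</p><pre>x = 1</pre>";

        // "p" also matches the start of "pre", so <pre> is stripped as well.
        // prints: introx = 1
        System.out.println(stripPrefix(html));

        // With the word boundary, only <p> is stripped.
        // prints: intro<pre>x = 1</pre>
        System.out.println(stripBounded(html));

        // Mixed case: untouched by the original style, stripped by (?i).
        System.out.println(stripPrefix("<Font>a</Font>"));  // prints: <Font>a</Font>
        System.out.println(stripBounded("<Font>a</Font>")); // prints: a
    }
}
```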

Answer 4: (score: 2)

If you are using Jericho, then you just have to use something like this:

public String extractAllText(String htmlText){
    Source source = new Source(htmlText);
    return source.getTextExtractor().toString();
}

Of course, you can do the same with an Element as well:

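// Assumption: links holds the <a> elements of a Source like the one created above
List<Element> links = source.getAllElements(HTMLElementName.A);
for (Element link : links) {
  System.out.println(link.getTextExtractor().toString());
}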