Method does not replace all tags

Time: 2015-01-22 09:42:07

Tags: java regex

I have a class with two methods: one generates regular expressions that match blacklisted HTML tags, and the other scans a given input string for those tags:

private Pattern pattern;

private List<Pattern> generatePatterns(String[] blacklist)
{
    List<Pattern> deleteList = new ArrayList<Pattern>();
    for (String s : blacklist)
    {
        pattern = Pattern.compile("(?i)<((\\s|/)|(\\s,/))*?" + s + ".*?>");
        deleteList.add(pattern);
    }

    return deleteList;
}

public String cleanHTML(String unsafe, String[] blacklist)
{
    try
    {
        List<Pattern> gp = generatePatterns(blacklist);
        BufferedReader br = new BufferedReader(new StringReader(unsafe));
        String s;
        StringBuilder builder = new StringBuilder();

        while ((s = br.readLine()) != null)
        {
            builder.append(s);
        }

        for (Pattern p : gp)
        {
            Matcher mat = p.matcher(builder.toString());
            if(mat.find()){
                builder.replace(mat.start(), mat.end(), "");
            }

        }
        return builder.toString();
    } catch (Exception e)
    {
        ...
    }


}

I tested it with this input:

String[] blacklist = new String[]
    { "img", "a", "script", "svg", "style", "link", "meta", "noscript", "code", "span", "div", "iframe", "object", "video", "source", "map", "area",
            "form", "onclick", "button" };
    String unsafe = "<p class='p1'>paragraph</p><img></img><Img><Script><Svg><a href><style><link><meta><noscript><code>"
            + "<span><div><iframe><object><video><audio><source><map><area><form><onclick><button>"
            + "< no html > <A href='#'>Link</A> <![CDATA[<sender>John Doe</sender>]]><a link=''>other link</a>";

But the output is:

<p class='p1'>paragraph</p></img><Img><audio>< no html > <A href='#'>Link</A> <![CDATA[<sender>John Doe</sender>]]><a link=''>otherlink</a>

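So it matches most of the blacklisted tags, but not all of them. For some reason it only replaces 1 of the 3 <a> tags. I'm fairly sure it has something to do with my regex: even though it used to work flawlessly, it no longer matches closing tags (for example </img>), and for some reason some tags, such as <Img>, are not replaced at all.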


If I change the method like this:

public String cleanHTML(String unsafe,String[] taglist){

        List<Pattern> gp = generatePatterns(taglist);

        for (Pattern p : gp)
        {
            Matcher mat = p.matcher(unsafe);
            unsafe = ((mat.find()) ? mat.replaceAll("") : unsafe);          

        }
        return unsafe;

}

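it works. Could this have something to do with the buffered string? The first version did replace most of the tags, after all. It's driving me crazy.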
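The visible difference between the two versions seems to be less about buffering and more about how often each pattern is applied: the first calls find() once per pattern and removes only that one match, while the second calls replaceAll(). Below is a minimal sketch of that difference, using the same pattern shape that generatePatterns builds for the blacklist entry "a". The class name ReplaceFirstVsAll, the two helper methods, and the shortened input string are made up for illustration and are not part of the original class.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReplaceFirstVsAll {

    // Mirrors the first version: a single find() per pattern,
    // so at most one occurrence is removed.
    public static String removeFirstMatch(String input, Pattern p) {
        StringBuilder builder = new StringBuilder(input);
        Matcher mat = p.matcher(builder.toString());
        if (mat.find()) {
            builder.replace(mat.start(), mat.end(), "");
        }
        return builder.toString();
    }

    // Mirrors the second version: replaceAll() removes every occurrence.
    public static String removeAllMatches(String input, Pattern p) {
        return p.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // Same pattern shape as in generatePatterns, for the entry "a".
        Pattern p = Pattern.compile("(?i)<((\\s|/)|(\\s,/))*?" + "a" + ".*?>");
        // Hypothetical shortened input with three <a> tags and one <audio> tag.
        String html = "<a href>x</a><audio><A href='#'>Link</A>";

        System.out.println(removeFirstMatch(html, p));
        System.out.println(removeAllMatches(html, p));
    }
}
```

With this input, the first call removes only the leading <a href>, leaving the closing tags, <audio>, and the second link in place. The second call leaves just "xLink". Note that it also removes <audio>, which is not on the blacklist: the pattern for "a" is followed by ".*?>", so it matches any tag whose name merely starts with "a".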

1 Answer:

Answer 0: (score: 1)

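Never use regex to parse HTML. HTML structure can get very complex, and it is not easy to write a regex that handles HTML perfectly. I suggest using an HTML parser library such as jsoup.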

You can remove tags like this:

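import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

Document document = Jsoup.parse(html);
document.select("img").unwrap(); // removes every <img> tag but keeps any child nodes
document.select("p, a, img").unwrap(); // several tags at once via a comma-separated selector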