Customizing HTML img tags with Java regular expressions

Time: 2013-09-24 22:07:14

Tags: java html regex

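I'm new to regular expressions, but I believe they are the way to solve my problem. I'm trying to take an arbitrary snippet of HTML and customize its image tags. For example,

if I have this HTML: <><><><><img src="blah.jpg"><><><><><><><><img src="blah2.jpg"><><><>

I want to turn it into: <><><><><img src="images/blah.jpg"><><><><><><><><img src="images/blah2.jpg"><><><>

My current code is: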

Pattern p = Pattern.compile("<img.*src=\".*\\..*\"");
Matcher m = p.matcher(htmlString);
boolean b = m.find();

String imgPath = "src=\"images/";

while(b)
{
    //Get file name.
    String name="test.jpg\"";

    //Assign new path.
    m.group().replaceAll("src=\".*\"",imgPath+name);
}

4 answers:

Answer 0 (score: 8)

Regular expressions are not the correct way to parse HTML. Don't do it. It cannot be done correctly.

Use a proper parser.

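import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Parse the HTML into a DOM, then rewrite the src attribute of every <img> element.
Document doc = Jsoup.parse(someHtml);
Elements imgs = doc.select("img");
for (Element img : imgs) {
    img.attr("src", "images/" + img.attr("src")); // or whatever
}

String result = doc.outerHtml(); // returns the modified HTML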

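Answer 1 (score: 3)

This code is almost perfect. It prints a lot of information, so look for "Final Result" and "Original" in the output to compare the customized IMG tags with the input. There is one small flaw that I'm still not sure how to fix. "in10" is the variable holding the test input string; the others are regular expressions.

I noticed problems when the input contains line breaks, and when "src=" is left blank instead of being written as "src=\"\"" or "src=''". The quotes seem to affect the result.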

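import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImgSrcRewrite {

    // Group 1: everything from "<img" up to the src attribute. Group 2: the src attribute itself.
    private static String r16 = "(?s)(<img.*?)(src\\s*?=\\s*?(?:\"|').*?(?:\"|'))";
    // Test input, including empty and mismatched-quote src values.
    private static String in10 = "<><><><><img width=1 height=888 src=\"bnm.jpg\"<><><><><img src=\"\"> <img src = \"\"><img src ='folder1/folder2/bnm.jpg'><><><img src =\"'>";
    // Splits the src attribute on '/' and '=' so that the last piece is the file name.
    private static String r14 = "(?s)\\/|\\=";

    public static void main(String[] args) {
        String path = "images/";
        String name = "";

        Pattern p = Pattern.compile(r16);
        Matcher m = p.matcher(in10);

        StringBuffer sb = new StringBuffer();
        int i = 1;
        while (m.find()) {
            String g0 = m.group();
            String g2 = m.group(2);
            System.out.println("Main group" + i + ":" + g0);
            System.out.println("Inner group1:" + m.group(1));
            System.out.println("Inner group2:" + g2);

            String[] names = g2.split(r14);
            // The printNames(names) helper was not shown in the post; print the array directly.
            System.out.println("Names:" + Arrays.toString(names));

            /*
             * src="/folder1/folder2/blah.jpg"  --->  blah.jpg
             * src="bnm.jpg"                    --->  bnm.jpg
             */
            if (names.length >= 1) {
                name = names[names.length - 1];
            } else {
                name = "";
            }
            // Name might be an empty string.
            name = name.replaceAll("\"|'", "");
            System.out.println("Retrieved Name:" + name);
            // quoteReplacement keeps '$' and '\' in the path or name from being read as group references.
            m.appendReplacement(sb, "$1src=\"" + Matcher.quoteReplacement(path + name) + "\"");
            i++;
        }
        m.appendTail(sb);
        String result = sb.toString();
        System.out.println("Final Result:" + result);
        System.out.println("Original____:" + in10);
        // groupCount() is the number of capturing groups in the pattern, not the number of matches.
        System.out.println("Count:" + m.groupCount());
    }
}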
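The file-name extraction above can also be written more compactly. What follows is only a sketch under assumptions that are not part of the answer: the class name `SrcFileNameRewrite` and the method `rewrite` are made up for illustration, the attribute value gets its own capturing group so no `split` is needed, an empty `src` is left as it is, and a tag whose quotes do not match is skipped. It has not been tried against real-world HTML.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SrcFileNameRewrite {

    // Group 1: "<img ... src=" up to the opening quote. Group 2: the quote character.
    // Group 3: the attribute value. [^>] keeps the match inside a single tag,
    // and \s also matches line breaks around the '='.
    private static final Pattern IMG_SRC = Pattern.compile(
        "(<img\\b[^>]*?\\bsrc\\s*=\\s*)([\"'])([^>]*?)\\2",
        Pattern.CASE_INSENSITIVE);

    static String rewrite(String html, String path) {
        Matcher m = IMG_SRC.matcher(html);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String value = m.group(3);
            // Keep only the last path segment of the original value.
            String name = value.substring(value.lastIndexOf('/') + 1);
            // Assumption: an empty src stays empty instead of becoming just the path.
            String newValue = value.isEmpty() ? "" : path + name;
            String replacement = m.group(1) + m.group(2) + newValue + m.group(2);
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String in = "<img width=1 src=\"bnm.jpg\"><img src=''>"
            + "<img src =\n'folder1/folder2/bnm.jpg'><img src =\"'>";
        System.out.println(rewrite(in, "images/"));
    }
}
```

Requiring the closing quote to be the same character as the opening one (the `\2` backreference) is what makes the mismatched `src ="'` tag fall through unchanged.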

Answer 2 (score: 0)

You shouldn't use a regex for this. The approach josh3736 describes is robust. But if you do want to use a regex, you can use:

String s = "<><><><><img src=\"blah.jpg\"><><><><><><><><img src=\"blah2.jpg\"><><><>";
s = s.replaceAll("(?<=img src=\")([^\"]+)(?=\">)","images/$1");
System.out.println(s);

Output:

<><><><><img src="images/blah.jpg"><><><><><><><><img src="images/blah2.jpg"><><><>
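Before relying on this expression it helps to see which tags it leaves alone. The small check below (the class name `LookaroundLimits` and the method `prefix` are hypothetical) runs the same `replaceAll` call on a few variations. The lookbehind requires the literal text `img src="` directly before the value and the lookahead requires `">` directly after it, so extra attributes or single quotes mean that nothing is replaced.

```java
class LookaroundLimits {

    // Same expression and replacement as in the answer above.
    static String prefix(String html) {
        return html.replaceAll("(?<=img src=\")([^\"]+)(?=\">)", "images/$1");
    }

    public static void main(String[] args) {
        // Exactly the shape the expression expects: the value is rewritten.
        System.out.println(prefix("<img src=\"blah.jpg\">"));
        // An attribute before src: the lookbehind fails, the tag is unchanged.
        System.out.println(prefix("<img class=\"x\" src=\"blah.jpg\">"));
        // An attribute after src: the lookahead fails, the tag is unchanged.
        System.out.println(prefix("<img src=\"blah.jpg\" alt=\"\">"));
        // Single quotes: neither lookaround matches, the tag is unchanged.
        System.out.println(prefix("<img src='blah.jpg'>"));
    }
}
```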

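Answer 3 (score: 0)

Although I agree with the others that a regex is the wrong way to modify an HTML fragment, here is a JUnit test case that shows how to replace the src attribute using Pattern in Java: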

import static org.junit.Assert.*;
import static org.hamcrest.CoreMatchers.*;

import java.util.regex.Pattern;
import java.util.regex.Matcher;

import org.junit.Test;

public class ImgSrcReplace {

  @Test
  public void replaceWithRegex() {
    String dir = "image/";
    String htmlFragment = "<body>\n"+
    "<img src=\"single-line.jpg\">"+
    "<img src=\n"+
    "\"multiline.jpg\">\n"+
    "<img src='single-quote.jpg'><img src=\"broken.gif\'>"+
    "<img class=\"before\" src=\"class-before.jpg\">"+
    "<img src=\"class-after.gif\" class=\"after\">"+
    "</body>";


    Pattern replaceImgSrc =
      Pattern.compile(
        "(<img\\b[^>]*\\bsrc\\s*=\\s*)([\"\'])((?:(?!\\2)[^>])*)\\2(\\s*[^>]*>)",
        Pattern.CASE_INSENSITIVE | Pattern.MULTILINE); // flags are combined with bitwise OR; '&' of these two yields 0, i.e. no flags

    String result = 
      replaceImgSrc.matcher(htmlFragment)
        .replaceAll("$1$2"+Matcher.quoteReplacement(dir)+"$3$2$4");

    assertThat("the single line image tag was updated", result, 
      containsString("image/single-line.jpg"));
    assertThat("the multiline image tag was updated", result, 
      containsString("image/multiline.jpg"));
    assertThat("the single quote image tag was updated", result, 
      containsString("image/single-quote.jpg"));
    assertThat("the broken gif was ignored.", result, 
      containsString("\"broken.gif'"));
    assertThat("attributes before are preserved.", result, 
      containsString("<img class=\"before\" src=\"image/class-before.jpg\">"));
    assertThat("attributes after are preserved.", result, 
      containsString("<img src=\"image/class-after.gif\" class=\"after\">"));
  }

}
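For readers without JUnit on the classpath, the same pattern can be tried from a plain `main` method. This is a sketch only: the class name `ImgSrcPrefix` and the method `addPrefix` are made up here. It passes only `Pattern.CASE_INSENSITIVE`, because the pattern contains no `^` or `$` anchors and `MULTILINE` changes nothing but the meaning of those two.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgSrcPrefix {

    // Group 1: "<img ... src=", group 2: the opening quote, group 3: the value,
    // group 4: the rest of the tag. (?!\2) stops the value at the matching quote.
    private static final Pattern IMG_SRC = Pattern.compile(
        "(<img\\b[^>]*\\bsrc\\s*=\\s*)([\"'])((?:(?!\\2)[^>])*)\\2(\\s*[^>]*>)",
        Pattern.CASE_INSENSITIVE);

    static String addPrefix(String html, String dir) {
        // quoteReplacement protects any '$' or '\' in dir from being read as a group reference.
        return IMG_SRC.matcher(html)
            .replaceAll("$1$2" + Matcher.quoteReplacement(dir) + "$3$2$4");
    }

    public static void main(String[] args) {
        // Upper-case tag and attribute names match because of CASE_INSENSITIVE.
        System.out.println(addPrefix("<IMG SRC=\"a.jpg\">", "image/"));
        // Mismatched quotes: the tag is left untouched.
        System.out.println(addPrefix("<img src=\"b.gif'>", "image/"));
        // Attributes after src are carried over by group 4.
        System.out.println(addPrefix("<img src='c.png' class=\"x\">", "image/"));
    }
}
```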