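I'm new to regular expressions, but I believe they are the way to my solution. I'm trying to take an arbitrary snippet of HTML and customize the image tags. For example,
if I have this HTML: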
<><><><><img src="blah.jpg"><><><><><><><><img src="blah2.jpg"><><><>
I want to turn it into:
<><><><><img src="images/blah.jpg"><><><><><><><><img src="images/blah2.jpg"><><><>
My current code is:
Pattern p = Pattern.compile("<img.*src=\".*\\..*\"");
Matcher m = p.matcher(htmlString);
boolean b = m.find();
String imgPath = "src=\"images/";
while(b)
{
    //Get file name.
    String name="test.jpg\"";
    //Assign new path.
    m.group().replaceAll("src=\".*\"",imgPath+name);
}
Answer 0 (score: 8)
Regular expressions are not the correct way to parse HTML. Don't do it. It is impossible to get right. Use an HTML parser such as jsoup instead:
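import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

Document doc = Jsoup.parse(someHtml);
Elements imgs = doc.select("img");
for (Element img : imgs) {
    img.attr("src", "images/" + img.attr("src")); // or whatever
}
doc.outerHtml(); // returns the modified HTML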
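To make one failure mode concrete, here is a small sketch that runs the pattern from the question against the sample input from the question. Because `.*` is greedy, the whole span from the first `<img` to the last closing quote comes back as a single match, so the two tags cannot be handled separately. The class and method names (`GreedyMatchDemo`, `matches`) are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the regex itself comes from the question.
class GreedyMatchDemo {
    // The pattern from the question, unchanged.
    private static final Pattern QUESTION_PATTERN =
            Pattern.compile("<img.*src=\".*\\..*\"");

    // Collects every match the pattern finds in the given HTML.
    static List<String> matches(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = QUESTION_PATTERN.matcher(html);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<><><><><img src=\"blah.jpg\"><><><><><><><><img src=\"blah2.jpg\"><><><>";
        List<String> found = matches(html);
        // The greedy ".*" runs from the first "<img" to the last quote,
        // so both tags end up inside one match.
        System.out.println(found.size() + " match(es)");
        for (String s : found) {
            System.out.println(s);
        }
    }
}
```

Making the quantifiers lazy helps with this particular input, but it does not deal with single quotes, extra attributes, or line breaks, which is why a parser is the safer choice.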
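Answer 1 (score: 3)
This code is almost perfect. It prints a lot of information, so look for "Final Result" and "Original" in the output to see the result of customizing the IMG tags. There is one small flaw that I'm still not sure how to fix. "in10" is the variable holding the test input string. The rest are regular expressions.
I noticed a problem when the input contains line breaks, and when "src=" is left empty instead of being written as src="" or src=''. The quotes seem to affect the result.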
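import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Wrapper class and main method added so the snippet compiles; the class name is a placeholder.
public class ImgSrcRewrite {
    // Group 1: "<img" up to the src attribute. Group 2: the src attribute with its quoted value.
    private static String r16 = "(?s)(<img.*?)(src\\s*?=\\s*?(?:\"|').*?(?:\"|'))";
    // Test input, including empty and mismatched-quote src attributes.
    private static String in10 = "<><><><><img width=1 height=888 src=\"bnm.jpg\"<><><><><img src=\"\"> <img src = \"\"><img src ='folder1/folder2/bnm.jpg'><><><img src =\"'>";
    // Splits the src attribute on "/" and "=" so that the last piece is the file name.
    private static String r14 = "(?s)\\/|\\=";

    public static void main(String[] args) {
        String path = "images/";
        String name = "";
        Pattern p = Pattern.compile(r16);
        Matcher m = p.matcher(in10);
        StringBuffer sb = new StringBuffer();
        int i = 1;
        while (m.find()) {
            String g0 = m.group();
            String g2 = m.group(2);
            System.out.println("Main group" + i + ":" + g0);
            System.out.println("Inner group1:" + m.group(1));
            System.out.println("Inner group2:" + g2);
            String[] names = g2.split(r14);
            System.out.println(Arrays.toString(names));
            /*
             * src="/folder1/folder2/blah.jpg" ---> blah.jpg
             * src="bnm.jpg" ---> bnm.jpg
             */
            if (names.length >= 1) {
                name = names[names.length - 1];
            } else {
                name = "";
            }
            // Name might be empty string.
            name = name.replaceAll("\"|'", "");
            System.out.println("Retrieved Name:" + name);
            // quoteReplacement keeps "$" or "\" in the file name from being read as group references.
            m.appendReplacement(sb, "$1src=\"" + Matcher.quoteReplacement(path + name) + "\"");
            i++;
        }
        m.appendTail(sb);
        String result = sb.toString();
        System.out.println("Final Result:" + result);
        System.out.println("Original____:" + in10);
        System.out.println("Count:" + m.groupCount());
    }
}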
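One possible way to tighten the quote handling, sketched under a few assumptions: the opening quote is captured and a backreference requires the same closing quote, and the value may not contain quotes or `>`, so a mismatched tag such as `<img src ="'>` is left alone rather than half-matched. This is a variation on the idea above, not the original code. The names `SrcFileNameRewriter` and `rewrite` are invented for the example, and the file name is taken with `lastIndexOf('/')` instead of the split regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; a sketch, not a full HTML-aware solution.
class SrcFileNameRewriter {
    // Group 1: "<img ... src=", group 2: the opening quote,
    // group 3: the attribute value; \2 demands the same closing quote.
    private static final Pattern IMG_SRC = Pattern.compile(
            "(<img\\b[^>]*?\\bsrc\\s*=\\s*)([\"'])([^\"'>]*)\\2");

    // Replaces each matched src value with path + file name.
    static String rewrite(String html, String path) {
        Matcher m = IMG_SRC.matcher(html);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String value = m.group(3);
            // Keep only what follows the last "/" (the whole value if there is none).
            String name = value.substring(value.lastIndexOf('/') + 1);
            String replacement = m.group(1) + "\"" + path + name + "\"";
            // quoteReplacement stops "$" and "\" from being treated as group references.
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<img src='folder1/folder2/bnm.jpg'><img width=1 src=\"a.png\"><img src =\"'>";
        System.out.println(rewrite(html, "images/"));
    }
}
```

An empty `src=""` still becomes `src="images/"` here, the same as in the code above, so that case would need its own check if it should be skipped.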
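Answer 2 (score: 0)
You shouldn't use regular expressions for this. The approach josh3736 describes is robust. But if you do want to use a regex, you could use: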
String s = "<><><><><img src=\"blah.jpg\"><><><><><><><><img src=\"blah2.jpg\"><><><>";
s = s.replaceAll("(?<=img src=\")([^\"]+)(?=\">)","images/$1");
System.out.println(s);
Output:
<><><><><img src="images/blah.jpg"><><><><><><><><img src="images/blah2.jpg"><><><>
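It is worth being clear about what this pattern assumes. The lookbehind needs the literal text `img src="` directly before the value, and the lookahead needs `">` directly after it. The sketch below checks a few variations. The class name `LookaroundLimits` and the sample tags are made up for the illustration.

```java
// Hypothetical demo class showing which tags the lookaround pattern touches.
class LookaroundLimits {
    // Same replaceAll call as above, wrapped in a method.
    static String prefix(String html) {
        return html.replaceAll("(?<=img src=\")([^\"]+)(?=\">)", "images/$1");
    }

    public static void main(String[] args) {
        // Exactly the expected shape: rewritten.
        System.out.println(prefix("<img src=\"blah.jpg\">"));
        // Attribute before src: the lookbehind no longer matches, tag is unchanged.
        System.out.println(prefix("<img class=\"x\" src=\"blah.jpg\">"));
        // Single quotes: unchanged.
        System.out.println(prefix("<img src='blah.jpg'>"));
        // Attribute after src: the lookahead no longer matches, tag is unchanged.
        System.out.println(prefix("<img src=\"blah.jpg\" alt=\"\">"));
    }
}
```

So the one-liner is fine for input that is known to look exactly like the sample, and silently skips everything else.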
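Answer 3 (score: 0)
While I agree with the others that regular expressions are the wrong way to modify an HTML fragment, here is a JUnit test case that shows how to replace the src attribute with a Pattern in Java: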
import static org.junit.Assert.*;
import static org.hamcrest.CoreMatchers.*;
import java.util.regex.Pattern;
import java.util.regex.Matcher;
import org.junit.Test;
public class ImgSrcReplace {
    @Test
    public void replaceWithRegex() {
        String dir = "image/";
        String htmlFragment = "<body>\n"+
            "<img src=\"single-line.jpg\">"+
            "<img src=\n"+
            "\"multiline.jpg\">\n"+
            "<img src='single-quote.jpg'><img src=\"broken.gif\'>"+
            "<img class=\"before\" src=\"class-before.jpg\">"+
            "<img src=\"class-after.gif\" class=\"after\">"+
            "</body>";
        // Groups: 1 = "<img ... src=", 2 = opening quote, 3 = src value, 4 = rest of the tag.
        // Flags are combined with "|"; "&" would yield 0 and silently drop both flags.
        Pattern replaceImgSrc =
            Pattern.compile(
                "(<img\\b[^>]*\\bsrc\\s*=\\s*)([\"\'])((?:(?!\\2)[^>])*)\\2(\\s*[^>]*>)",
                Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
        String result =
            replaceImgSrc.matcher(htmlFragment)
                .replaceAll("$1$2"+Matcher.quoteReplacement(dir)+"$3$2$4");
        assertThat("the single line image tag was updated", result,
            containsString("image/single-line.jpg"));
        assertThat("the multiline image tag was updated", result,
            containsString("image/multiline.jpg"));
        assertThat("the single quote image tag was updated", result,
            containsString("image/single-quote.jpg"));
        assertThat("the broken gif was ignored.", result,
            containsString("\"broken.gif'"));
        assertThat("attributes before are preserved.", result,
            containsString("<img class=\"before\" src=\"image/class-before.jpg\">"));
        assertThat("attributes after are preserved.", result,
            containsString("<img src=\"image/class-after.gif\" class=\"after\">"));
    }
}